Commit Graph

13992 Commits

Author SHA1 Message Date
Andrew Morton
e91455217d mm/util.c: fix strndup_user() comment
The kerneldoc misdescribes strndup_user()'s return value.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Timur Tabi <timur@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05 16:02:31 -10:00
Greg Thelen
0b3d6e6f2d mm: writeback: use exact memcg dirty counts
Since commit a983b5ebee ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:

 1) per-memcg per-cpu values in range of [-32..32]

 2) per-memcg atomic counter

When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.  Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.

Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg

Considering that dirty+writeback are used together for some decisions the
errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
  	cpu_set_t affinity;

  	CPU_ZERO(&affinity);
  	CPU_SET(cpu, &affinity);
  	if (sched_setaffinity(0, sizeof(affinity), &affinity))
  		err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
  	int i, wrote;

  	set_affinity(cpu);
  	for (i = 0; i < 32; i++) {
  		for (wrote = 0; wrote < bufSize; ) {
  			int ret = write(output_fd, buf+wrote, bufSize-wrote);
  			if (ret == -1)
  				err(1, "write");
  			wrote += ret;
  		}
  	}
  }

  int main(int argc, char **argv)
  {
  	int cpu, flush_cpu = 1, output_fd;
  	const char *output;

  	if (argc != 2)
  		errx(1, "usage: output_file");

  	output = argv[1];
  	bufSize = getpagesize();
  	buf = malloc(getpagesize());
  	if (buf == NULL)
  		errx(1, "malloc failed");

  	output_fd = open(output, O_CREAT|O_RDWR);
  	if (output_fd == -1)
  		err(1, "open(%s)", output);

  	for (cpu = 0; cpu < get_nprocs(); cpu++) {
  		if (cpu != flush_cpu)
  			dirty_on(output_fd, cpu);
  	}

  	set_affinity(flush_cpu);
  	if (fsync(output_fd))
  		err(1, "fsync(%s)", output);
  	if (close(output_fd))
  		err(1, "close(%s)", output);
  	free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter? memcg already handles cpus going offline, so
no need for that overhead from percpu_counter.  And the percpu_counter
spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports.  But that is saved for later.

Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>	[4.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05 16:02:31 -10:00
Aneesh Kumar K.V
c6f3c5ee40 mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
With some architectures like ppc64, set_pmd_at() cannot cope with a
situation where there is already some (different) valid entry present.

Use pmdp_set_access_flags() instead to modify the pfn which is built to
deal with modifying existing PMD entries.

This is similar to commit cae85cb8ad ("mm/memory.c: fix modifying of
page protection by insert_pfn()")

We also do similar update w.r.t insert_pfn_pud eventhough ppc64 don't
support pud pfn entries now.

Without this patch we also see the below message in kernel log "BUG:
non-zero pgtables_bytes on freeing mm:"

Link: http://lkml.kernel.org/r/20190402115125.18803-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Chandan Rajendra <chandan@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05 16:02:31 -10:00
Catalin Marinas
298a32b132 kmemleak: powerpc: skip scanning holes in the .bss section
Commit 2d4f567103 ("KVM: PPC: Introduce kvm_tmp framework") adds
kvm_tmp[] into the .bss section and then free the rest of unused spaces
back to the page allocator.

kernel_init
  kvm_guest_init
    kvm_free_tmp
      free_reserved_area
        free_unref_page
          free_unref_page_prepare

With DEBUG_PAGEALLOC=y, it will unmap those pages from kernel.  As the
result, kmemleak scan will trigger a panic when it scans the .bss
section with unmapped pages.

This patch creates dedicated kmemleak objects for the .data, .bss and
potentially .data..ro_after_init sections to allow partial freeing via
the kmemleak_free_part() in the powerpc kvm_free_tmp() function.

Link: http://lkml.kernel.org/r/20190321171917.62049-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Qian Cai <cai@lca.pw>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-05 16:02:30 -10:00
Qian Cai
5b56d996dd mm/compaction.c: abort search if isolation fails
Running LTP oom01 in a tight loop or memory stress testing put the system
in a low-memory situation could triggers random memory corruption like
page flag corruption below due to in fast_isolate_freepages(), if
isolation fails, next_search_order() does not abort the search immediately
could lead to improper accesses.

UBSAN: Undefined behaviour in ./include/linux/mm.h:1195:50
index 7 is out of range for type 'zone [5]'
Call Trace:
 dump_stack+0x62/0x9a
 ubsan_epilogue+0xd/0x7f
 __ubsan_handle_out_of_bounds+0x14d/0x192
 __isolate_free_page+0x52c/0x600
 compaction_alloc+0x886/0x25f0
 unmap_and_move+0x37/0x1e70
 migrate_pages+0x2ca/0xb20
 compact_zone+0x19cb/0x3620
 kcompactd_do_work+0x2df/0x680
 kcompactd+0x1d8/0x6c0
 kthread+0x32c/0x3f0
 ret_from_fork+0x35/0x40
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:3124!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
RIP: 0010:__isolate_free_page+0x464/0x600
RSP: 0000:ffff888b9e1af848 EFLAGS: 00010007
RAX: 0000000030000000 RBX: ffff888c39fcf0f8 RCX: 0000000000000000
RDX: 1ffff111873f9e25 RSI: 0000000000000004 RDI: ffffed1173c35ef6
RBP: ffff888b9e1af898 R08: fffffbfff4fc2461 R09: fffffbfff4fc2460
R10: fffffbfff4fc2460 R11: ffffffffa7e12303 R12: 0000000000000008
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000007
FS:  0000000000000000(0000) GS:ffff888ba8e80000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc7abc00000 CR3: 0000000752416004 CR4: 00000000001606a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 compaction_alloc+0x886/0x25f0
 unmap_and_move+0x37/0x1e70
 migrate_pages+0x2ca/0xb20
 compact_zone+0x19cb/0x3620
 kcompactd_do_work+0x2df/0x680
 kcompactd+0x1d8/0x6c0
 kthread+0x32c/0x3f0
 ret_from_fork+0x35/0x40

Link: http://lkml.kernel.org/r/20190320192648.52499-1-cai@lca.pw
Fixes: dbe2d4e4f1 ("mm, compaction: round-robin the order while searching the free lists for a target")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
2019-04-04 11:56:15 +01:00
Mel Gorman
6b0868c820 mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints
Mikhail Gavrilo reported the following bug being triggered in a Fedora
kernel based on 5.1-rc1 but it is relevant to a vanilla kernel.

 kernel: page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 kernel: ------------[ cut here ]------------
 kernel: kernel BUG at include/linux/mm.h:1021!
 kernel: invalid opcode: 0000 [#1] SMP NOPTI
 kernel: CPU: 6 PID: 116 Comm: kswapd0 Tainted: G         C        5.1.0-0.rc1.git1.3.fc31.x86_64 #1
 kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1201 12/07/2018
 kernel: RIP: 0010:__reset_isolation_pfn+0x244/0x2b0
 kernel: Code: fe 06 e8 0f 8e fc ff 44 0f b6 4c 24 04 48 85 c0 0f 85 dc fe ff ff e9 68 fe ff ff 48 c7 c6 58 b7 2e 8c 4c 89 ff e8 0c 75 00 00 <0f> 0b 48 c7 c6 58 b7 2e 8c e8 fe 74 00 00 0f 0b 48 89 fa 41 b8 01
 kernel: RSP: 0018:ffff9e2d03f0fde8 EFLAGS: 00010246
 kernel: RAX: 0000000000000034 RBX: 000000000081f380 RCX: ffff8cffbddd6c20
 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8cffbddd6c20
 kernel: RBP: 0000000000000001 R08: 0000009898b94613 R09: 0000000000000000
 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000100000
 kernel: R13: 0000000000100000 R14: 0000000000000001 R15: ffffca7de07ce000
 kernel: FS:  0000000000000000(0000) GS:ffff8cffbdc00000(0000) knlGS:0000000000000000
 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: CR2: 00007fc1670e9000 CR3: 00000007f5276000 CR4: 00000000003406e0
 kernel: Call Trace:
 kernel:  __reset_isolation_suitable+0x62/0x120
 kernel:  reset_isolation_suitable+0x3b/0x40
 kernel:  kswapd+0x147/0x540
 kernel:  ? finish_wait+0x90/0x90
 kernel:  kthread+0x108/0x140
 kernel:  ? balance_pgdat+0x560/0x560
 kernel:  ? kthread_park+0x90/0x90
 kernel:  ret_from_fork+0x27/0x50

He bisected it down to e332f741a8 ("mm, compaction: be selective about
what pageblocks to clear skip hints").  The problem is that the patch in
question was sloppy with respect to the handling of zone boundaries.  In
some instances, it was possible for PFNs outside of a zone to be examined
and if those were not properly initialised or poisoned then it would
trigger the VM_BUG_ON.  This patch corrects the zone boundary issues when
resetting pageblock skip hints and Mikhail reported that the bug did not
trigger after 30 hours of testing.

Link: http://lkml.kernel.org/r/20190327085424.GL3189@techsingularity.net
Fixes: e332f741a8 ("mm, compaction: be selective about what pageblocks to clear skip hints")
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
2019-04-04 11:56:01 +01:00
Peter Zijlstra
57b78a62e7 x86/uaccess, kasan: Fix KASAN vs SMAP
KASAN inserts extra code for every LOAD/STORE emitted by te compiler.
Much of this code is simple and safe to run with AC=1, however the
kasan_report() function, called on error, is most certainly not safe
to call with AC=1.

Therefore wrap kasan_report() in user_access_{save,restore}; which for
x86 SMAP, saves/restores EFLAGS and clears AC before calling the real
function.

Also ensure all the functions are without __fentry__ hook. The
function tracer is also not safe.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 11:02:24 +02:00
Peter Zijlstra
0a8caf211b asm-generic/tlb: Remove tlb_table_flush()
There are no external users of this API (nor should there be); remove it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:33:02 +02:00
Peter Zijlstra
fa0aafb8ac asm-generic/tlb: Remove tlb_flush_mmu_free()
As the comment notes; it is a potentially dangerous operation. Just
use tlb_flush_mmu(), that will skip the (double) TLB invalidate if
it really isn't needed anyway.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:33:01 +02:00
Peter Zijlstra
b3fa8ed4e4 asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER
Since all architectures are now using it, it is redundant.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:59 +02:00
Peter Zijlstra
1808d65b55 asm-generic/tlb: Remove arch_tlb*_mmu()
Now that all architectures are converted to the generic code, remove
the arch hooks.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:58 +02:00
Martin Schwidefsky
952a31c9e6 asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y
Add the Kconfig option HAVE_MMU_GATHER_NO_GATHER to the generic
mmu_gather code. If the option is set the mmu_gather will not
track individual pages for delayed page free anymore. A platform
that enables the option needs to provide its own implementation
of the __tlb_remove_page_size() function to free pages.

No change in behavior intended.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: linux@armlinux.org.uk
Cc: npiggin@gmail.com
Link: http://lkml.kernel.org/r/20180918125151.31744-2-schwidefsky@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:55 +02:00
Peter Zijlstra
96bc9567cb asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:47 +02:00
Peter Zijlstra
ed6a79352c asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
Move the mmu_gather::page_size things into the generic code instead of
PowerPC specific bits.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:40 +02:00
Lars Persson
d2b2c6dd22 mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate
Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL
and SIGSEGV that could not be traced back to a userspace code bug.  They
had all the magic signs of an I/D cache coherency issue.

Now recently we noticed that the /proc/sys/vm/compact_memory interface
was quite efficient at provoking this class of userspace crashes.

Studying the code in mm/migrate.c there is a distinction made between
migrating a page that is mapped at the instant of migration and one that
is not mapped.  Our problem turned out to be the non-mapped pages.

For the non-mapped page the code performs a copy of the page content and
all relevant meta-data of the page without doing the required D-cache
maintenance.  This leaves dirty data in the D-cache of the CPU and on
the 1004K cores this data is not visible to the I-cache.  A subsequent
page-fault that triggers a mapping of the page will happily serve the
process with potentially stale code.

What about ARM then, this bug should have seen greater exposure? Well
ARM became immune to this flaw back in 2010, see commit c01778001a
("ARM: 6379/1: Assume new page cache pages have dirty D-cache").

My proposed fix moves the D-cache maintenance inside move_to_new_page to
make it common for both cases.

Link: http://lkml.kernel.org/r/20190315083502.11849-1-larper@axis.com
Fixes: 97ee052461 ("flush cache before installing new page at migraton")
Signed-off-by: Lars Persson <larper@axis.com>
Reviewed-by: Paul Burton <paul.burton@mips.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Qian Cai
f5777bc2d9 mm/page_isolation.c: fix a wrong flag in set_migratetype_isolate()
Due to has_unmovable_pages() taking an incorrect irqsave flag instead of
the isolation flag in set_migratetype_isolate(), there are issues with
HWPOSION and error reporting where dump_page() is not called when there
is an unmovable page.

Link: http://lkml.kernel.org/r/20190320204941.53731-1-cai@lca.pw
Fixes: d381c54760 ("mm: only report isolation failures when offlining memory")
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>	[5.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Qian Cai
c4efe484b5 mm/memory_hotplug.c: fix notification in offline error path
When start_isolate_page_range() returned -EBUSY in __offline_pages(), it
calls memory_notify(MEM_CANCEL_OFFLINE, &arg) with an uninitialized
"arg".  As the result, it triggers warnings below.  Also, it is only
necessary to notify MEM_CANCEL_OFFLINE after MEM_GOING_OFFLINE.

  page:ffffea0001200000 count:1 mapcount:0 mapping:0000000000000000
  index:0x0
  flags: 0x3fffe000001000(reserved)
  raw: 003fffe000001000 ffffea0001200008 ffffea0001200008 0000000000000000
  raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
  page dumped because: unmovable page
  WARNING: CPU: 25 PID: 1665 at mm/kasan/common.c:665
  kasan_mem_notifier+0x34/0x23b
  CPU: 25 PID: 1665 Comm: bash Tainted: G        W         5.0.0+ #94
  Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20
  10/25/2017
  RIP: 0010:kasan_mem_notifier+0x34/0x23b
  RSP: 0018:ffff8883ec737890 EFLAGS: 00010206
  RAX: 0000000000000246 RBX: ff10f0f4435f1000 RCX: f887a7a21af88000
  RDX: dffffc0000000000 RSI: 0000000000000020 RDI: ffff8881f221af88
  RBP: ffff8883ec737898 R08: ffff888000000000 R09: ffffffffb0bddcd0
  R10: ffffed103e857088 R11: ffff8881f42b8443 R12: dffffc0000000000
  R13: 00000000fffffff9 R14: dffffc0000000000 R15: 0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000560fbd31d730 CR3: 00000004049c6003 CR4: 00000000001606a0
  Call Trace:
   notifier_call_chain+0xbf/0x130
   __blocking_notifier_call_chain+0x76/0xc0
   blocking_notifier_call_chain+0x16/0x20
   memory_notify+0x1b/0x20
   __offline_pages+0x3e2/0x1210
   offline_pages+0x11/0x20
   memory_block_action+0x144/0x300
   memory_subsys_offline+0xe5/0x170
   device_offline+0x13f/0x1e0
   state_store+0xeb/0x110
   dev_attr_store+0x3f/0x70
   sysfs_kf_write+0x104/0x150
   kernfs_fop_write+0x25c/0x410
   __vfs_write+0x66/0x120
   vfs_write+0x15a/0x4f0
   ksys_write+0xd2/0x1b0
   __x64_sys_write+0x73/0xb0
   do_syscall_64+0xeb/0xb78
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f14f75cc3b8
  RSP: 002b:00007ffe84d01d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f14f75cc3b8
  RDX: 0000000000000008 RSI: 0000563f8e433d70 RDI: 0000000000000001
  RBP: 0000563f8e433d70 R08: 000000000000000a R09: 00007ffe84d018f0
  R10: 000000000000000a R11: 0000000000000246 R12: 00007f14f789e780
  R13: 0000000000000008 R14: 00007f14f7899740 R15: 0000000000000008

Link: http://lkml.kernel.org/r/20190320204255.53571-1-cai@lca.pw
Fixes: 7960509329 ("mm, memory_hotplug: print reason for the offlining failure")
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>	[5.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Oscar Salvador
5ae2efb1de mm/debug.c: fix __dump_page when mapping->host is not set
While debugging something, I added a dump_page() into do_swap_page(),
and I got the splat from below.  The issue happens when dereferencing
mapping->host in __dump_page():

  ...
  else if (mapping) {
	pr_warn("%ps ", mapping->a_ops);
	if (mapping->host->i_dentry.first) {
		struct dentry *dentry;
		dentry = container_of(mapping->host->i_dentry.first, struct dentry, d_u.d_alias);
		pr_warn("name:\"%pd\" ", dentry);
	}
  }
  ...

Swap address space does not contain an inode information, and so
mapping->host equals NULL.

Although the dump_page() call was added artificially into
do_swap_page(), I am not sure if we can hit this from any other path, so
it looks worth fixing it.  We can easily do that by checking
mapping->host first.

Link: http://lkml.kernel.org/r/20190318072931.29094-1-osalvador@suse.de
Fixes: 1c6fb1d89e ("mm: print more information about mapping in __dump_page")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Yang Shi
a7f40cfe3b mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
When MPOL_MF_STRICT was specified and an existing page was already on a
node that does not follow the policy, mbind() should return -EIO.  But
commit 6f4576e368 ("mempolicy: apply page table walker on
queue_pages_range()") broke the rule.

And commit c863379849 ("mm: mempolicy: mbind and migrate_pages support
thp migration") didn't return the correct value for THP mbind() too.

If MPOL_MF_STRICT is set, ignore vma_migratable() to make sure it
reaches queue_pages_to_pte_range() or queue_pages_pmd() to check if an
existing page was already on a node that does not follow the policy.
And, non-migratable vma may be used, return -EIO too if MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL was specified.

Tested with https://github.com/metan-ucw/ltp/blob/master/testcases/kernel/syscalls/mbind/mbind02.c

[akpm@linux-foundation.org: tweak code comment]
Link: http://lkml.kernel.org/r/1553020556-38583-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: 6f4576e368 ("mempolicy: apply page table walker on queue_pages_range()")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: Cyril Hrubis <chrubis@suse.cz>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Nicolas Boichat
6d6ea1e967 mm: add support for kmem caches in DMA32 zone
Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
v6.

This is a followup to the discussion in [1], [2].

IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.

For L1 tables that are bigger than a page, we can just use
__get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
use GFP_DMA).

For L2 tables that only take 1KB, it would be a waste to allocate a full
page, so we considered 3 approaches:
 1. This series, adding support for GFP_DMA32 slab caches.
 2. genalloc, which requires pre-allocating the maximum number of L2 page
    tables (4096, so 4MB of memory).
 3. page_frag, which is not very memory-efficient as it is unable to reuse
    freed fragments until the whole page is freed. [3]

This series is the most memory-efficient approach.

stable@ note:
  We confirmed that this is a regression, and IOMMU errors happen on 4.19
  and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
  most likely starts from commit ad67f5a654 ("arm64: replace ZONE_DMA
  with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
  platforms (and maybe others?).

[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
[2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
[3] https://patchwork.codeaurora.org/patch/671639/

This patch (of 3):

IOMMUs using ARMv7 short-descriptor format require page tables to be
allocated within the first 4GB of RAM, even on 64-bit systems.  On arm64,
this is done by passing GFP_DMA32 flag to memory allocation functions.

For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
a full page using get_free_pages, so we considered 3 approaches:
 1. This patch, adding support for GFP_DMA32 slab caches.
 2. genalloc, which requires pre-allocating the maximum number of L2
    page tables (4096, so 4MB of memory).
 3. page_frag, which is not very memory-efficient as it is unable
    to reuse freed fragments until the whole page is freed.

This change makes it possible to create a custom cache in DMA32 zone using
kmem_cache_create, then allocate memory using kmem_cache_alloc.

We do not create a DMA32 kmalloc cache array, as there are currently no
users of kmalloc(..., GFP_DMA32).  These calls will continue to trigger a
warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.

This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
unnecessary).

Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Sasha Levin <Alexander.Levin@microsoft.com>
Cc: Huaisheng Ye <yehs1@lenovo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yong Wu <yong.wu@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Tomasz Figa <tfiga@google.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Qian Cai
9b7ea46a82 mm/hotplug: fix offline undo_isolate_page_range()
Commit f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded
memory to zones until online") introduced move_pfn_range_to_zone() which
calls memmap_init_zone() during onlining a memory block.
memmap_init_zone() will reset pagetype flags and makes migrate type to
be MOVABLE.

However, in __offline_pages(), it also call undo_isolate_page_range()
after offline_isolated_pages() to do the same thing.  Due to commit
2ce13640b3 ("mm: __first_valid_page skip over offline pages") changed
__first_valid_page() to skip offline pages, undo_isolate_page_range()
here just waste CPU cycles looping around the offlining PFN range while
doing nothing, because __first_valid_page() will return NULL as
offline_isolated_pages() has already marked all memory sections within
the pfn range as offline via offline_mem_sections().

Also, after calling the "useless" undo_isolate_page_range() here, it
reaches the point of no returning by notifying MEM_OFFLINE.  Those pages
will be marked as MIGRATE_MOVABLE again once onlining.  The only thing
left to do is to decrease the number of isolated pageblocks zone counter
which would make some paths of the page allocation slower that the above
commit introduced.

Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
on ppc64, an "int" should still be enough to represent the number of
pageblocks there.  Fix an incorrect comment along the way.

[cai@lca.pw: v4]
  Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
Fixes: 2ce13640b3 ("mm: __first_valid_page skip over offline pages")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Qian Cai
44dc1b1fab mm/debug.c: add a cast to u64 for atomic64_read()
atomic64_read() on ppc64le returns "long int", so fix the same way as
commit d549f545e6 ("drm/virtio: use %llu format string form
atomic64_t") by adding a cast to u64, which makes it work on all arches.

    In file included from ./include/linux/printk.h:7,
                     from ./include/linux/kernel.h:15,
                     from mm/debug.c:9:
    mm/debug.c: In function 'dump_mm':
    ./include/linux/kern_levels.h:5:18: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 19 has type 'long int' [-Wformat=]
     #define KERN_SOH "A"  /* ASCII Start Of Header */
                      ^~~~~~
    ./include/linux/kern_levels.h:8:20: note: in expansion of macro
    'KERN_SOH'
     #define KERN_EMERG KERN_SOH "0" /* system is unusable */
                        ^~~~~~~~
    ./include/linux/printk.h:297:9: note: in expansion of macro 'KERN_EMERG'
      printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
             ^~~~~~~~~~
    mm/debug.c:133:2: note: in expansion of macro 'pr_emerg'
      pr_emerg("mm %px mmap %px seqnum %llu task_size %lu"
      ^~~~~~~~
    mm/debug.c:140:17: note: format string is defined here
       "pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx"
                  ~~~^
                  %lx

Link: http://lkml.kernel.org/r/20190310183051.87303-1-cai@lca.pw
Fixes: 70f8a3ca68 ("mm: make mm->pinned_vm an atomic64 counter")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Jan Kara
cae85cb8ad mm/memory.c: fix modifying of page protection by insert_pfn()
Aneesh has reported that PPC triggers the following warning when
excercising DAX code:

  IP set_pte_at+0x3c/0x190
  LR insert_pfn+0x208/0x280
  Call Trace:
     insert_pfn+0x68/0x280
     dax_iomap_pte_fault.isra.7+0x734/0xa40
     __xfs_filemap_fault+0x280/0x2d0
     do_wp_page+0x48c/0xa40
     __handle_mm_fault+0x8d0/0x1fd0
     handle_mm_fault+0x140/0x250
     __do_page_fault+0x300/0xd60
     handle_page_fault+0x18

Now that is WARN_ON in set_pte_at which is

        VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));

The problem is that on some architectures set_pte_at() cannot cope with
a situation where there is already some (different) valid entry present.

Use ptep_set_access_flags() instead to modify the pfn which is built to
deal with modifying existing PTE.

Link: http://lkml.kernel.org/r/20190311084537.16029-1-jack@suse.cz
Fixes: b2770da642 "mm: add vm_insert_mixed_mkwrite()"
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Chandan Rajendra <chandan@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Qian Cai
c412a769d2 kasan: fix variable 'tag' set but not used warning
set_tag() compiles away when CONFIG_KASAN_SW_TAGS=n, so make
arch_kasan_set_tag() a static inline function to fix warnings below.

  mm/kasan/common.c: In function '__kasan_kmalloc':
  mm/kasan/common.c:475:5: warning: variable 'tag' set but not used [-Wunused-but-set-variable]
    u8 tag;
       ^~~

Link: http://lkml.kernel.org/r/20190307185244.54648-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:36 -07:00
Matteo Croce
00206a69ee percpu: stop printing kernel addresses
Since commit ad67b74d24 ("printk: hash addresses printed with %p"),
at boot "____ptrval____" is printed instead of actual addresses:

    percpu: Embedded 38 pages/cpu @(____ptrval____) s124376 r0 d31272 u524288

Instead of changing the print to "%px", and leaking kernel addresses,
just remove the print completely, cfr. e.g. commit 071929dbdd
("arm64: Stop printing the virtual memory layout").

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-18 10:36:36 -07:00
Linus Torvalds
f67e3fb489 device-dax for 5.1
* Replace the /sys/class/dax device model with /sys/bus/dax, and include
   a compat driver so distributions can opt-in to the new ABI.
 
 * Allow for an alternative driver for the device-dax address-range
 
 * Introduce the 'kmem' driver to hotplug / assign a device-dax
   address-range to the core-mm.
 
 * Arrange for the device-dax target-node to be onlined so that the newly
   added memory range can be uniquely referenced by numa apis.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJchWpGAAoJEB7SkWpmfYgCJk8P/0Q1DINszUDO/vKjJ09cDs9P
 Jw3it6GBIL50rDOu9QdcprSpwYDD0h1mLAV/m6oa3bVO+p4uWGvnxaxRx2HN2c/v
 vhZFtUDpHlqR63vzWMNVKRprYixCRJDUr6xQhhCcE3ak/ELN6w7LWfikKVWv15UL
 MfR96IQU38f+xRda/zSXnL9606Dvkvu/inEHj84lRcHIwj3sQAUalrE8bR3O32gZ
 bDg/l5kzT49o8ZXUo/TegvRSSSZpJmOl2DD0RW+ax5q3NI2bOXFrVDUKBKxf/hcQ
 E/V9i57TrqQx0GqRhnU7rN/v53cFZGGs31TEEIB/xs3bzCnADxwXcjL5b5K005J6
 vJjBA2ODBewHFK3uVx46Hy1iV4eCtZWj4QrMnrjdSrjXOfbF5GTbWOhPFgoq7TWf
 S7VqFEf3I2gDPaMq4o8Ej1kLH4HMYeor2NSOZjyvGn87rSZ3ZIQguwbaNIVl+itz
 gdDt0ZOU0BgOBkV+rZIeZDaGdloWCHcDPL15CkZaOZyzdWhfEZ7dod6ad+9udilU
 EUPH62RgzXZtfm5zpebYyjNVLbb9pLZ0nT+UypyGR6zqWx1SqU3mXi63NFXPco+x
 XA9j//edPeI6NHg2CXLEh8DLuCg3dG1zWRJANkiF+niBwyCR8CHtGWAoY6soXbKe
 2UrXGcIfXxyJ8V9v8v4q
 =hfa3
 -----END PGP SIGNATURE-----

Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull device-dax updates from Dan Williams:
 "New device-dax infrastructure to allow persistent memory and other
  "reserved" / performance differentiated memories, to be assigned to
  the core-mm as "System RAM".

  Some users want to use persistent memory as additional volatile
  memory. They are willing to cope with potential performance
  differences, for example between DRAM and 3D Xpoint, and want to use
  typical Linux memory management apis rather than a userspace memory
  allocator layered over an mmap() of a dax file. The administration
  model is to decide how much Persistent Memory (pmem) to use as System
  RAM, create a device-dax-mode namespace of that size, and then assign
  it to the core-mm. The rationale for device-dax is that it is a
  generic memory-mapping driver that can be layered over any "special
  purpose" memory, not just pmem. On subsequent boots udev rules can be
  used to restore the memory assignment.

  One implication of using pmem as RAM is that mlock() no longer keeps
  data off persistent media. For this reason it is recommended to enable
  NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
  at rest. We considered making this recommendation an actively enforced
  requirement, but in the end decided to leave it as a distribution /
  administrator policy to allow for emulation and test environments that
  lack security capable NVDIMMs.

  Summary:

   - Replace the /sys/class/dax device model with /sys/bus/dax, and
     include a compat driver so distributions can opt-in to the new ABI.

   - Allow for an alternative driver for the device-dax address-range

   - Introduce the 'kmem' driver to hotplug / assign a device-dax
     address-range to the core-mm.

   - Arrange for the device-dax target-node to be onlined so that the
     newly added memory range can be uniquely referenced by numa apis"

NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.

And apparently the PMEM modules can cause that a lot more than regular
RAM.  The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.

Quoting Dan from another email:
 "The exposure can be reduced in the volatile-RAM case by scanning for
  and clearing errors before it is onlined as RAM. The userspace tooling
  for that can be in place before v5.1-final. There's also runtime
  notifications of errors via acpi_nfit_uc_error_notify() from
  background scrubbers on the DIMM devices. With that mechanism the
  kernel could proactively clear newly discovered poison in the volatile
  case, but that would be additional development more suitable for v5.2.

  I understand the concern, and the need to highlight this issue by
  tapping the brakes on feature development, but I don't see PMEM as RAM
  making the situation worse when the exposure is also there via DAX in
  the PMEM case. Volatile-RAM is arguably a safer use case since it's
  possible to repair pages where the persistent case needs active
  application coordination"

* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: "Hotplug" persistent memory for use like normal RAM
  mm/resource: Let walk_system_ram_range() search child resources
  mm/memory-hotplug: Allow memory resources to be children
  mm/resource: Move HMM pr_debug() deeper into resource code
  mm/resource: Return real error codes from walk failures
  device-dax: Add a 'modalias' attribute to DAX 'bus' devices
  device-dax: Add a 'target_node' attribute
  device-dax: Auto-bind device after successful new_id
  acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
  device-dax: Add /sys/class/dax backwards compatibility
  device-dax: Add support for a dax override driver
  device-dax: Move resource pinning+mapping into the common driver
  device-dax: Introduce bus + driver model
  device-dax: Start defining a dax bus model
  device-dax: Remove multi-resource infrastructure
  device-dax: Kill dax_region base
  device-dax: Kill dax_region ida
2019-03-16 13:05:32 -07:00
Linus Torvalds
8b0f9fa2e0 filemap: add a comment about FAULT_FLAG_RETRY_NOWAIT behavior
I thought Josef Bacik's patch to drop the mmap_sem was buggy, because
when looking at the error cases, there was one case where we returned
VM_FAULT_RETRY without actually dropping the mmap_sem.

Josef had to explain to me (using small words) that yes, that's actually
what we're supposed to do, and his patch was correct.  Which not only
convinced me he knew what he was doing and I should stop arguing with
him, but also that I should add a comment to the case I was confused
about.

Patiently-pointed-out-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-15 11:26:07 -07:00
Josef Bacik
6b4c9f4469 filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page
lock.  The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.

The problem with this is the assumption that the readahead does anything.
In the case that the box is under extreme memory or IO pressure we may end
up not reading anything at all for readahead, which means we will end up
reading in the page under the mmap_sem.

Even if the readahead does something, it could get throttled because of io
pressure on the system and the process is in a lower priority cgroup.

Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions.  Consider some large company that does a
lot of web traffic.  This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons.  Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.

Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time.  This includes while
issuing readahead, locking the page, or needing to call ->readpage because
readahead did not occur.  Then once we have a fully uptodate page we can
return with VM_FAULT_RETRY and come back again to find our nicely in-cache
page that was gotten outside of the mmap_sem.

This patch also adds a new helper for locking the page with the mmap_sem
dropped.  This doesn't make sense currently as generally speaking if the
page is already locked it'll have been read in (unless there was an error)
before it was unlocked.  However a forthcoming patchset will change this
with the ability to abort read-ahead bio's if necessary, making it more
likely that we could contend for a page lock and still have a not uptodate
page.  This allows us to deal with this case by grabbing the lock and
issuing the IO without the mmap_sem held, and then returning
VM_FAULT_RETRY to come back around.

[josef@toxicpanda.com: v6]
  Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
[kirill@shutemov.name: fix race in filemap_fault()]
  Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-15 11:21:25 -07:00
Josef Bacik
a75d4c3337 filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.

Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions.  Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.

We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such.  This involves running
'ps' or other such utilities.  These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.

This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer.  This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal.  But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.

In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO.  This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock.  With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.

This patch (of 3):

If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it.  This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return a unlocked page that's in
pagecache.  Then use the normal page locking and readpage logic already in
filemap_fault.  This simplifies the no page in page cache case
significantly.

[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
  Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-15 11:21:25 -07:00
Josef Bacik
2a1180f1bd filemap: pass vm_fault to the mmap ra helpers
All of the arguments to these functions come from the vmf.

Cut down on the amount of arguments passed by simply passing in the vmf
to these two helpers.

Link: http://lkml.kernel.org/r/20181211173801.29535-3-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 14:36:20 -07:00
Dennis Zhou
d33d9f3dd9 percpu: use chunk scan_hint to skip some scanning
Just like blocks, chunks now maintain a scan_hint. This can be used to
skip some scanning by promoting the scan_hint to be the contig_hint.
The chunk's scan_hint is primarily updated on the backside and relies on
full scanning when a block becomes free or the free region spans across
blocks.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Dennis Zhou
92c14cab43 percpu: convert chunk hints to be based on pcpu_block_md
As mentioned in the last patch, a chunk's hints are no different than a
block just responsible for more bits. This converts chunk level hints to
use a pcpu_block_md to maintain them. This lets us reuse the same hint
helper functions as a block. The left_free and right_free are unused by
the chunk's pcpu_block_md.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Dennis Zhou
047924c968 percpu: make pcpu_block_md generic
In reality, a chunk is just a block covering a larger number of bits.
The hints themselves are one in the same. Rather than maintaining the
hints separately, first introduce nr_bits to genericize
pcpu_block_update() to correctly maintain block->right_free. The next
patch will convert chunk hints to be managed as a pcpu_block_md.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Dennis Zhou
da3afdd5bb percpu: use block scan_hint to only scan forward
Blocks now remember the latest scan_hint. This can be used on the
allocation path as when a contig_hint is broken, we can promote the
scan_hint to the contig_hint and scan forward from there. This works
because pcpu_block_refresh_hint() is only called on the allocation path
while block free regions are updated manually in
pcpu_block_update_hint_free().

Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-13 12:25:31 -07:00
Dennis Zhou
b89462a9c5 percpu: remember largest area skipped during allocation
Percpu allocations attempt to do first fit by scanning forward from the
first_free of a block. However, fragmentation from allocation requests
can cause holes not seen by block hint update functions. To address
this, create a local version of bitmap_find_next_zero_area_off() that
remembers the largest area skipped over. The caveat is that it only sees
regions skipped over due to not fitting, not regions skipped due to
alignment.

Prior to updating the scan_hint, a scan backwards is done to try and
recover free bits skipped due to alignment. While this can cause
scanning to miss earlier possible free areas, smaller allocations will
eventually fill those holes due to first fit.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-13 12:25:31 -07:00
Dennis Zhou
382b88e961 percpu: add block level scan_hint
Fragmentation can cause both blocks and chunks to have an early
first_firee bit available, but only able to satisfy allocations much
later on. This patch introduces a scan_hint to help mitigate some
unnecessary scanning.

The scan_hint remembers the largest area prior to the contig_hint. If
the contig_hint == scan_hint, then scan_hint_start > contig_hint_start.
This is necessary for scan_hint discovery when refreshing a block.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Dennis Zhou
b239f7daf5 percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
Previously, block size was flexible based on the constraint that the
GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the
overhead that keeping a floating number of populated free pages required
scanning over the free regions of a chunk.

Setting the block size to be fixed at PAGE_SIZE lets us know when an
empty page becomes used as we will break a full contig_hint of a block.
This means we no longer have to scan the whole chunk upon breaking a
contig_hint which empty page management piggybacked off. A later patch
takes advantage of this to optimize the allocation path by only scanning
forward using the scan_hint introduced later too.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Dennis Zhou
8744d85942 percpu: relegate chunks unusable when failing small allocations
In certain cases, requestors of percpu memory may want specific
alignments. However, it is possible to end up in situations where the
contig_hint matches, but the alignment does not. This causes excess
scanning of chunks that will fail. To prevent this, if a small
allocation fails (< 32B), the chunk is moved to the empty list. Once an
allocation is freed from that chunk, it is placed back into rotation.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Dennis Zhou
3e54097beb percpu: manage chunks based on contig_bits instead of free_bytes
When a chunk becomes fragmented, it can end up having a large number of
small allocation areas free. The free_bytes sorting of chunks leads to
unnecessary checking of chunks that cannot satisfy the allocation.
Switch to contig_bits sorting to prevent scanning chunks that may not be
able to service the allocation request.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Dennis Zhou
d9f3a01eeb percpu: introduce helper to determine if two regions overlap
While block hints were always accurate, it's possible when spanning
across blocks that we miss updating the chunk's contig_hint. Rather than
rely on correctness of the boundaries of hints, do a full overlap
comparison.

A future patch introduces the scan_hint which makes the contig_hint
slightly fuzzy as they can at times be smaller than the actual hint.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-03-13 12:25:31 -07:00
Dennis Zhou
8c43004af0 percpu: do not search past bitmap when allocating an area
pcpu_find_block_fit() guarantees that a fit is found within
PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as
it compares against the block's contig_hint. This can lead to
incorrectly scanning past the end of the bitmap. The behavior was okay
given the check after for bit_off >= end and the correctness of the
hints from pcpu_find_block_fit().

This patch fixes this by bounding the end offset by the number of bits
in a chunk.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Dennis Zhou
8e5a2b9893 percpu: update free path with correct new free region
When updating the chunk's contig_hint on the free path of a hint that
does not touch the page boundaries, it was incorrectly using the
starting offset of the free region and the block's contig_hint. This
could lead to incorrect assumptions about fit given a size and better
alignment of the start. Fix this by using (end - start) as this is only
called when updating a hint within a block.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
2019-03-13 12:25:31 -07:00
Linus Torvalds
a667cb7a94 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things

 - the rest of MM

-  remove flex_arrays, replace with new simple radix-tree implementation

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits)
  Drop flex_arrays
  sctp: convert to genradix
  proc: commit to genradix
  generic radix trees
  selinux: convert to kvmalloc
  md: convert to kvmalloc
  openvswitch: convert to kvmalloc
  of: fix kmemleak crash caused by imbalance in early memory reservation
  mm: memblock: update comments and kernel-doc
  memblock: split checks whether a region should be skipped to a helper function
  memblock: remove memblock_{set,clear}_region_flags
  memblock: drop memblock_alloc_*_nopanic() variants
  memblock: memblock_alloc_try_nid: don't panic
  treewide: add checks for the return value of memblock_alloc*()
  swiotlb: add checks for the return value of memblock_alloc*()
  init/main: add checks for the return value of memblock_alloc*()
  mm/percpu: add checks for the return value of memblock_alloc*()
  sparc: add checks for the return value of memblock_alloc*()
  ia64: add checks for the return value of memblock_alloc*()
  arch: don't memset(0) memory returned by memblock_alloc()
  ...
2019-03-12 10:39:53 -07:00
Mike Rapoport
a2974133b7 mm: memblock: update comments and kernel-doc
* Remove comments mentioning bootmem
* Extend "DOC: memblock overview"
* Add kernel-doc comments for several more functions

[akpm@linux-foundation.org: fix copy-n-paste error]
Link: http://lkml.kernel.org/r/1549626347-25461-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport
c9a688a3e9 memblock: split checks whether a region should be skipped to a helper function
__next_mem_range() and __next_mem_range_rev() duplicate the code that
checks whether a region should be skipped because of node or flags
incompatibility.

Split this code into a helper function.

Link: http://lkml.kernel.org/r/1549455025-17706-3-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport
fe145124db memblock: remove memblock_{set,clear}_region_flags
The memblock API provides dedicated helpers to set or clear a flag on a
memory region, e.g.  memblock_{mark,clear}_hotplug().

The memblock_{set,clear}_region_flags() functions are used only by the
memblock internal function that adjusts the region flags.  Drop these
functions and use open-coded implementation instead.

Link: http://lkml.kernel.org/r/1549455025-17706-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport
26fb3dae0a memblock: drop memblock_alloc_*_nopanic() variants
As all the memblock allocation functions return NULL in case of error
rather than panic(), the duplicates with _nopanic suffix can be removed.

Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>		[printk]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport
c0dbe825a9 memblock: memblock_alloc_try_nid: don't panic
As all the memblock_alloc*() users are now checking the return value and
panic() in case of error, the panic() call can be removed from the core
memblock allocator, namely memblock_alloc_try_nid().

Link: http://lkml.kernel.org/r/1548057848-15136-21-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport
8a7f97b902 treewide: add checks for the return value of memblock_alloc*()
Add check for the return value of memblock_alloc*() functions and call
panic() in case of error.  The panic message repeats the one used by
panicing memblock allocators with adjustment of parameters to include
only relevant ones.

The replacement was mostly automated with semantic patches like the one
below with manual massaging of format strings.

  @@
  expression ptr, size, align;
  @@
  ptr = memblock_alloc(size, align);
  + if (!ptr)
  + 	panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align);

[anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type]
  Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org
[rppt@linux.ibm.com: fix format strings for panics after memblock_alloc]
  Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com
[rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails]
  Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx
[akpm@linux-foundation.org: fix xtensa printk warning]
Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: Guo Ren <ren_guo@c-sky.com>		[c-sky]
Acked-by: Paul Burton <paul.burton@mips.com>		[MIPS]
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390]
Reviewed-by: Juergen Gross <jgross@suse.com>		[Xen]
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Acked-by: Max Filippov <jcmvbkbc@gmail.com>		[xtensa]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport
f655f40537 mm/percpu: add checks for the return value of memblock_alloc*()
Add panic() calls if memblock_alloc() returns NULL.

The panic() format duplicates the one used by memblock itself and in
order to avoid explosion with long parameters list replace open coded
allocation size calculations with a local variable.

Link: http://lkml.kernel.org/r/1548057848-15136-17-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport
c366ea89fa memblock: make memblock_find_in_range_node() and choose_memblock_flags() static
These functions are not used outside memblock.  Make them static.

Link: http://lkml.kernel.org/r/1548057848-15136-12-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
92d12f9544 memblock: refactor internal allocation functions
Currently, memblock has several internal functions with overlapping
functionality.  They all call memblock_find_in_range_node() to find free
memory and then reserve the allocated range and mark it with kmemleak.
However, there is difference in the allocation constraints and in
fallback strategies.

The allocations returning physical address first attempt to find free
memory on the specified node within mirrored memory regions, then retry
on the same node without the requirement for memory mirroring and
finally fall back to all available memory.

The allocations returning virtual address start with clamping the
allowed range to memblock.current_limit, attempt to allocate from the
specified node from regions with mirroring and with user defined minimal
address.  If such allocation fails, next attempt is done with node
restriction lifted.  Next, the allocation is retried with minimal
address reset to zero and at last without the requirement for mirrored
regions.

Let's consolidate various fallbacks handling and make them more
consistent for physical and virtual variants.  Most of the fallback
handling is moved to memblock_alloc_range_nid() and it now handles node
and mirror fallbacks.

The memblock_alloc_internal() uses memblock_alloc_range_nid() to get a
physical address of the allocated range and converts it to virtual
address.

The fallback for allocation below the specified minimal address remains
in memblock_alloc_internal() because memblock_alloc_range_nid() is used
by CMA with exact requirement for lower bounds.

The memblock_phys_alloc_nid() function is completely dropped as it is not
used anywhere outside memblock and its only usage can be replaced by a
call to memblock_alloc_range_nid().

[rppt@linux.ibm.com: fix parameter order in memblock_phys_alloc_try_nid()]
  Link: http://lkml.kernel.org/r/20190203113915.GC8620@rapoport-lnx
Link: http://lkml.kernel.org/r/1548057848-15136-11-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
0ba9e6edd4 memblock: drop memblock_alloc_base()
The memblock_alloc_base() function tries to allocate a memory up to the
limit specified by its max_addr parameter and panics if the allocation
fails.  Replace its usage with memblock_phys_alloc_range() and make the
callers check the return value and panic in case of error.

Link: http://lkml.kernel.org/r/1548057848-15136-10-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
42b46aeff2 memblock: drop __memblock_alloc_base()
The __memblock_alloc_base() function tries to allocate a memory up to
the limit specified by its max_addr parameter.  Depending on the value
of this parameter, the __memblock_alloc_base() can is replaced with the
appropriate memblock_phys_alloc*() variant.

Link: http://lkml.kernel.org/r/1548057848-15136-9-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Rob Herring <robh@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
ecc3e771f4 memblock: memblock_phys_alloc(): don't panic
Make the memblock_phys_alloc() function an inline wrapper for
memblock_phys_alloc_range() and update the memblock_phys_alloc() callers
to check the returned value and panic in case of error.

Link: http://lkml.kernel.org/r/1548057848-15136-8-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
337555744e memblock: memblock_phys_alloc_try_nid(): don't panic
The memblock_phys_alloc_try_nid() function tries to allocate memory from
the requested node and then falls back to allocation from any node in
the system.  The memblock_alloc_base() fallback used by this function
panics if the allocation fails.

Replace the memblock_alloc_base() fallback with the direct call to
memblock_alloc_range_nid() and update the memblock_phys_alloc_try_nid()
callers to check the returned value and panic in case of error.

Link: http://lkml.kernel.org/r/1548057848-15136-7-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
8a770c2a83 memblock: emphasize that memblock_alloc_range() returns a physical address
Rename memblock_alloc_range() to memblock_phys_alloc_range() to
emphasize that it returns a physical address.

While on it, remove the 'enum memblock_flags' parameter from this
function as its only user anyway sets it to MEMBLOCK_NONE, which is the
default for the most of memblock allocations.

Link: http://lkml.kernel.org/r/1548057848-15136-6-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport
53d818d274 memblock: drop memblock_alloc_base_nid()
memblock_alloc_base_nid() is a oneliner wrapper for
memblock_alloc_range_nid() without any side effect.

Replace it's usage by the direct calls to memblock_alloc_range_nid().

Link: http://lkml.kernel.org/r/1548057848-15136-5-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Souptick Joarder
b57e622e6d mm/hmm: convert to use vm_fault_t
Convert to use vm_fault_t type as return type for fault handler.

kbuild reported warning during testing of
*mm-create-the-new-vm_fault_t-type.patch* available in below link -
https://patchwork.kernel.org/patch/10752741/

  kernel/memremap.c:46:34: warning: incorrect type in return expression
                           (different base types)
  kernel/memremap.c:46:34: expected restricted vm_fault_t
  kernel/memremap.c:46:34: got int

This patch has fixed the warnings and also hmm_devmem_fault() is
converted to return vm_fault_t to avoid further warnings.

[sfr@canb.auug.org.au: drm/nouveau/dmem: update for struct hmm_devmem_ops member change]
  Link: http://lkml.kernel.org/r/20190220174407.753d94e5@canb.auug.org.au
Link: http://lkml.kernel.org/r/20190110145900.GA1317@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:00 -07:00
Linus Torvalds
d14d7f14f1 xen: fixes and features for 5.1-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCXIYrgwAKCRCAXGG7T9hj
 viyuAP4/bKpQ8QUp2V6ddkyEG4NTkA7H87pqQQsxJe9sdoyRRwD5AReS7oitoRS/
 cm6SBpwdaPRX/hfVvT2/h1GWxkvDFgA=
 =8Zfa
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.1a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:
 "xen fixes and features:

   - remove fallback code for very old Xen hypervisors

   - three patches for fixing Xen dom0 boot regressions

   - an old patch for Xen PCI passthrough which was never applied for
     unknown reasons

   - some more minor fixes and cleanup patches"

* tag 'for-linus-5.1a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: fix dom0 boot on huge systems
  xen, cpu_hotplug: Prevent an out of bounds access
  xen: remove pre-xen3 fallback handlers
  xen/ACPI: Switch to bitmap_zalloc()
  x86/xen: dont add memory above max allowed allocation
  x86: respect memory size limiting via mem= parameter
  xen/gntdev: Check and release imported dma-bufs on close
  xen/gntdev: Do not destroy context while dma-bufs are in use
  xen/pciback: Don't disable PCI_COMMAND on PCI device reset.
  xen-scsiback: mark expected switch fall-through
  xen: mark expected switch fall-through
2019-03-11 17:08:14 -07:00
Linus Torvalds
a50243b1dd 5.1 Merge Window Pull Request
This has been a slightly more active cycle than normal with ongoing core
 changes and quite a lot of collected driver updates.
 
 - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe
 
 - A new data transfer mode for HFI1 giving higher performance
 
 - Significant functional and bug fix update to the mlx5 On-Demand-Paging MR
   feature
 
 - A chip hang reset recovery system for hns
 
 - Change mm->pinned_vm to an atomic64
 
 - Update bnxt_re to support a new 57500 chip
 
 - A sane netlink 'rdma link add' method for creating rxe devices and fixing
   the various unregistration race conditions in rxe's unregister flow
 
 - Allow lookup up objects by an ID over netlink
 
 - Various reworking of the core to driver interface:
   * Drivers should not assume umem SGLs are in PAGE_SIZE chunks
   * ucontext is accessed via udata not other means
   * Start to make the core code responsible for object memory
     allocation
   * Drivers should convert struct device to struct ib_device
     via a helper
   * Drivers have more tools to avoid use after unregister problems
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlyAJYYACgkQOG33FX4g
 mxrWwQ/+OyAx4Moru7Aix0C6GWxTJp/wKgw21CS3reZxgLai6x81xNYG/s2wCNjo
 IccObVd7mvzyqPdxOeyHBsJBbQDqWvoD6O2duH8cqGMgBRgh3CSdUep2zLvPpSAx
 2W1SvWYCLDnCuarboFrCA8c4AN3eCZiqD7z9lHyFQGjy3nTUWzk1uBaOP46uaiMv
 w89N8EMdXJ/iY6ONzihvE05NEYbMA8fuvosKLLNdghRiHIjbMQU8SneY23pvyPDd
 ZziPu9NcO3Hw9OVbkwtJp47U3KCBgvKHmnixyZKkikjiD+HVoABw2IMwcYwyBZwP
 Bic/ddONJUvAxMHpKRnQaW7znAiHARk21nDG28UAI7FWXH/wMXgicMp6LRcNKqKF
 vqXdxHTKJb0QUR4xrYI+eA8ihstss7UUpgSgByuANJ0X729xHiJtlEvPb1DPo1Dz
 9CB4OHOVRl5O8sA5Jc6PSusZiKEpvWoyWbdmw0IiwDF5pe922VLl5Nv88ta+sJ38
 v2Ll5AgYcluk7F3599Uh9D7gwp5hxW2Ph3bNYyg2j3HP4/dKsL9XvIJPXqEthgCr
 3KQS9rOZfI/7URieT+H+Mlf+OWZhXsZilJG7No0fYgIVjgJ00h3SF1/299YIq6Qp
 9W7ZXBfVSwLYA2AEVSvGFeZPUxgBwHrSZ62wya4uFeB1jyoodPk=
 =p12E
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "This has been a slightly more active cycle than normal with ongoing
  core changes and quite a lot of collected driver updates.

   - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe

   - A new data transfer mode for HFI1 giving higher performance

   - Significant functional and bug fix update to the mlx5
     On-Demand-Paging MR feature

   - A chip hang reset recovery system for hns

   - Change mm->pinned_vm to an atomic64

   - Update bnxt_re to support a new 57500 chip

   - A sane netlink 'rdma link add' method for creating rxe devices and
     fixing the various unregistration race conditions in rxe's
     unregister flow

   - Allow lookup up objects by an ID over netlink

   - Various reworking of the core to driver interface:
       - drivers should not assume umem SGLs are in PAGE_SIZE chunks
       - ucontext is accessed via udata not other means
       - start to make the core code responsible for object memory
         allocation
       - drivers should convert struct device to struct ib_device via a
         helper
       - drivers have more tools to avoid use after unregister problems"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
  net/mlx5: ODP support for XRC transport is not enabled by default in FW
  IB/hfi1: Close race condition on user context disable and close
  RDMA/umem: Revert broken 'off by one' fix
  RDMA/umem: minor bug fix in error handling path
  RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
  cxgb4: kfree mhp after the debug print
  IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
  IB/rdmavt: Fix loopback send with invalidate ordering
  IB/iser: Fix dma_nents type definition
  IB/mlx5: Set correct write permissions for implicit ODP MR
  bnxt_re: Clean cq for kernel consumers only
  RDMA/uverbs: Don't do double free of allocated PD
  RDMA: Handle ucontext allocations by IB/core
  RDMA/core: Fix a WARN() message
  bnxt_re: fix the regression due to changes in alloc_pbl
  IB/mlx4: Increase the timeout for CM cache
  IB/core: Abort page fault handler silently during owning process exit
  IB/mlx5: Validate correct PD before prefetch MR
  IB/mlx5: Protect against prefetch of invalid MR
  RDMA/uverbs: Store PR pointer before it is overwritten
  ...
2019-03-09 15:53:03 -08:00
Linus Torvalds
f86727f8bd Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm cleanup from Ingo Molnar:
 "A single GUP cleanup"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm/gup: Remove the 'write' parameter from gup_fast_permitted()
2019-03-07 17:43:58 -08:00
Linus Torvalds
8d521d94da Merge branch 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
 "There are 2 minor changes to the percpu allocator this merge window:

   - for loop condition that could be out of bounds on multi-socket UP

   - cosmetic removal of pcpu_group_offsets[0] in UP code as it is 0

  There has been an interest in having better alignment with percpu
  allocations. This has caused a performance regression in at least one
  reported workload. I have a series out which adds scan hints to the
  allocator as well as some other performance oriented changes. I hope
  to have this queued for v5.2 soon"

* 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu: km: no need to consider pcpu_group_offsets[0]
  percpu: use nr_groups as check condition
2019-03-07 10:06:46 -08:00
Oscar Salvador
ea2c3f6f55 mm,mremap: bail out earlier in mremap_to under map pressure
When using mremap() syscall in addition to MREMAP_FIXED flag, mremap()
calls mremap_to() which does the following:

1) unmaps the destination region where we are going to move the map
2) If the new region is going to be smaller, we unmap the last part
   of the old region

Then, we will eventually call move_vma() to do the actual move.

move_vma() checks whether we are at least 4 maps below max_map_count
before going further, otherwise it bails out with -ENOMEM.  The problem
is that we might have already unmapped the vma's in steps 1) and 2), so
it is not possible for userspace to figure out the state of the vmas
after it gets -ENOMEM, and it gets tricky for userspace to clean up
properly on error path.

While it is true that we can return -ENOMEM for more reasons (e.g: see
may_expand_vm() or move_page_tables()), I think that we can avoid this
scenario if we check early in mremap_to() if the operation has high
chances to succeed map-wise.

Should that not be the case, we can bail out before we even try to unmap
anything, so we make sure the vma's are left untouched in case we are
likely to be short of maps.

The thumb-rule now is to rely on the worst-scenario case we can have.
That is when both vma's (old region and new region) are going to be
split in 3, so we get two more maps to the ones we already hold (one per
each).  If current map count + 2 maps still leads us to 4 maps below the
threshold, we are going to pass the check in move_vma().

Of course, this is not free, as it might generate false positives when
it is true that we are tight map-wise, but the unmap operation can
release several vma's leading us to a good state.

Another approach was also investigated [1], but it may be too much
hassle for what it brings.

[1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/

Link: http://lkml.kernel.org/r/20190226091314.18446-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Qian Cai
d778015ac9 mm/sparse: fix a bad comparison
next_present_section_nr() could only return an unsigned number -1, so
just check it specifically where compilers will convert -1 to unsigned
if needed.

  mm/sparse.c: In function 'sparse_init_nid':
  mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
         ((section_nr >= 0) &&    \
                      ^~
  mm/sparse.c:478:2: note: in expansion of macro
  'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin, pnum) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
         ((section_nr >= 0) &&    \
                      ^~
  mm/sparse.c:497:2: note: in expansion of macro
  'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin, pnum) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/sparse.c: In function 'sparse_init':
  mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
         ((section_nr >= 0) &&    \
                      ^~
  mm/sparse.c:520:2: note: in expansion of macro
  'for_each_present_section_nr'
    for_each_present_section_nr(pnum_begin + 1, pnum_end) {
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~

Link: http://lkml.kernel.org/r/20190228181839.86504-1-cai@lca.pw
Fixes: c4e1be9ec1 ("mm, sparsemem: break out of loops early")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Jan Stancek
fc8efd2ddf mm/memory.c: do_fault: avoid usage of stale vm_area_struct
LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test, where one thread mmaps/writes/munmaps memory area
and other thread is trying to read from it:

  CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
  Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
  Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
  Call Trace:
  ([<0000000000000000>]           (null))
   [<00000000001adae4>] lock_acquire+0xec/0x258
   [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
   [<000000000012a780>] page_table_free+0x48/0x1a8
   [<00000000002f6e54>] do_fault+0xdc/0x670
   [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
   [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
   [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
   [<000000000080e5ee>] pgm_check_handler+0x19e/0x200

page_table_free() is called with NULL mm parameter, but because "0" is a
valid address on s390 (see S390_lowcore), it keeps going until it
eventually crashes in lockdep's lock_acquire.  This crash is
reproducible at least since 4.14.

Problem is that "vmf->vma" used in do_fault() can become stale.  Because
mmap_sem may be released, other threads can come in, call munmap() and
cause "vma" be returned to kmem cache, and get zeroed/re-initialized and
re-used:

handle_mm_fault                           |
  __handle_mm_fault                       |
    do_fault                              |
      vma = vmf->vma                      |
      do_read_fault                       |
        __do_fault                        |
          vma->vm_ops->fault(vmf);        |
            mmap_sem is released          |
                                          |
                                          | do_munmap()
                                          |   remove_vma_list()
                                          |     remove_vma()
                                          |       vm_area_free()
                                          |         # vma is released
                                          | ...
                                          | # same vma is allocated
                                          | # from kmem cache
                                          | do_mmap()
                                          |   vm_area_alloc()
                                          |     memset(vma, 0, ...)
                                          |
      pte_free(vma->vm_mm, ...);          |
        page_table_free                   |
          spin_lock_bh(&mm->context.lock);|
            <crash>                       |

Cache mm_struct to avoid using potentially stale "vma".

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Qian Cai
70516b936b mm/huge_memory.c: fix "orig_pud" set but not used
Commit a00cc7d9dd ("mm, x86: add support for PUD-sized transparent
hugepages") introduced pudp_huge_get_and_clear_full() but no one uses
its return code.

In order to not diverge from pmdp_huge_get_and_clear_full(), just change
zap_huge_pud() to not assign the return value from
pudp_huge_get_and_clear_full().

  mm/huge_memory.c: In function 'zap_huge_pud':
  mm/huge_memory.c:1982:8: warning: variable 'orig_pud' set but not used [-Wunused-but-set-variable]
    pud_t orig_pud;
          ^~~~~~~~

Link: http://lkml.kernel.org/r/20190301221956.97493-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Qian Cai
cd02cf1ace mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
When onlining a memory block with DEBUG_PAGEALLOC, it unmaps the pages
in the block from kernel, However, it does not map those pages while
offlining at the beginning.  As the result, it triggers a panic below
while onlining on ppc64le as it checks if the pages are mapped before
unmapping.  However, the imbalance exists for all arches where
double-unmappings could happen.  Therefore, let kernel map those pages
in generic_online_page() before they have being freed into the page
allocator for the first time where it will set the page count to one.

On the other hand, it works fine during the boot, because at least for
IBM POWER8, it does,

early_setup
  early_init_mmu
    harsh__early_init_mmu
      htab_initialize [1]
        htab_bolt_mapping [2]

where it effectively map all memblock regions just like
kernel_map_linear_page(), so later mem_init() -> memblock_free_all()
will unmap them just fine without any imbalance.  On other arches
without this imbalance checking, it still unmap them once at the most.

[1]
for_each_memblock(memory, reg) {
        base = (unsigned long)__va(reg->base);
        size = reg->size;

        DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
                base, size, prot);

        BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
                prot, mmu_linear_psize, mmu_kernel_ssize));
        }

[2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
    kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA pSeries
    CPU: 2 PID: 4298 Comm: bash Not tainted 5.0.0-rc7+ #15
    NIP:  c000000000062670 LR: c00000000006265c CTR: 0000000000000000
    REGS: c0000005bf8a75b0 TRAP: 0700   Not tainted  (5.0.0-rc7+)
    MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28422842
    XER: 00000000
    CFAR: c000000000804f44 IRQMASK: 1
    NIP [c000000000062670] __kernel_map_pages+0x2e0/0x4f0
    LR [c00000000006265c] __kernel_map_pages+0x2cc/0x4f0
    Call Trace:
       __kernel_map_pages+0x2cc/0x4f0
       free_unref_page_prepare+0x2f0/0x4d0
       free_unref_page+0x44/0x90
       __online_page_free+0x84/0x110
       online_pages_range+0xc0/0x150
       walk_system_ram_range+0xc8/0x120
       online_pages+0x280/0x5a0
       memory_subsys_online+0x1b4/0x270
       device_online+0xc0/0xf0
       state_store+0xc0/0x180
       dev_attr_store+0x3c/0x60
       sysfs_kf_write+0x70/0xb0
       kernfs_fop_write+0x10c/0x250
       __vfs_write+0x48/0x240
       vfs_write+0xd8/0x210
       ksys_write+0x70/0x120
       system_call+0x5c/0x70

Link: http://lkml.kernel.org/r/20190301220814.97339-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Qian Cai
82ede7ee38 mm/memcontrol.c: fix bad line in comment
Commit 230671533d ("mm: memory.low hierarchical behavior") missed an
asterisk in one of the comments.

  mm/memcontrol.c:5774: warning: bad line:                | 0, otherwise.

Link: http://lkml.kernel.org/r/20190301143734.94393-1-cai@lca.pw
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Peng Fan
0d3bd18a5e mm/cma.c: cma_declare_contiguous: correct err handling
In case cma_init_reserved_mem failed, need to free the memblock
allocated by memblock_reserve or memblock_alloc_range.

Quote Catalin's comments:
  https://lkml.org/lkml/2019/2/26/482

Kmemleak is supposed to work with the memblock_{alloc,free} pair and it
ignores the memblock_reserve() as a memblock_alloc() implementation
detail. It is, however, tolerant to memblock_free() being called on
a sub-range or just a different range from a previous memblock_alloc().
So the original patch looks fine to me. FWIW:

Link: http://lkml.kernel.org/r/20190227144631.16708-1-peng.fan@nxp.com
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Qian Cai
0c81585499 mm/page_ext.c: fix an imbalance with kmemleak
After offlining a memory block, kmemleak scan will trigger a crash, as
it encounters a page ext address that has already been freed during
memory offlining.  At the beginning in alloc_page_ext(), it calls
kmemleak_alloc(), but it does not call kmemleak_free() in
free_page_ext().

    BUG: unable to handle kernel paging request at ffff888453d00000
    PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800ffffbac2ff060
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
    Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
    RIP: 0010:scan_block+0xb5/0x290
    Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
    RSP: 0018:ffff8881ec57f8e0 EFLAGS: 00010082
    RAX: 0000000000000000 RBX: ffff888453d00000 RCX: ffffffffa61e5a54
    RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888453d00000
    RBP: ffff8881ec57f920 R08: fffffbfff4ed588d R09: fffffbfff4ed588c
    R10: fffffbfff4ed588c R11: ffffffffa76ac463 R12: dffffc0000000000
    R13: ffff888453d00ff9 R14: ffff8881f80cef48 R15: ffff8881f80cef48
    FS:  00007f6c0e3f8740(0000) GS:ffff8881f7680000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff888453d00000 CR3: 00000001c4244003 CR4: 00000000001606a0
    Call Trace:
     scan_gray_list+0x269/0x430
     kmemleak_scan+0x5a8/0x10f0
     kmemleak_write+0x541/0x6ca
     full_proxy_write+0xf8/0x190
     __vfs_write+0xeb/0x980
     vfs_write+0x15a/0x4f0
     ksys_write+0xd2/0x1b0
     __x64_sys_write+0x73/0xb0
     do_syscall_64+0xeb/0xaaa
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f6c0dad73b8
    Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
    RSP: 002b:00007ffd5b863cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f6c0dad73b8
    RDX: 0000000000000005 RSI: 000055a9216e1710 RDI: 0000000000000001
    RBP: 000055a9216e1710 R08: 000000000000000a R09: 00007ffd5b863840
    R10: 000000000000000a R11: 0000000000000246 R12: 00007f6c0dda9780
    R13: 0000000000000005 R14: 00007f6c0dda4740 R15: 0000000000000005
    Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod efivarfs
    CR2: ffff888453d00000
    ---[ end trace ccf646c7456717c5 ]---
    Kernel panic - not syncing: Fatal exception
    Shutting down cpus with NMI
    Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range:
    0xffffffff80000000-0xffffffffbfffffff)
    ---[ end Kernel panic - not syncing: Fatal exception ]---

Link: http://lkml.kernel.org/r/20190227173147.75650-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Andrey Ryabinin
5f438eee8f mm/compaction: pass pgdat to too_many_isolated() instead of zone
too_many_isolated() in mm/compaction.c looks only at node state, so it
makes more sense to change argument to pgdat instead of zone.

Link: http://lkml.kernel.org/r/20190228083329.31892-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Andrey Ryabinin
f4b7e272b5 mm: remove zone_lru_lock() function, access ->lru_lock directly
We have common pattern to access lru_lock from a page pointer:
	zone_lru_lock(page_zone(page))

Which is silly, because it unfolds to this:
	&NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
while we can simply do
	&NODE_DATA(page_to_nid(page))->lru_lock

Remove zone_lru_lock() function, since it's only complicate things.  Use
'page_pgdat(page)->lru_lock' pattern instead.

[aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
  Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Andrey Ryabinin
a7ca12f9d9 mm/workingset: remove unused @mapping argument in workingset_eviction()
workingset_eviction() doesn't use and never did use the @mapping
argument.  Remove it.

Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Gustavo A. R. Silva
960087445c mm/swapfile.c: use struct_size() in kvzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array.  For example:

  struct foo {
      int stuff;
      struct boo entry[];
  };

  size = sizeof(struct foo) + count * sizeof(struct boo);
  instance = kvzalloc(size, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

  instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL);

Notice that, in this case, variable size is not necessary, hence it is
removed.

This code was detected with the help of Coccinelle.

Link: http://lkml.kernel.org/r/20190221154622.GA19599@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Yue Hu
5a7f1b2f2f mm/cma_debug.c: remove static scoped cma_debugfs_root
Currently cma_debugfs_root is static storage.  That is unnecessary since
it will be only used by next cma_debugfs_add_one().  We can just pass it
to following calling to save thisspace.  Also remove useless idx
parameter.

Link: http://lkml.kernel.org/r/20190221040130.8940-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Yu Zhao
5d3ee42f8f mm/shmem: make find_get_pages_range() work for huge page
find_get_pages_range() and find_get_pages_range_tag() already correctly
increment reference count on head when seeing compound page, but they
may still use page index from tail.  Page index from tail is always
zero, so these functions don't work on huge shmem.  This hasn't been a
problem because, AFAIK, nobody calls these functions on (huge) shmem.
Fix them anyway just in case.

Link: http://lkml.kernel.org/r/20190110030838.84446-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Christoph Hellwig
afa0011289 mm: unexport free_reserved_area
This function is only used by built-in code, which makes perfect sense
given the purpose of it.

Link: http://lkml.kernel.org/r/20190213174621.29297-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Oscar Salvador
daf3538ad5 mm,memory_hotplug: explicitly pass the head to isolate_huge_page
isolate_huge_page() expects we pass the head of hugetlb page to it:

  bool isolate_huge_page(...)
  {
	...
	VM_BUG_ON_PAGE(!PageHead(page), page);
	...
  }

While I really cannot think of any situation where we end up with a
non-head page between hands in do_migrate_range(), let us make sure the
code is as sane as possible by explicitly passing the Head.  Since we
already got the pointer, it does not take us extra effort.

Link: http://lkml.kernel.org/r/20190208090604.975-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Jan Kara
f900482da5 mm/migrate.c: cleanup expected_page_refs()
Andrea has noted that page migration code propagates page_mapping(page)
through the whole migration stack down to migrate_page() function so it
seems stupid to then use page_mapping(page) in expected_page_refs()
instead of passed down 'mapping' argument.  I agree so let's make
expected_page_refs() more in line with the rest of the migration stack.

Link: http://lkml.kernel.org/r/20190207112314.24872-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Mike Rapoport
a862f68a8b docs/core-api/mm: fix return value descriptions in mm/
Many kernel-doc comments in mm/ have the return value descriptions
either misformatted or omitted at all which makes kernel-doc script
unhappy:

$ make V=1 htmldocs
...
./mm/util.c:36: info: Scanning doc for kstrdup
./mm/util.c:41: warning: No description found for return value of 'kstrdup'
./mm/util.c:57: info: Scanning doc for kstrdup_const
./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
./mm/util.c:75: info: Scanning doc for kstrndup
./mm/util.c:83: warning: No description found for return value of 'kstrndup'
...

Fixing the formatting and adding the missing return value descriptions
eliminates ~100 such warnings.

Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Mike Rapoport
92eac16819 docs/mm: vmalloc: re-indent kernel-doc comemnts
Some kernel-doc comments in mm/vmalloc.c have leading tab in
indentation.  This leads to excessive indentation in the generated HTML
and to the inconsistency of its layout ([1] vs [2]).

Besides, multi-line Note: sections are not handled properly with extra
indentation.

[1] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vm_map_ram
[2] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vfree

Link: http://lkml.kernel.org/r/1549549644-4903-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Alexey Dobriyan
ce0725f78a numa: make "nr_online_nodes" unsigned int
Number of online NUMA nodes can't be negative as well.  This doesn't
save space as the variable is used only in 32-bit context, but do it
anyway for consistency.

Link: http://lkml.kernel.org/r/20190201223151.GB15820@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Alexey Dobriyan
b9726c26dc numa: make "nr_node_ids" unsigned int
Number of NUMA nodes can't be negative.

This saves a few bytes on x86_64:

	add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
	Function                                     old     new   delta
	hv_synic_alloc.cold                           88     110     +22
	prealloc_shrinker                            260     262      +2
	bootstrap                                    249     251      +2
	sched_init_numa                             1566    1567      +1
	show_slab_objects                            778     777      -1
	s_show                                      1201    1200      -1
	kmem_cache_init                              346     345      -1
	__alloc_workqueue_key                       1146    1145      -1
	mem_cgroup_css_alloc                        1614    1612      -2
	__do_sys_swapon                             4702    4699      -3
	__list_lru_init                              655     651      -4
	nic_probe                                   2379    2374      -5
	store_user_store                             118     111      -7
	red_zone_store                               106      99      -7
	poison_store                                 106      99      -7
	wq_numa_init                                 348     338     -10
	__kmem_cache_empty                            75      65     -10
	task_numa_free                               186     173     -13
	merge_across_nodes_store                     351     336     -15
	irq_create_affinity_masks                   1261    1246     -15
	do_numa_crng_init                            343     321     -22
	task_numa_fault                             4760    4737     -23
	swapfile_init                                179     156     -23
	hv_synic_alloc                               536     492     -44
	apply_wqattrs_prepare                        746     695     -51

Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Tetsuo Handa
d342a0b386 mm,oom: don't kill global init via memory.oom.group
Since setting global init process to some memory cgroup is technically
possible, oom_kill_memcg_member() must check it.

  Tasks in /test1 are going to be killed due to memory.oom.group set
  Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
  oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[10485760];
	static int pipe_fd[2] = { EOF, EOF };
	unsigned int i;
	int fd;
	char buf[64] = { };
	if (pipe(pipe_fd))
		return 1;
	if (chdir("/sys/fs/cgroup/"))
		return 1;
	fd = open("cgroup.subtree_control", O_WRONLY);
	write(fd, "+memory", 7);
	close(fd);
	mkdir("test1", 0755);
	fd = open("test1/memory.oom.group", O_WRONLY);
	write(fd, "1", 1);
	close(fd);
	fd = open("test1/cgroup.procs", O_WRONLY);
	write(fd, "1", 1);
	snprintf(buf, sizeof(buf) - 1, "%d", getpid());
	write(fd, buf, strlen(buf));
	close(fd);
	snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
	fd = open("test1/memory.max", O_WRONLY);
	write(fd, buf, strlen(buf));
	close(fd);
	for (i = 0; i < 10; i++)
		if (fork() == 0) {
			char c;
			close(pipe_fd[1]);
			read(pipe_fd[0], &c, 1);
			memset(buffer, 0, sizeof(buffer));
			sleep(3);
			_exit(0);
		}
	close(pipe_fd[0]);
	close(pipe_fd[1]);
	sleep(3);
	return 0;
}

[   37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[   37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
[   37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   37.062954][ T9185] Call Trace:
[   37.063976][ T9185]  dump_stack+0x67/0x95
[   37.065263][ T9185]  dump_header+0x51/0x570
[   37.066619][ T9185]  ? trace_hardirqs_on+0x3f/0x110
[   37.068171][ T9185]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
[   37.069967][ T9185]  oom_kill_process+0x18d/0x210
[   37.071515][ T9185]  out_of_memory+0x11b/0x380
[   37.072936][ T9185]  mem_cgroup_out_of_memory+0xb6/0xd0
[   37.074601][ T9185]  try_charge+0x790/0x820
[   37.076021][ T9185]  mem_cgroup_try_charge+0x42/0x1d0
[   37.077629][ T9185]  mem_cgroup_try_charge_delay+0x11/0x30
[   37.079370][ T9185]  do_anonymous_page+0x105/0x5e0
[   37.080939][ T9185]  __handle_mm_fault+0x9cb/0x1070
[   37.082485][ T9185]  handle_mm_fault+0x1b2/0x3a0
[   37.083819][ T9185]  ? handle_mm_fault+0x47/0x3a0
[   37.085181][ T9185]  __do_page_fault+0x255/0x4c0
[   37.086529][ T9185]  do_page_fault+0x28/0x260
[   37.087788][ T9185]  ? page_fault+0x8/0x30
[   37.088978][ T9185]  page_fault+0x1e/0x30
[   37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
[   37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
[   37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
[   37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
[   37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
[   37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
[   37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
[   37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
[   37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
[   37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[   37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[   37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
[   37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
[   37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
[   37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
[   37.132833][   T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
[   37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
[   37.144328][   T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[   37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
[   37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[   37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
[   37.157306][   T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
[   37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[   37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
[   37.160083][   T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   37.160187][   T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   37.206941][   T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
[   37.212317][   T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
[   37.227667][   T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
[   37.351843][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[   37.354833][    T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
[   37.357876][    T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   37.361685][    T1] Call Trace:
[   37.363239][    T1]  dump_stack+0x67/0x95
[   37.365010][    T1]  panic+0xfc/0x2b0
[   37.366853][    T1]  do_exit+0xd55/0xd60
[   37.368595][    T1]  do_group_exit+0x47/0xc0
[   37.370415][    T1]  get_signal+0x32a/0x920
[   37.372449][    T1]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
[   37.374596][    T1]  do_signal+0x32/0x6e0
[   37.376430][    T1]  ? exit_to_usermode_loop+0x26/0x9b
[   37.378418][    T1]  ? prepare_exit_to_usermode+0xa8/0xd0
[   37.380571][    T1]  exit_to_usermode_loop+0x3e/0x9b
[   37.382588][    T1]  prepare_exit_to_usermode+0xa8/0xd0
[   37.384594][    T1]  ? page_fault+0x8/0x30
[   37.386453][    T1]  retint_user+0x8/0x18
[   37.388160][    T1] RIP: 0033:0x7f42c06974a8
[   37.389922][    T1] Code: Bad RIP value.
[   37.391788][    T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
[   37.394075][    T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
[   37.396963][    T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
[   37.399550][    T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
[   37.402334][    T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
[   37.404890][    T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0

Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
Fixes: 3d8b38eb81 ("mm, oom: introduce memory.oom.group")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Daniel Jordan
c10d38cc8d mm, swap: bounds check swap_info array accesses to avoid NULL derefs
Dan Carpenter reports a potential NULL dereference in
get_swap_page_of_type:

  Smatch complains that the NULL checks on "si" aren't consistent.  This
  seems like a real bug because we have not ensured that the type is
  valid and so "si" can be NULL.

Add the missing check for NULL, taking care to use a read barrier to
ensure CPU1 observes CPU0's updates in the correct order:

     CPU0                           CPU1
     alloc_swap_info()              if (type >= nr_swapfiles)
       swap_info[type] = p              /* handle invalid entry */
       smp_wmb()                    smp_rmb()
       ++nr_swapfiles               p = swap_info[type]

Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
CPU0's write to swap_info[type] and read NULL from swap_info[type].

Ying Huang noticed other places in swapfile.c don't order these reads
properly.  Introduce swap_type_to_swap_info to encourage correct usage.

Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
(see tools/memory-model/Documentation/explanation.txt).

This ordering need not be enforced in places where swap_lock is held
(e.g.  si_swapinfo) because swap_lock serializes updates to nr_swapfiles
and the swap_info array.

Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
Fixes: ec8acf20af ("swap: add per-partition lock for swapfile")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Kirill Tkhai
060f005f07 mm/vmscan.c: do not allocate duplicate stack variables in shrink_page_list()
On path shrink_inactive_list() ---> shrink_page_list() we allocate stack
variables for the statistics twice.  This is completely useless, and
this just consumes stack much more, then we really need.

The patch kills duplicate stack variables from shrink_page_list(), and
this reduce stack usage and object file size significantly:

Stack usage:
  Before: vmscan.c:1122:22:shrink_page_list	648	static
  After:  vmscan.c:1122:22:shrink_page_list	616	static

Size of vmscan.o:
           text	   data	    bss	    dec	    hex	filename
  Before: 56866	   4720	    128	  61714	   f112	mm/vmscan.o
  After:  56770	   4720	    128	  61618	   f0b2	mm/vmscan.o

Link: http://lkml.kernel.org/r/154894900030.5211.12104993874109647641.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Yang Shi
2cee57d1b0 mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for the suitable KSM page,
but the KSM page might be locked for a while due to i.e.  KSM page rmap
walk.  Basically it is not a big deal since commit 2c653d0ee2 ("ksm:
introduce ksm_max_page_sharing per page deduplication limit"), since
max_page_sharing limits the number of shared KSM pages.

But it still sounds not worth waiting for the lock, the page can be
skip, then try to merge it in the next scan to avoid potential stall if
its content is still intact.

Introduce trylock mode to get_ksm_page() to not block on page lock, like
what try_to_merge_one_page() does.  And, define three possible
operations (nolock, lock and trylock) as enum type to avoid stacking up
bools and make the code more readable.

Return -EBUSY if trylock fails, since NULL means not find suitable KSM
page, which is a valid case.

With the default max_page_sharing setting (256), there is almost no
observed change comparing lock vs trylock.

However, with ksm02 of LTP, the reduced ksmd full scan time can be
observed, which has set max_page_sharing to 786432.  With lock version,
ksmd may tak 10s - 11s to run two full scans, with trylock version ksmd
may take 8s - 11s to run two full scans.  And, the number of
pages_sharing and pages_to_scan keep same.  Basically, this change has
no harm.

[hughd@google.com: fix BUG_ON()]
  Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182122280.6914@eggly.anvils
Link: http://lkml.kernel.org/r/1548793753-62377-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Chris Down
1ff9e6e179 mm: memcontrol: expose THP events on a per-memcg basis
Currently THP allocation events data is fairly opaque, since you can
only get it system-wide.  This patch makes it easier to reason about
transparent hugepage behaviour on a per-memcg basis.

For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1,
which is used for v1's rss_huge [sic].  This is reused here as it's
fairly involved to untangle NR_ANON_THPS right now to make it per-memcg,
since right now some of this is delegated to rmap before we have any
memcg actually assigned to the page.  It's a good idea to rework that,
but let's leave untangling THP allocation for a future patch.

[akpm@linux-foundation.org: fix build]
[chris@chrisdown.name: fix memcontrol build when THP is disabled]
  Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name
Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Yang Shi
2bb0f34fe3 mm: vmscan: do not iterate all mem cgroups for global direct reclaim
In current implementation, both kswapd and direct reclaim has to iterate
all mem cgroups.  It is not a problem before offline mem cgroups could
be iterated.  But, currently with iterating offline mem cgroups, it
could be very time consuming.  In our workloads, we saw over 400K mem
cgroups accumulated in some cases, only a few hundred are online memcgs.
Although kswapd could help out to reduce the number of memcgs, direct
reclaim still get hit with iterating a number of offline memcgs in some
cases.  We experienced the responsiveness problems due to this
occassionally.

A simple test with pref shows it may take around 220ms to iterate 8K
memcgs in direct reclaim:
             dd 13873 [011]   578.542919: vmscan:mm_vmscan_direct_reclaim_begin
             dd 13873 [011]   578.758689: vmscan:mm_vmscan_direct_reclaim_end
So for 400K, it may take around 11 seconds to iterate all memcgs.

Here just break the iteration once it reclaims enough pages as what
memcg direct reclaim does.  This may hurt the fairness among memcgs.
But the cached iterator cookie could help to achieve the fairness more
or less.

Link: http://lkml.kernel.org/r/1548799877-10949-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Joel Fernandes (Google)
ab3948f58f mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions.  We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it.  Note staging
drivers are also not ABI and generally can be removed at anytime.

One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then add protection against making any
"future" writes while keeping the existing already mmap'ed
writeable-region active.  This allows us to implement a usecase where
receivers of the shared memory buffer can get a read-only view, while
the sender continues to write to the buffer.  See CursorWindow
documentation in Android for more details:

  https://developer.android.com/reference/android/database/CursorWindow

This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active.

A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week
where we don't need to modify core VFS structures to get the same
behavior of the seal.  This solves several side-effects pointed by Andy.
self-tests are provided in later patch to verify the expected semantics.

[1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/

Thanks a lot to Andy for suggestions to improve code.

Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Aneesh Kumar K.V
9a4e9f3b2d mm: update get_user_pages_longterm to migrate pages allocated from CMA region
This patch updates get_user_pages_longterm to migrate pages allocated
out of CMA region.  This makes sure that we don't keep non-movable pages
(due to page reference count) in the CMA area.

This will be used by ppc64 in a later patch to avoid pinning pages in
the CMA region.  ppc64 uses CMA region for allocation of the hardware
page table (hash page table) and not able to migrate pages out of CMA
region results in page table allocation failures.

One case where we hit this easy is when a guest using a VFIO passthrough
device.  VFIO locks all the guest's memory and if the guest memory is
backed by CMA region, it becomes unmovable resulting in fragmenting the
CMA and possibly preventing other guests from allocation a large enough
hash page table.

NOTE: We allocate the new page without using __GFP_THISNODE

Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Vineeth Remanan Pillai
b56a2d8af9 mm: rid swapoff of quadratic complexity
This patch was initially posted by Kelley Nielsen.  Reposting the patch
with all review comments addressed and with minor modifications and
optimizations.  Also, folding in the fixes offered by Hugh Dickins and
Huang Ying.  Tests were rerun and commit message updated with new
results.

try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the
page tables for all the processes in the system for each one.

This new proposed implementation of try_to_unuse simplifies its
complexity to linear.  It iterates over the system's mms once, unusing
all the affected entries as it walks each set of page tables.  It also
makes similar changes to shmem_unuse.

Improvement

swapoff was called on a swap partition containing about 6G of data, in a
VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.

Present implementation....about 1200M calls(8min, avg 80% cpu util).
Prototype.................about  9.0K calls(3min, avg 5% cpu util).

Details

In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type.  In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of
each swap entry in an array for passing to shmem_swapin_page() outside
of the RCU critical section.

In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass.  Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse().  After the walk, check the type for orphaned swap
entries with find_next_to_unuse(), and remove them from the swap cache.
If find_next_to_unuse() starts over at the beginning of the type, repeat
the check of the shmem_swaplist and the walk a maximum of three times.

Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration.
In unuse_pte_range(), make a swap entry from each pte in the range using
the passed in type.  If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().

Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of
pages has been swapped back in.

Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Vineeth Remanan Pillai
c5bf121e43 mm: refactor swap-in logic out of shmem_getpage_gfp
swapin logic can be reused independently without rest of the logic in
shmem_getpage_gfp.  So lets refactor it out as an independent function.

Link: http://lkml.kernel.org/r/20190114153129.4852-1-vpillai@digitalocean.com
Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kelley Nielsen <kelleynnn@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Kirill Tkhai
a9e7c39fa9 mm/vmscan.c: remove 7th argument of isolate_lru_pages()
We may simply check for sc->may_unmap in isolate_lru_pages() instead of
doing that in both of its callers.

Link: http://lkml.kernel.org/r/154748280735.29962.15867846875217618569.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Vlastimil Babka
2e25644e8d mm, mempolicy: fix uninit memory access
Syzbot with KMSAN reports (excerpt):

==================================================================
BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x173/0x1d0 lib/dump_stack.c:113
  kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
  __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
  mpol_rebind_policy mm/mempolicy.c:353 [inline]
  mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
  update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
  update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
  update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
  cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728

...

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
  kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
  kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
  kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
  mpol_new mm/mempolicy.c:276 [inline]
  do_mbind mm/mempolicy.c:1180 [inline]
  kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
  __do_sys_mbind mm/mempolicy.c:1354 [inline]

As it's difficult to report where exactly the uninit value resides in
the mempolicy object, we have to guess a bit.  mm/mempolicy.c:353
contains this part of mpol_rebind_policy():

        if (!mpol_store_user_nodemask(pol) &&
            nodes_equal(pol->w.cpuset_mems_allowed, *newmask))

"mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
ever see being uninitialized after leaving mpol_new().  So I'll guess
it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
but still part of statement starting on line 353.

For w.cpuset_mems_allowed to be not initialized, and the nodes_equal()
reachable for a mempolicy where mpol_set_nodemask() is called in
do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy
with empty set of nodes, i.e.  MPOL_LOCAL equivalent, with MPOL_F_LOCAL
flag.  Let's exclude such policies from the nodes_equal() check.  Note
the uninit access should be benign anyway, as rebinding this kind of
policy is always a no-op.  Therefore no actual need for stable
inclusion.

Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Tetsuo Handa
7775face20 memcg: killed threads should not invoke memcg OOM killer
If a memory cgroup contains a single process with many threads
(including different process group sharing the mm) then it is possible
to trigger a race when the oom killer complains that there are no oom
elible tasks and complain into the log which is both annoying and
confusing because there is no actual problem.  The race looks as
follows:

P1				oom_reaper		P2
try_charge						try_charge
  mem_cgroup_out_of_memory
    mutex_lock(oom_lock)
      out_of_memory
        oom_kill_process(P1,P2)
         wake_oom_reaper
    mutex_unlock(oom_lock)
    				oom_reap_task
							  mutex_lock(oom_lock)
							    select_bad_process # no victim

The problem is more visible with many threads.

Fix this by checking for fatal_signal_pending from
mem_cgroup_out_of_memory when the oom_lock is already held.

The oom bypass is safe because we do the same early in the try_charge
path already.  The situation migh have changed in the mean time.  It
should be safe to check for fatal_signal_pending and tsk_is_oom_victim
but for a better code readability abstract the current charge bypass
condition into should_force_charge and reuse it from that path.  "

Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Mike Rapoport
23a7052a5d mm/page_alloc.c: check return value of memblock_alloc_node_nopanic()
There are two early memory allocations that use
memblock_alloc_node_nopanic() and do not check its return value.

While this happens very early during boot and chances that the
allocation will fail are diminishing, it is still worth to have proper
checks for the allocation errors.

Link: http://lkml.kernel.org/r/1547734941-944-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Aneesh Kumar K.V
023bdd0023 mm/hugetlb: add prot_modify_start/commit sequence for hugetlb update
Architectures like ppc64 require to do a conditional tlb flush based on
the old and new value of pte.  Follow the regular pte change protection
sequence for hugetlb too.  This allows the architectures to override the
update sequence.

Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Aneesh Kumar K.V
04a8645304 mm: update ptep_modify_prot_commit to take old pte value as arg
Architectures like ppc64 require to do a conditional tlb flush based on
the old and new value of pte.  Enable that by passing old pte value as
the arg.

Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Aneesh Kumar K.V
0cbe3e26ab mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg
Patch series "NestMMU pte upgrade workaround for mprotect", v5.

We can upgrade pte access (R -> RW transition) via mprotect.  We need to
make sure we follow the recommended pte update sequence as outlined in
commit bd5050e38a ("powerpc/mm/radix: Change pte relax sequence to
handle nest MMU hang") for such updates.  This patch series does that.

This patch (of 5):

Some architectures may want to call flush_tlb_range from these helpers.

Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Wei Yang
8bb4e7a2ee mm: fix some typos in mm directory
No functional change.

Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Changbin Du
8aa49762db mm/page_owner: move config option to mm/Kconfig.debug
Move the PAGE_OWNER option from submenu "Compile-time checks and
compiler options" to dedicated submenu "Memory Debugging".

Link: http://lkml.kernel.org/r/20190120024254.6270-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Yang Fan
43cca0b1c5 mm/mmap.c: remove some redundancy in arch_get_unmapped_area_topdown()
The variable 'addr' is redundant in arch_get_unmapped_area_topdown(),
just use parameter 'addr0' directly.  Then remove the const qualifier of
the parameter, and change its name to 'addr'.

And in according with other functions, remove the const qualifier of all
other no-pointer parameters in function arch_get_unmapped_area_topdown().

Link: http://lkml.kernel.org/r/20190127041112.25599-1-nullptr.cpp@gmail.com
Signed-off-by: Yang Fan <nullptr.cpp@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Shakeel Butt
bbbe480297 mm, oom: remove 'prefer children over parent' heuristic
Since the start of the git history of Linux, the kernel after selecting
the worst process to be oom-killed, prefer to kill its child (if the
child does not share mm with the parent).  Later it was changed to
prefer to kill a child who is worst.  If the parent is still the worst
then the parent will be killed.

This heuristic assumes that the children did less work than their parent
and by killing one of them, the work lost will be less.  However this is
very workload dependent.  If there is a workload which can benefit from
this heuristic, can use oom_score_adj to prefer children to be killed
before the parent.

The select_bad_process() has already selected the worst process in the
system/memcg.  There is no need to recheck the badness of its children
and hoping to find a worse candidate.  That's a lot of unneeded racy
work.  Also the heuristic is dangerous because it make fork bomb like
workloads to recover much later because we constantly pick and kill
processes which are not memory hogs.  So, let's remove this whole
heuristic.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Greg Kroah-Hartman
d9f7979c92 mm: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Laura Abbott <labbott@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Matthew Wilcox
0ee930e6ca mm/memory.c: prevent mapping typed pages to userspace
Pages which use page_type must never be mapped to userspace as it would
destroy their page type.  Add an explicit check for this instead of
assuming that kernel drivers always get this right.

Link: http://lkml.kernel.org/r/20190129053830.3749-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Matthew Wilcox
2d432cb709 mm: prevent mapping slab pages to userspace
It's never appropriate to map a page allocated by SLAB into userspace.
A buggy device driver might try this, or an attacker might be able to
find a way to make it happen.

Christoph said:

: Let's just fail the code.  Currently this may work with SLUB.  But SLAB
: and SLOB overlay fields with mapcount.  So you would have a corrupted page
: struct if you mapped a slab page to user space.

Link: http://lkml.kernel.org/r/20190125173827.2658-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Uladzislau Rezki (Sony)
afd07389d3 mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
One of the vmalloc stress test case triggers the kernel BUG():

  <snip>
  [60.562151] ------------[ cut here ]------------
  [60.562154] kernel BUG at mm/vmalloc.c:512!
  [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
  [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
  [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
  [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390
  <snip>

it can happen due to big align request resulting in overflowing of
calculated address, i.e.  it becomes 0 after ALIGN()'s fixup.

Fix it by checking if calculated address is within vstart/vend range.

Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Chris Down
677dc9731b mm, memcg: extract memcg maxable seq_file logic to seq_show_memcg_tunable
memcg has a significant number of files exposed to kernfs where their
value is either exposed directly or is "max" in the case of
PAGE_COUNTER_MAX.

This patch makes this generic by providing a single function to do this
work.  In combination with the previous patch adding
mem_cgroup_from_seq, this makes all of the seq_show feeder functions
significantly more simple.

Link: http://lkml.kernel.org/r/20190124194100.GA31425@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Chris Down
aa9694bb78 mm, memcg: create mem_cgroup_from_seq
This is the start of a series of patches similar to my earlier
DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm).

There are a bunch of places we go from seq_file to mem_cgroup, which
currently requires manually getting the css, then getting the mem_cgroup
from the css.  It's in enough places now that having mem_cgroup_from_seq
makes sense (and also makes the next patch a bit nicer).

Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
5e1f0f098b mm, compaction: capture a page under direct compaction
Compaction is inherently race-prone as a suitable page freed during
compaction can be allocated by any parallel task.  This patch uses a
capture_control structure to isolate a page immediately when it is freed
by a direct compactor in the slow path of the page allocator.  The
intent is to avoid redundant scanning.

                                     5.0.0-rc1              5.0.0-rc1
                               selective-v3r17          capture-v3r19
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)

Latency is only moderately affected but the devil is in the details.  A
closer examination indicates that base page fault latency is reduced but
latency of huge pages is increased as it takes creater care to succeed.
Part of the "problem" is that allocation success rates are close to 100%
even when under pressure and compaction gets harder

                                5.0.0-rc1              5.0.0-rc1
                          selective-v3r17          capture-v3r19
Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)

And scan rates are reduced as expected by 6% for the migration scanner
and 29% for the free scanner indicating that there is less redundant
work.

Compaction migrate scanned    20815362    19573286
Compaction free scanned       16352612    11510663

[mgorman@techsingularity.net: remove redundant check]
  Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
e332f741a8 mm, compaction: be selective about what pageblocks to clear skip hints
Pageblock hints are cleared when compaction restarts or kswapd makes
enough progress that it can sleep but it's over-eager in that the bit is
cleared for migration sources with no LRU pages and migration targets
with no free pages.  As pageblock skip hint flushes are relatively rare
and out-of-band with respect to kswapd, this patch makes a few more
expensive checks to see if it's appropriate to even clear the bit.
Every pageblock that is not cleared will avoid 512 pages being scanned
unnecessarily on x86-64.

The impact is variable with different workloads showing small
differences in latency, success rates and scan rates.  This is expected
as clearing the hints is not that common but doing a small amount of
work out-of-band to avoid a large amount of work in-band later is
generally a good thing.

Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
[cai@lca.pw: no stuck in __reset_isolation_pfn()]
  Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
4fca9730c5 mm, compaction: sample pageblocks for free pages
Once fast searching finishes, there is a possibility that the linear
scanner is scanning full blocks found by the fast scanner earlier.  This
patch uses an adaptive stride to sample pageblocks for free pages.  The
more consecutive full pageblocks encountered, the larger the stride
until a pageblock with free pages is found.  The scanners might meet
slightly sooner but it is an acceptable risk given that the search of
the free lists may still encounter the pages and adjust the cached PFN
of the free scanner accordingly.

                                     5.0.0-rc1              5.0.0-rc1
                              roundrobin-v3r17       samplefree-v3r17
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      2752.37 (   0.00%)     2729.95 (   0.81%)
Amean     fault-both-5      4341.69 (   0.00%)     4397.80 (  -1.29%)
Amean     fault-both-7      6308.75 (   0.00%)     6097.61 (   3.35%)
Amean     fault-both-12    10241.81 (   0.00%)     9407.15 (   8.15%)
Amean     fault-both-18    13736.09 (   0.00%)    10857.63 *  20.96%*
Amean     fault-both-24    16853.95 (   0.00%)    13323.24 *  20.95%*
Amean     fault-both-30    15862.61 (   0.00%)    17345.44 (  -9.35%)
Amean     fault-both-32    18450.85 (   0.00%)    16892.00 (   8.45%)

The latency is mildly improved offseting some overhead from earlier
patches that are prerequisites for the rest of the series.  However, a
major impact is on the free scan rate with an 82% reduction.

                                5.0.0-rc1      5.0.0-rc1
                         roundrobin-v3r17 samplefree-v3r17
Compaction migrate scanned    21607271            20116887
Compaction free scanned       95336406            16668703

It's also the first time in the series where the number of pages scanned
by the migration scanner is greater than the free scanner due to the
increased search efficiency.

Link: http://lkml.kernel.org/r/20190118175136.31341-21-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
dbe2d4e4f1 mm, compaction: round-robin the order while searching the free lists for a target
As compaction proceeds and creates high-order blocks, the free list
search gets less efficient as the larger blocks are used as compaction
targets.  Eventually, the larger blocks will be behind the migration
scanner for partially migrated pageblocks and the search fails.  This
patch round-robins what orders are searched so that larger blocks can be
ignored and find smaller blocks that can be used as migration targets.

The overall impact was small on 1-socket but it avoids corner cases
where the migration/free scanners meet prematurely or situations where
many of the pageblocks encountered by the free scanner are almost full
instead of being properly packed.  Previous testing had indicated that
without this patch there were occasional large spikes in the free
scanner without this patch.

[dan.carpenter@oracle.com: fix static checker warning]
Link: http://lkml.kernel.org/r/20190118175136.31341-20-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
d097a6f635 mm, compaction: reduce premature advancement of the migration target scanner
The fast isolation of free pages allows the cached PFN of the free
scanner to advance faster than necessary depending on the contents of
the free list.  The key is that fast_isolate_freepages() can update
zone->compact_cached_free_pfn via isolate_freepages_block().  When the
fast search fails, the linear scan can start from a point that has
skipped valid migration targets, particularly pageblocks with just
low-order free pages.  This can cause the migration source/target
scanners to meet prematurely causing a reset.

This patch starts by avoiding an update of the pageblock skip
information and cached PFN from isolate_freepages_block() and puts the
responsibility of updating that information in the callers.  The fast
scanner will update the cached PFN if and only if it finds a block that
is higher than the existing cached PFN and sets the skip if the
pageblock is full or nearly full.  The linear scanner will update
skipped information and the cached PFN only when a block is completely
scanned.  The total impact is that the free scanner advances more slowly
as it is primarily driven by the linear scanner instead of the fast
search.

                                     5.0.0-rc1              5.0.0-rc1
                               noresched-v3r17         slowfree-v3r17
Amean     fault-both-3      2965.68 (   0.00%)     3036.75 (  -2.40%)
Amean     fault-both-5      3995.90 (   0.00%)     4522.24 * -13.17%*
Amean     fault-both-7      5842.12 (   0.00%)     6365.35 (  -8.96%)
Amean     fault-both-12     9550.87 (   0.00%)    10340.93 (  -8.27%)
Amean     fault-both-18    13304.72 (   0.00%)    14732.46 ( -10.73%)
Amean     fault-both-24    14618.59 (   0.00%)    16288.96 ( -11.43%)
Amean     fault-both-30    16650.96 (   0.00%)    16346.21 (   1.83%)
Amean     fault-both-32    17145.15 (   0.00%)    19317.49 ( -12.67%)

The impact to latency is higher than the last version but it appears to
be due to a slight increase in the free scan rates which is a potential
side-effect of the patch.  However, this is necessary for later patches
that are more careful about how pageblocks are treated as earlier
iterations of those patches hit corner cases where the restarts were
punishing and very visible.

Link: http://lkml.kernel.org/r/20190118175136.31341-19-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
cf66f0700c mm, compaction: do not consider a need to reschedule as contention
Scanning on large machines can take a considerable length of time and
eventually need to be rescheduled.  This is treated as an abort event
but that's not appropriate as the attempt is likely to be retried after
making numerous checks and taking another cycle through the page
allocator.  This patch will check the need to reschedule if necessary
but continue the scanning.

The main benefit is reduced scanning when compaction is taking a long
time or the machine is over-saturated.  It also avoids an unnecessary
exit of compaction that ends up being retried by the page allocator in
the outer loop.

                                     5.0.0-rc1              5.0.0-rc1
                              synccached-v3r16        noresched-v3r17
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      2958.27 (   0.00%)     2965.68 (  -0.25%)
Amean     fault-both-5      4091.90 (   0.00%)     3995.90 (   2.35%)
Amean     fault-both-7      5803.05 (   0.00%)     5842.12 (  -0.67%)
Amean     fault-both-12     9481.06 (   0.00%)     9550.87 (  -0.74%)
Amean     fault-both-18    14141.51 (   0.00%)    13304.72 (   5.92%)
Amean     fault-both-24    16438.00 (   0.00%)    14618.59 (  11.07%)
Amean     fault-both-30    17531.72 (   0.00%)    16650.96 (   5.02%)
Amean     fault-both-32    17101.96 (   0.00%)    17145.15 (  -0.25%)

Link: http://lkml.kernel.org/r/20190118175136.31341-18-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
cb810ad294 mm, compaction: rework compact_should_abort as compact_check_resched
With incremental changes, compact_should_abort no longer makes any
documented sense.  Rename to compact_check_resched and update the
associated comments.  There is no benefit other than reducing redundant
code and making the intent slightly clearer.  It could potentially be
merged with earlier patches but it just makes the review slightly
harder.

Link: http://lkml.kernel.org/r/20190118175136.31341-17-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
8854c55f54 mm, compaction: keep cached migration PFNs synced for unusable pageblocks
Migrate has separate cached PFNs for ASYNC and SYNC* migration on the
basis that some migrations will fail in ASYNC mode.  However, if the
cached PFNs match at the start of scanning and pageblocks are skipped
due to having no isolation candidates, then the sync state does not
matter.  This patch keeps matching cached PFNs in sync until a pageblock
with isolation candidates is found.

The actual benefit is marginal given that the sync scanner following the
async scanner will often skip a number of pageblocks but it's useless
work.  Any benefit depends heavily on whether the scanners restarted
recently.

Link: http://lkml.kernel.org/r/20190118175136.31341-16-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Mel Gorman
9bebefd590 mm, compaction: check early for huge pages encountered by the migration scanner
When scanning for sources or targets, PageCompound is checked for huge
pages as they can be skipped quickly but it happens relatively late
after a lot of setup and checking.  This patch short-cuts the check to
make it earlier.  It might still change when the lock is acquired but
this has less overhead overall.  The free scanner advances but the
migration scanner does not.  Typically the free scanner encounters more
movable blocks that change state over the lifetime of the system and
also tends to scan more aggressively as it's actively filling its
portion of the physical address space with data.  This could change in
the future but for the moment, this worked better in practice and
incurred fewer scan restarts.

The impact on latency and allocation success rates is marginal but the
free scan rates are reduced by 15% and system CPU usage is reduced by
3.3%.  The 2-socket results are not materially different.

Link: http://lkml.kernel.org/r/20190118175136.31341-15-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
cb2dcaf023 mm, compaction: finish pageblock scanning on contention
Async migration aborts on spinlock contention but contention can be high
when there are multiple compaction attempts and kswapd is active.  The
consequence is that the migration scanners move forward uselessly while
still contending on locks for longer while leaving suitable migration
sources behind.

This patch will acquire the lock but track when contention occurs.  When
it does, the current pageblock will finish as compaction may succeed for
that block and then abort.  This will have a variable impact on latency
as in some cases useless scanning is avoided (reduces latency) but a
lock will be contended (increase latency) or a single contended
pageblock is scanned that would otherwise have been skipped (increase
latency).

                                     5.0.0-rc1              5.0.0-rc1
                                norescan-v3r16    finishcontend-v3r16
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      3002.07 (   0.00%)     3153.17 (  -5.03%)
Amean     fault-both-5      4684.47 (   0.00%)     4280.52 (   8.62%)
Amean     fault-both-7      6815.54 (   0.00%)     5811.50 *  14.73%*
Amean     fault-both-12    10864.02 (   0.00%)     9276.85 (  14.61%)
Amean     fault-both-18    12247.52 (   0.00%)    11032.67 (   9.92%)
Amean     fault-both-24    15683.99 (   0.00%)    14285.70 (   8.92%)
Amean     fault-both-30    18620.02 (   0.00%)    16293.76 *  12.49%*
Amean     fault-both-32    19250.28 (   0.00%)    16721.02 *  13.14%*

                                5.0.0-rc1              5.0.0-rc1
                           norescan-v3r16    finishcontend-v3r16
Percentage huge-1         0.00 (   0.00%)        0.00 (   0.00%)
Percentage huge-3        95.00 (   0.00%)       96.82 (   1.92%)
Percentage huge-5        94.22 (   0.00%)       95.40 (   1.26%)
Percentage huge-7        92.35 (   0.00%)       95.92 (   3.86%)
Percentage huge-12       91.90 (   0.00%)       96.73 (   5.25%)
Percentage huge-18       89.58 (   0.00%)       96.77 (   8.03%)
Percentage huge-24       90.03 (   0.00%)       96.05 (   6.69%)
Percentage huge-30       89.14 (   0.00%)       96.81 (   8.60%)
Percentage huge-32       90.58 (   0.00%)       97.41 (   7.54%)

There is a variable impact that is mostly good on latency while allocation
success rates are slightly higher.  System CPU usage is reduced by about
10% but scan rate impact is mixed

Compaction migrate scanned    27997659.00    20148867
Compaction free scanned      120782791.00   118324914

Migration scan rates are reduced 28% which is expected as a pageblock is
used by the async scanner instead of skipped.  The impact on the free
scanner is known to be variable.  Overall the primary justification for
this patch is that completing scanning of a pageblock is very important
for later patches.

[yuehaibing@huawei.com: fix unused variable warning]
Link: http://lkml.kernel.org/r/20190118175136.31341-14-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: YueHaibing <yuehaibing@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
804d3121ba mm, compaction: avoid rescanning the same pageblock multiple times
Pageblocks are marked for skip when no pages are isolated after a scan.
However, it's possible to hit corner cases where the migration scanner
gets stuck near the boundary between the source and target scanner.  Due
to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are
migrated can be reallocated before the pageblock is complete.  The
pageblock is not necessarily skipped so it can be rescanned multiple
times.  Similarly, a pageblock with some dirty/writeback pages may fail
to migrate and be rescanned until writeback completes which is wasteful.

This patch tracks if a pageblock is being rescanned.  If so, then the
entire pageblock will be migrated as one operation.  This narrows the
race window during which pages can be reallocated during migration.
Secondly, if there are pages that cannot be isolated then the pageblock
will still be fully scanned and marked for skipping.  On the second
rescan, the pageblock skip is set and the migration scanner makes
progress.

                                     5.0.0-rc1              5.0.0-rc1
                                findfree-v3r16         norescan-v3r16
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      3200.68 (   0.00%)     3002.07 (   6.21%)
Amean     fault-both-5      4847.75 (   0.00%)     4684.47 (   3.37%)
Amean     fault-both-7      6658.92 (   0.00%)     6815.54 (  -2.35%)
Amean     fault-both-12    11077.62 (   0.00%)    10864.02 (   1.93%)
Amean     fault-both-18    12403.97 (   0.00%)    12247.52 (   1.26%)
Amean     fault-both-24    15607.10 (   0.00%)    15683.99 (  -0.49%)
Amean     fault-both-30    18752.27 (   0.00%)    18620.02 (   0.71%)
Amean     fault-both-32    21207.54 (   0.00%)    19250.28 *   9.23%*

                                5.0.0-rc1              5.0.0-rc1
                           findfree-v3r16         norescan-v3r16
Percentage huge-3        96.86 (   0.00%)       95.00 (  -1.91%)
Percentage huge-5        93.72 (   0.00%)       94.22 (   0.53%)
Percentage huge-7        94.31 (   0.00%)       92.35 (  -2.08%)
Percentage huge-12       92.66 (   0.00%)       91.90 (  -0.82%)
Percentage huge-18       91.51 (   0.00%)       89.58 (  -2.11%)
Percentage huge-24       90.50 (   0.00%)       90.03 (  -0.52%)
Percentage huge-30       91.57 (   0.00%)       89.14 (  -2.65%)
Percentage huge-32       91.00 (   0.00%)       90.58 (  -0.46%)

Negligible difference but this was likely a case when the specific
corner case was not hit.  A previous run of the same patch based on an
earlier iteration of the series showed large differences where migration
rates could be halved when the corner case was hit.

The specific corner case where migration scan rates go through the roof
was due to a dirty/writeback pageblock located at the boundary of the
migration/free scanner did not happen in this case.  When it does
happen, the scan rates multipled by massive margins.

Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
5a811889de mm, compaction: use free lists to quickly locate a migration target
Similar to the migration scanner, this patch uses the free lists to
quickly locate a migration target.  The search is different in that
lower orders will be searched for a suitable high PFN if necessary but
the search is still bound.  This is justified on the grounds that the
free scanner typically scans linearly much more than the migration
scanner.

If a free page is found, it is isolated and compaction continues if
enough pages were isolated.  For SYNC* scanning, the full pageblock is
scanned for any remaining free pages so that is can be marked for
skipping in the near future.

1-socket thpfioscale
                                     5.0.0-rc1              5.0.0-rc1
                                 isolmig-v3r15         findfree-v3r16
Amean     fault-both-3      3024.41 (   0.00%)     3200.68 (  -5.83%)
Amean     fault-both-5      4749.30 (   0.00%)     4847.75 (  -2.07%)
Amean     fault-both-7      6454.95 (   0.00%)     6658.92 (  -3.16%)
Amean     fault-both-12    10324.83 (   0.00%)    11077.62 (  -7.29%)
Amean     fault-both-18    12896.82 (   0.00%)    12403.97 (   3.82%)
Amean     fault-both-24    13470.60 (   0.00%)    15607.10 * -15.86%*
Amean     fault-both-30    17143.99 (   0.00%)    18752.27 (  -9.38%)
Amean     fault-both-32    17743.91 (   0.00%)    21207.54 * -19.52%*

The impact on latency is variable but the search is optimistic and
sensitive to the exact system state.  Success rates are similar but the
major impact is to the rate of scanning

                                5.0.0-rc1      5.0.0-rc1
                            isolmig-v3r15 findfree-v3r16
Compaction migrate scanned    25646769          29507205
Compaction free scanned      201558184         100359571

The free scan rates are reduced by 50%.  The 2-socket reductions for the
free scanner are more dramatic which is a likely reflection that the
machine has more memory.

[dan.carpenter@oracle.com: fix static checker warning]
[vbabka@suse.cz: correct number of pages scanned for lower orders]
Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
e380bebe47 mm, compaction: keep migration source private to a single compaction instance
Due to either a fast search of the free list or a linear scan, it is
possible for multiple compaction instances to pick the same pageblock
for migration.  This is lucky for one scanner and increased scanning for
all the others.  It also allows a race between requests on which first
allocates the resulting free block.

This patch tests and updates the pageblock skip for the migration
scanner carefully.  When isolating a block, it will check and skip if
the block is already in use.  Once the zone lock is acquired, it will be
rechecked so that only one scanner can set the pageblock skip for
exclusive use.  Any scanner contending will continue with a linear scan.
The skip bit is still set if no pages can be isolated in a range.  While
this may result in redundant scanning, it avoids unnecessarily acquiring
the zone lock when there are no suitable migration sources.

1-socket thpscale
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      3390.40 (   0.00%)     3024.41 (  10.80%)
Amean     fault-both-5      5082.28 (   0.00%)     4749.30 (   6.55%)
Amean     fault-both-7      7012.51 (   0.00%)     6454.95 (   7.95%)
Amean     fault-both-12    11346.63 (   0.00%)    10324.83 (   9.01%)
Amean     fault-both-18    15324.19 (   0.00%)    12896.82 *  15.84%*
Amean     fault-both-24    16088.50 (   0.00%)    13470.60 *  16.27%*
Amean     fault-both-30    18723.42 (   0.00%)    17143.99 (   8.44%)
Amean     fault-both-32    18612.01 (   0.00%)    17743.91 (   4.66%)

                                5.0.0-rc1              5.0.0-rc1
                            findmig-v3r15          isolmig-v3r15
Percentage huge-3        89.83 (   0.00%)       92.96 (   3.48%)
Percentage huge-5        91.96 (   0.00%)       93.26 (   1.41%)
Percentage huge-7        92.85 (   0.00%)       93.63 (   0.84%)
Percentage huge-12       92.74 (   0.00%)       92.80 (   0.07%)
Percentage huge-18       91.71 (   0.00%)       91.62 (  -0.10%)
Percentage huge-24       92.13 (   0.00%)       91.50 (  -0.69%)
Percentage huge-30       93.79 (   0.00%)       92.73 (  -1.13%)
Percentage huge-32       91.27 (   0.00%)       91.94 (   0.74%)

This shows a reasonable reduction in latency as multiple compaction
scanners do not operate on the same blocks with a similar allocation
success rate.

Compaction migrate scanned    41093126    25646769

Migration scan rates are reduced by 38%.

Link: http://lkml.kernel.org/r/20190118175136.31341-11-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
70b44595ea mm, compaction: use free lists to quickly locate a migration source
The migration scanner is a linear scan of a zone with a potentiall large
search space.  Furthermore, many pageblocks are unusable such as those
filled with reserved pages or partially filled with pages that cannot
migrate.  These still get scanned in the common case of allocating a THP
and the cost accumulates.

The patch uses a partial search of the free lists to locate a migration
source candidate that is marked as MOVABLE when allocating a THP.  It
prefers picking a block with a larger number of free pages already on
the basis that there are fewer pages to migrate to free the entire
block.  The lowest PFN found during searches is tracked as the basis of
the start for the linear search after the first search of the free list
fails.  After the search, the free list is shuffled so that the next
search will not encounter the same page.  If the search fails then the
subsequent searches will be shorter and the linear scanner is used.

If this search fails, or if the request is for a small or
unmovable/reclaimable allocation then the linear scanner is still used.
It is somewhat pointless to use the list search in those cases.  Small
free pages must be used for the search and there is no guarantee that
movable pages are located within that block that are contiguous.

                                     5.0.0-rc1              5.0.0-rc1
                                 noboost-v3r10          findmig-v3r15
Amean     fault-both-3      3771.41 (   0.00%)     3390.40 (  10.10%)
Amean     fault-both-5      5409.05 (   0.00%)     5082.28 (   6.04%)
Amean     fault-both-7      7040.74 (   0.00%)     7012.51 (   0.40%)
Amean     fault-both-12    11887.35 (   0.00%)    11346.63 (   4.55%)
Amean     fault-both-18    16718.19 (   0.00%)    15324.19 (   8.34%)
Amean     fault-both-24    21157.19 (   0.00%)    16088.50 *  23.96%*
Amean     fault-both-30    21175.92 (   0.00%)    18723.42 *  11.58%*
Amean     fault-both-32    21339.03 (   0.00%)    18612.01 *  12.78%*

                                5.0.0-rc1              5.0.0-rc1
                            noboost-v3r10          findmig-v3r15
Percentage huge-3        86.50 (   0.00%)       89.83 (   3.85%)
Percentage huge-5        92.52 (   0.00%)       91.96 (  -0.61%)
Percentage huge-7        92.44 (   0.00%)       92.85 (   0.44%)
Percentage huge-12       92.98 (   0.00%)       92.74 (  -0.25%)
Percentage huge-18       91.70 (   0.00%)       91.71 (   0.02%)
Percentage huge-24       91.59 (   0.00%)       92.13 (   0.60%)
Percentage huge-30       90.14 (   0.00%)       93.79 (   4.04%)
Percentage huge-32       90.03 (   0.00%)       91.27 (   1.37%)

This shows an improvement in allocation latencies with similar
allocation success rates.  While not presented, there was a 31%
reduction in migration scanning and a 8% reduction on system CPU usage.
A 2-socket machine showed similar benefits.

[mgorman@techsingularity.net: several fixes]
  Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net
[vbabka@suse.cz: migrate block that was found-fast, some optimisations]
Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <Vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
fd1444b272 mm, compaction: ignore the fragmentation avoidance boost for isolation and compaction
When pageblocks get fragmented, watermarks are artifically boosted to
reclaim pages to avoid further fragmentation events.  However,
compaction is often either fragmentation-neutral or moving movable pages
away from unmovable/reclaimable pages.  As the true watermarks are
preserved, allow compaction to ignore the boost factor.

The expected impact is very slight as the main benefit is that
compaction is slightly more likely to succeed when the system has been
fragmented very recently.  On both 1-socket and 2-socket machines for
THP-intensive allocation during fragmentation the success rate was
increased by less than 1% which is marginal.  However, detailed tracing
indicated that failure of migration due to a premature ENOMEM triggered
by watermark checks were eliminated.

Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
efe771c760 mm, compaction: always finish scanning of a full pageblock
When compaction is finishing, it uses a flag to ensure the pageblock is
complete but it makes sense to always complete migration of a pageblock.
Minimally, skip information is based on a pageblock and partially
scanned pageblocks may incur more scanning in the future.  The pageblock
skip handling also becomes more strict later in the series and the hint
is more useful if a complete pageblock was always scanned.

The potentially impacts latency as more scanning is done but it's not a
consistent win or loss as the scanning is not always a high percentage
of the pageblock and sometimes it is offset by future reductions in
scanning.  Hence, the results are not presented this time due to a
misleading mix of gains/losses without any clear pattern.  However, full
scanning of the pageblock is important for later patches.

Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
806031bb5e mm, migrate: immediately fail migration of a page with no migration handler
Pages with no migration handler use a fallback handler which sometimes
works and sometimes persistently retries.  A historical example was
blockdev pages but there are others such as odd refcounting when
page->private is used.  These are retried multiple times which is
wasteful during compaction so this patch will fail migration faster
unless the caller specifies MIGRATE_SYNC.

This is not expected to help THP allocation success rates but it did
reduce latencies very slightly in some cases.

1-socket thpfioscale
                                        4.20.0                 4.20.0
                              noreserved-v2r15         failfast-v2r15
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      3839.67 (   0.00%)     3833.72 (   0.15%)
Amean     fault-both-5      5177.47 (   0.00%)     4967.15 (   4.06%)
Amean     fault-both-7      7245.03 (   0.00%)     7139.19 (   1.46%)
Amean     fault-both-12    11534.89 (   0.00%)    11326.30 (   1.81%)
Amean     fault-both-18    16241.10 (   0.00%)    16270.70 (  -0.18%)
Amean     fault-both-24    19075.91 (   0.00%)    19839.65 (  -4.00%)
Amean     fault-both-30    22712.11 (   0.00%)    21707.05 (   4.43%)
Amean     fault-both-32    21692.92 (   0.00%)    21968.16 (  -1.27%)

The 2-socket results are not materially different.  Scan rates are
similar as expected.

Link: http://lkml.kernel.org/r/20190118175136.31341-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
4469ab9847 mm, compaction: rename map_pages to split_map_pages
It's non-obvious that high-order free pages are split into order-0 pages
from the function name.  Fix it.

Link: http://lkml.kernel.org/r/20190118175136.31341-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
40cacbcb32 mm, compaction: remove unnecessary zone parameter in some instances
A zone parameter is passed into a number of top-level compaction
functions despite the fact that it's already in compact_control.  This
is harmless but it did need an audit to check if zone actually ever
changes meaningfully.  This patches removes the parameter in a number of
top-level functions.  The change could be much deeper but this was
enough to briefly clarify the flow.

No functional change.

Link: http://lkml.kernel.org/r/20190118175136.31341-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
566e54e113 mm, compaction: remove last_migrated_pfn from compact_control
The last_migrated_pfn field is a bit dubious as to whether it really
helps but either way, the information from it can be inferred without
increasing the size of compact_control so remove the field.

Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
c5943b9c53 mm, compaction: rearrange compact_control
compact_control spans two cache lines with write-intensive lines on
both.  Rearrange so the most write-intensive fields are in the same
cache line.  This has a negligible impact on the overall performance of
compaction and is more a tidying exercise than anything.

Link: http://lkml.kernel.org/r/20190118175136.31341-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Mel Gorman
c5fbd937b6 mm, compaction: shrink compact_control
Patch series "Increase success rates and reduce latency of compaction", v3.

This series reduces scan rates and success rates of compaction,
primarily by using the free lists to shorten scans, better controlling
of skip information and whether multiple scanners can target the same
block and capturing pageblocks before being stolen by parallel requests.
The series is based on mmotm from January 9th, 2019 with the previous
compaction series reverted.

I'm mostly using thpscale to measure the impact of the series.  The
benchmark creates a large file, maps it, faults it, punches holes in the
mapping so that the virtual address space is fragmented and then tries
to allocate THP.  It re-executes for different numbers of threads.  From
a fragmentation perspective, the workload is relatively benign but it
does stress compaction.

The overall impact on latencies for a 1-socket machine is

				      baseline		      patches
Amean     fault-both-3      3832.09 (   0.00%)     2748.56 *  28.28%*
Amean     fault-both-5      4933.06 (   0.00%)     4255.52 (  13.73%)
Amean     fault-both-7      7017.75 (   0.00%)     6586.93 (   6.14%)
Amean     fault-both-12    11610.51 (   0.00%)     9162.34 *  21.09%*
Amean     fault-both-18    17055.85 (   0.00%)    11530.06 *  32.40%*
Amean     fault-both-24    19306.27 (   0.00%)    17956.13 (   6.99%)
Amean     fault-both-30    22516.49 (   0.00%)    15686.47 *  30.33%*
Amean     fault-both-32    23442.93 (   0.00%)    16564.83 *  29.34%*

The allocation success rates are much improved

			 	 baseline		 patches
Percentage huge-3        85.99 (   0.00%)       97.96 (  13.92%)
Percentage huge-5        88.27 (   0.00%)       96.87 (   9.74%)
Percentage huge-7        85.87 (   0.00%)       94.53 (  10.09%)
Percentage huge-12       82.38 (   0.00%)       98.44 (  19.49%)
Percentage huge-18       83.29 (   0.00%)       99.14 (  19.04%)
Percentage huge-24       81.41 (   0.00%)       97.35 (  19.57%)
Percentage huge-30       80.98 (   0.00%)       98.05 (  21.08%)
Percentage huge-32       80.53 (   0.00%)       97.06 (  20.53%)

That's a nearly perfect allocation success rate.

The biggest impact is on the scan rates

Compaction migrate scanned    55893379    19341254
Compaction free scanned      474739990    11903963

The number of pages scanned for migration was reduced by 65% and the
free scanner was reduced by 97.5%.  So much less work in exchange for
lower latency and better success rates.

The series was also evaluated using a workload that heavily fragments
memory but the benefits there are also significant, albeit not
presented.

It was commented that we should be rethinking scanning entirely and to a
large extent I agree.  However, to achieve that you need a lot of this
series in place first so it's best to make the linear scanners as best
as possible before ripping them out.

This patch (of 22):

The isolate and migrate scanners should never isolate more than a
pageblock of pages so unsigned int is sufficient saving 8 bytes on a
64-bit build.

Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
zhengbin
35f12f0f5c mm/filemap: pass inclusive 'end_byte' parameter to filemap_range_has_page
The 'end_byte' parameter of filemap_range_has_page is required to be
inclusive, so follow the rule.

Link: http://lkml.kernel.org/r/1548678679-18122-1-git-send-email-zhengbin13@huawei.com
Fixes: 6be96d3ad3 ("fs: return if direct I/O will trigger writeback")
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: zhangyi (F) <yi.zhang@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Yang Shi
e9f598730e mm: swap: add comment for swap_vma_readahead
swap_vma_readahead()'s comment is missing, just add it.

Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Hugh Dickins <hughd@google.com
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Yang Shi
8fd2e0b505 mm: swap: check if swap backing device is congested or not
Swap readahead would read in a few pages regardless if the underlying
device is busy or not.  It may incur long waiting time if the device is
congested, and it may also exacerbate the congestion.

Use inode_read_congested() to check if the underlying device is busy or
not like what file page readahead does.  Get inode from
swap_info_struct.

Although we can add inode information in swap_address_space
(address_space->host), it may lead some unexpected side effect, i.e.  it
may break mapping_cap_account_dirty().  Using inode from
swap_info_struct seems simple and good enough.

Just does the check in vma_cluster_readahead() since
swap_vma_readahead() is just used for non-rotational device which much
less likely has congestion than traditional HDD.

Although swap slots may be consecutive on swap partition, it still may
be fragmented on swap file.  This check would help to reduce excessive
stall for such case.

The test with page_fault1 of will-it-scale (sometimes tracing may just
show runtest.py that is the wrapper script of page_fault1), which
basically launches NR_CPU threads to generate 128MB anonymous pages for
each thread, on my virtual machine with congested HDD shows long tail
latency is reduced significantly.

Without the patch
 page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
 page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
 page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
 page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
 page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
 page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
 page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
 page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();

With the patch
      runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
      runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
      runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
      runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
      runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
      runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
      runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
      runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();

[akpm@linux-foundation.org: code cleanup]
[yang.shi@linux.alibaba.com: add comment]
  Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Tim Chen <tim.c.chen@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Hugh Dickins <hughd@google.com
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Matthew Wilcox
14ef1fc72a mm/filemap.c: remove redundant test from find_get_pages_contig
After we establish a reference on the page, we check the pointer
continues to be in the correct position in i_pages.  Checking
page->index afterwards is unnecessary; if it were to change, then the
pointer to it from the page cache would also move.  The check used to be
done before grabbing a reference on the page which was racy (see commit
9cbb4cb21b ("mm: find_get_pages_contig fixlet")), but nobody noticed
that moving the check after grabbing the reference was redundant.

Link: http://lkml.kernel.org/r/20190107200224.13260-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:16 -08:00
Gustavo A. R. Silva
67b8046f42 mm/memcontrol.c: use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array.  For example:

  struct foo {
      int stuff;
      void *entry[];
  };

  instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

  instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Link: http://lkml.kernel.org/r/20190104183726.GA6374@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Wei Yang
c52e75935f mm: remove extra drain pages on pcp list
In the current implementation, there are two places to isolate a range
of page: __offline_pages() and alloc_contig_range().  During this
procedure, it will drain pages on pcp list.

Below is a brief call flow:

  __offline_pages()/alloc_contig_range()
      start_isolate_page_range()
          set_migratetype_isolate()
              drain_all_pages()
      drain_all_pages()                 <--- A

This snippet shows the current logic is isolate and drain pcp list for
each pageblock and drain pcp list again for the whole range.

start_isolate_page_range is responsible for isolating the given pfn
range.  One part of that job is to make sure that also pages that are on
the allocator pcp lists are properly isolated.  Otherwise they could be
reused and the range wouldn't be completely isolated until the memory is
freed back.  While there is no strict guarantee here because pages might
get allocated at any time before drain_all_pages is called there doesn't
seem to be any strong demand for such a guarantee.

In any case, draining is already done at the isolation level and there
is no need to do it again later by start_isolate_page_range callers
(memory hotplug and CMA allocator currently).  Therefore remove
pointless draining in existing callers to make the code more clear and
functionally correct.

[mhocko@suse.com: provide a clearer changelog for the last two paragraphs]
Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Anshuman Khandual
7ed2c31dab mm/hugetlb: distinguish between migratability and movability
Patch series "arm64/mm: Enable HugeTLB migration", v4.

This patch series enables HugeTLB migration support for all supported
huge page sizes at all levels including contiguous bit implementation.
Following HugeTLB migration support matrix has been enabled with this
patch series.  All permutations have been tested except for the 16GB.

           CONT PTE    PMD    CONT PMD    PUD
           --------    ---    --------    ---
  4K:         64K     2M         32M     1G
  16K:         2M    32M          1G
  64K:         2M   512M         16G

First the series adds migration support for PUD based huge pages.  It
then adds a platform specific hook to query an architecture if a given
huge page size is supported for migration while also providing a default
fallback option preserving the existing semantics which just checks for
(PMD|PUD|PGDIR)_SHIFT macros.  The last two patches enables HugeTLB
migration on arm64 and subscribe to this new platform specific hook by
defining an override.

The second patch differentiates between movability and migratability
aspects of huge pages and implements hugepage_movable_supported() which
can then be used during allocation to decide whether to place the huge
page in movable zone or not.

This patch (of 5):

During huge page allocation it's migratability is checked to determine
if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE.
But the movability aspect of the huge page could depend on other factors
than just migratability.  Movability in itself is a distinct property
which should not be tied with migratability alone.

This differentiates these two and implements an enhanced movability check
which also considers huge page size to determine if it is feasible to be
placed under a movable zone.  At present it just checks for gigantic pages
but going forward it can incorporate other enhanced checks.

Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Matthew Wilcox
6b7e5cad65 mm: remove sysctl_extfrag_handler()
sysctl_extfrag_handler() neglects to propagate the return value from
proc_dointvec_minmax() to its caller.  It's a wrapper that doesn't need
to exist, so just use proc_dointvec_minmax() directly.

Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Aditya Pakki <pakki001@umn.edu>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Uladzislau Rezki (Sony)
153178edc7 vmalloc: export __vmalloc_node_range for CONFIG_TEST_VMALLOC_MODULE
Export __vmaloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE is
enabled.  Some test cases in vmalloc test suite module require and make
use of that function.  Please note, that it is not supposed to be used
for other purposes.

We need it only for performance analysis, stressing and stability check
of vmalloc allocator.

Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Roman Penyaev
bc84c53525 mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range()
vmalloc_user*() calls differ from normal vmalloc() only in that they set
VM_USERMAP flags for the area.  During the whole history of vmalloc.c
changes now it is possible simply to pass VM_USERMAP flags directly to
__vmalloc_node_range() call instead of finding the area (which obviously
takes time) after the allocation.

Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Roman Penyaev
c67dc62475 mm/vmalloc: do not call kmemleak_free() on not yet accounted memory
__vmalloc_area_node() calls vfree() on error path, which in turn calls
kmemleak_free(), but area is not yet accounted by kmemleak_vmalloc().

Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Roman Penyaev
401592d2e0 mm/vmalloc: fix size check for remap_vmalloc_range_partial()
When VM_NO_GUARD is not set area->size includes adjacent guard page,
thus for correct size checking get_vm_area_size() should be used, but
not area->size.

This fixes possible kernel oops when userspace tries to mmap an area on
1 page bigger than was allocated by vmalloc_user() call: the size check
inside remap_vmalloc_range_partial() accounts non-existing guard page
also, so check successfully passes but vmalloc_to_page() returns NULL
(guard page does not physically exist).

The following code pattern example should trigger an oops:

  static int oops_mmap(struct file *file, struct vm_area_struct *vma)
  {
        void *mem;

        mem = vmalloc_user(4096);
        BUG_ON(!mem);
        /* Do not care about mem leak */

        return remap_vmalloc_range(vma, mem, 0);
  }

And userspace simply mmaps size + PAGE_SIZE:

  mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);

Possible candidates for oops which do not have any explicit size
checks:

   *** drivers/media/usb/stkwebcam/stk-webcam.c:
   v4l_stk_mmap[789]   ret = remap_vmalloc_range(vma, sbuf->buffer, 0);

Or the following one:

   *** drivers/video/fbdev/core/fbmem.c
   static int
   fb_mmap(struct file *file, struct vm_area_struct * vma)
        ...
        res = fb->fb_mmap(info, vma);

Where fb_mmap callback calls remap_vmalloc_range() directly without any
explicit checks:

   *** drivers/video/fbdev/vfb.c
   static int vfb_mmap(struct fb_info *info,
             struct vm_area_struct *vma)
   {
       return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff);
   }

Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Joe Perches <joe@perches.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Roman Penyaev
5a82ac715d mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBA
This patch repeats the original one from David S Miller:

  2dca6999ee ("mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA")

but for missed vmalloc_32_user() case, which also requires correct
alignment of virtual address on kernel side to avoid D-caches aliases.
A bit of copy-paste from original patch to recover in memory of what is
all about:

  When a vmalloc'd area is mmap'd into userspace, some kind of
  co-ordination is necessary for this to work on platforms with cpu
  D-caches which can have aliases.

  Otherwise kernel side writes won't be seen properly in userspace and
  vice versa.

  If the kernel side mapping and the user side one have the same
  alignment, modulo SHMLBA, this can work as long as VM_SHARED is shared
  of VMA and for all current users this is true. VM_SHARED will force
  SHMLBA alignment of the user side mmap on platforms with D-cache
  aliasing matters.

  David S. Miller

> What are the user-visible runtime effects of this change?

In simple words: proper alignment avoids possible difference in data,
seen by different virtual mapings: userspace and kernel in our case.
I.e. userspace reads cache line A, kernel writes to cache line B.  Both
cache lines correspond to the same physical memory (thus aliases).

So this should fix data corruption for archs with vivt and vipt caches,
e.g. armv6.  Personally I've never worked with this archs, I just
spotted the strange difference in code: for one case we do alignment,
for another - not.  I have a strong feeling that David simply missed
vmalloc_32_user() case.

>
> Is a -stable backport needed?

No, I do not think so.  The only one user of vmalloc_32_user() is
virtual frame buffer device drivers/video/fbdev/vfb.c, which has in the
description "The main use of this frame buffer device is testing and
debugging the frame buffer subsystem.  Do NOT enable it for normal
systems!".

And it seems to me that this vfb.c does not need 32bit addressable pages
(vmalloc_32_user() case), because it is virtual device and should not
care about things like dma32 zones, etc.  Probably is better to clean
the code and switch vfb.c from vmalloc_32_user() to vmalloc_user() case
and wipe out vmalloc_32_user() from vmalloc.c completely.  But I'm not
very much sure that this is worth to do, that's so minor, so we can
leave it as is.

Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Shakeel Butt
60cd4bcd62 memcg: localize memcg_kmem_enabled() check
Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
functions, so, the users don't have to explicitly check that condition.

This is purely code cleanup patch without any functional change.  Only
the order of checks in memcg_charge_slab() can potentially be changed
but the functionally it will be same.  This should not matter as
memcg_charge_slab() is not in the hot path.

Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Wei Yang
9234bae9b2 mm, slub: make the comment of put_cpu_partial() complete
There are two cases when put_cpu_partial() is invoked.

    * __slab_free
    * get_partial_node

This patch just makes it cover these two cases.

Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Kirill Tkhai
52d1e606ee mm: reuse only-pte-mapped KSM page in do_wp_page()
Add an optimization for KSM pages almost in the same way that we have
for ordinary anonymous pages.  If there is a write fault in a page,
which is mapped to an only pte, and it is not related to swap cache; the
page may be reused without copying its content.

[ Note that we do not consider PageSwapCache() pages at least for now,
  since we don't want to complicate __get_ksm_page(), which has nice
  optimization based on this (for the migration case). Currenly it is
  spinning on PageSwapCache() pages, waiting for when they have
  unfreezed counters (i.e., for the migration finish). But we don't want
  to make it also spinning on swap cache pages, which we try to reuse,
  since there is not a very high probability to reuse them. So, for now
  we do not consider PageSwapCache() pages at all. ]

So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
page_stable_node(), to skip a page, which KSM is currently trying to
link to stable tree.  Then we do page_ref_freeze() to prohibit KSM to
merge one more page into the page, we are reusing.  After that, nobody
can refer to the reusing page: KSM skips !PageSwapCache() pages with
zero refcount; and the protection against of all other participants is
the same as for reused ordinary anon pages pte lock, page lock and
mmap_sem.

[akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Anshuman Khandual
98fa15f34c mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code.  Please let me know if this
might have missed some instances.  This might also have replaced some
false positives.  I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1.  Even though implicitly understood it is always better to
have macros in there.  Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE.  This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Liviu Dudau
6ade20327d mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap()
find_vmap_area() can return a NULL pointer and we're going to
dereference it without checking it first.  Use the existing
find_vm_area() function which does exactly what we want and checks for
the NULL pointer.

Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk
Fixes: f3c01d2f3a ("mm: vmalloc: avoid racy handling of debugobjects in vunmap")
Signed-off-by: Liviu Dudau <liviu@dudau.co.uk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Arun KS
a9cd410a3d mm/page_alloc.c: memory hotplug: free pages as higher order
When freeing pages are done with higher order, time spent on coalescing
pages by buddy allocator can be reduced.  With section size of 256MB,
hot add latency of a single section shows improvement from 50-60 ms to
less than 1 ms, hence improving the hot add latency by 60 times.  Modify
external providers of online callback to align with the change.

[arunks@codeaurora.org: v11]
  Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
[akpm@linux-foundation.org: remove unused local, per Arun]
[akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
[akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
[arunks@codeaurora.org: v8]
  Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
[arunks@codeaurora.org: v9]
  Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Qian Cai
278d7756df mm/slub.c: remove an unused addr argument
"addr" function argument is not used in alloc_consistency_checks() at
all, so remove it.

Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw
Fixes: becfda68ab ("slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Qian Cai
92d1d07daa mm/slab.c: kmemleak no scan alien caches
Kmemleak throws endless warnings during boot due to in
__alloc_alien_cache(),

    alc = kmalloc_node(memsize, gfp, node);
    init_arraycache(&alc->ac, entries, batch);
    kmemleak_no_scan(ac);

Kmemleak does not track the array cache (alc->ac) but the alien cache
(alc) instead, so let it track the latter by lifting kmemleak_no_scan()
out of init_arraycache().

There is another place that calls init_arraycache(), but
alloc_kmem_cache_cpus() uses the percpu allocation where will never be
considered as a leak.

  kmemleak: Found object by alias at 0xffff8007b9aa7e38
  CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
  Call trace:
   dump_backtrace+0x0/0x168
   show_stack+0x24/0x30
   dump_stack+0x88/0xb0
   lookup_object+0x84/0xac
   find_and_get_object+0x84/0xe4
   kmemleak_no_scan+0x74/0xf4
   setup_kmem_cache_node+0x2b4/0x35c
   __do_tune_cpucache+0x250/0x2d4
   do_tune_cpucache+0x4c/0xe4
   enable_cpucache+0xc8/0x110
   setup_cpu_cache+0x40/0x1b8
   __kmem_cache_create+0x240/0x358
   create_cache+0xc0/0x198
   kmem_cache_create_usercopy+0x158/0x20c
   kmem_cache_create+0x50/0x64
   fsnotify_init+0x58/0x6c
   do_one_initcall+0x194/0x388
   kernel_init_freeable+0x668/0x688
   kernel_init+0x18/0x124
   ret_from_fork+0x10/0x18
  kmemleak: Object 0xffff8007b9aa7e00 (size 256):
  kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
  kmemleak:   min_count = 1
  kmemleak:   count = 0
  kmemleak:   flags = 0x1
  kmemleak:   checksum = 0
  kmemleak:   backtrace:
       kmemleak_alloc+0x84/0xb8
       kmem_cache_alloc_node_trace+0x31c/0x3a0
       __kmalloc_node+0x58/0x78
       setup_kmem_cache_node+0x26c/0x35c
       __do_tune_cpucache+0x250/0x2d4
       do_tune_cpucache+0x4c/0xe4
       enable_cpucache+0xc8/0x110
       setup_cpu_cache+0x40/0x1b8
       __kmem_cache_create+0x240/0x358
       create_cache+0xc0/0x198
       kmem_cache_create_usercopy+0x158/0x20c
       kmem_cache_create+0x50/0x64
       fsnotify_init+0x58/0x6c
       do_one_initcall+0x194/0x388
       kernel_init_freeable+0x668/0x688
       kernel_init+0x18/0x124
  kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
  CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
  Call trace:
   dump_backtrace+0x0/0x168
   show_stack+0x24/0x30
   dump_stack+0x88/0xb0
   kmemleak_no_scan+0x90/0xf4
   setup_kmem_cache_node+0x2b4/0x35c
   __do_tune_cpucache+0x250/0x2d4
   do_tune_cpucache+0x4c/0xe4
   enable_cpucache+0xc8/0x110
   setup_cpu_cache+0x40/0x1b8
   __kmem_cache_create+0x240/0x358
   create_cache+0xc0/0x198
   kmem_cache_create_usercopy+0x158/0x20c
   kmem_cache_create+0x50/0x64
   fsnotify_init+0x58/0x6c
   do_one_initcall+0x194/0x388
   kernel_init_freeable+0x668/0x688
   kernel_init+0x18/0x124
   ret_from_fork+0x10/0x18

Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
Fixes: 1fe00d50a9 ("slab: factor out initialization of array cache")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Peng Wang
edde82b6df mm/slub.c: freelist is ensured to be NULL when new_slab() fails
new_slab_objects() will return immediately if freelist is not NULL.

         if (freelist)
                 return freelist;

One more assignment operation could be avoided.

Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cn
Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Andrey Konovalov
5c0198b6fb kasan: fix coccinelle warnings in kasan_p*_table
kasan_p4d_table(), kasan_pmd_table() and kasan_pud_table() are declared
as returning bool, but return 0 instead of false, which produces a
coccinelle warning.  Fix it.

Link: http://lkml.kernel.org/r/1fa6fadf644859e8a6a8ecce258444b49be8c7ee.1551716733.git.andreyknvl@google.com
Fixes: 0207df4fa1 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:13 -08:00
Arnd Bergmann
bcf6f55a0d kasan: fix kasan_check_read/write definitions
Building little-endian allmodconfig kernels on arm64 started failing
with the generated atomic.h implementation, since we now try to call
kasan helpers from the EFI stub:

  aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set':
  include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write'

I suspect that we get similar problems in other files that explicitly
disable KASAN for some reason but call atomic_t based helper functions.

We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro
that the compiler sets instead of checking CONFIG_KASAN, but this in
turn requires a small hack in mm/kasan/common.c so we do see the extern
declaration there instead of the inline function.

Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de
Fixes: b1864b828644 ("locking/atomics: build atomic headers as required")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:13 -08:00
Qian Cai
4117992df6 page_poison: play nicely with KASAN
KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
It triggers false positives in the allocation path:

  BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
  Read of size 8 at addr ffff88881f800000 by task swapper/0
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
  Call Trace:
   dump_stack+0xe0/0x19a
   print_address_description.cold.2+0x9/0x28b
   kasan_report.cold.3+0x7a/0xb5
   __asan_report_load8_noabort+0x19/0x20
   memchr_inv+0x2ea/0x330
   kernel_poison_pages+0x103/0x3d5
   get_page_from_freelist+0x15e7/0x4d90

because KASAN has not yet unpoisoned the shadow page for allocation
before it checks memchr_inv() but only found a stale poison pattern.

Also, false positives in free path,

  BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
  Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
  CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
  Call Trace:
   dump_stack+0xe0/0x19a
   print_address_description.cold.2+0x9/0x28b
   kasan_report.cold.3+0x7a/0xb5
   check_memory_region+0x22d/0x250
   memset+0x28/0x40
   kernel_poison_pages+0x29e/0x3d5
   __free_pages_ok+0x75f/0x13e0

due to KASAN adds poisoned redzones around slab objects, but the page
poisoning needs to poison the whole page.

Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:13 -08:00
Andrey Ryabinin
7771bdbbfd kasan: remove use after scope bugs detection.
Use after scope bugs detector seems to be almost entirely useless for
the linux kernel.  It exists over two years, but I've seen only one
valid bug so far [1].  And the bug was fixed before it has been
reported.  There were some other use-after-scope reports, but they were
false-positives due to different reasons like incompatibility with
structleak plugin.

This feature significantly increases stack usage, especially with GCC <
9 version, and causes a 32K stack overflow.  It probably adds
performance penalty too.

Given all that, let's remove use-after-scope detector entirely.

While preparing this patch I've noticed that we mistakenly enable
use-after-scope detection for clang compiler regardless of
CONFIG_KASAN_EXTRA setting.  This is also fixed now.

[1] http://lkml.kernel.org/r/<20171129052106.rhgbjhhis53hkgfn@wfg-t540p.sh.intel.com>

Link: http://lkml.kernel.org/r/20190111185842.13978-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Will Deacon <will.deacon@arm.com>		[arm64]
Cc: Qian Cai <cai@lca.pw>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:13 -08:00
zhongjiang
46612b751c mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
When soft_offline_in_use_page() runs on a thp tail page after pmd is
split, we trigger the following VM_BUG_ON_PAGE():

  Memory failure: 0x3755ff: non anonymous thp
  __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000
  Soft offlining pfn 0x34d805 at process virtual address 0x20fff000
  page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1
  flags: 0x2fffff80000000()
  raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000
  raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
  ------------[ cut here ]------------
  kernel BUG at ./include/linux/mm.h:519!

soft_offline_in_use_page() passed refcount and page lock from tail page
to head page, which is not needed because we can pass any subpage to
split_huge_page().

Naoya had fixed a similar issue in c3901e722b ("mm: hwpoison: fix thp
split handling in memory_failure()").  But he missed fixing soft
offline.

Link: http://lkml.kernel.org/r/1551452476-24000-1-git-send-email-zhongjiang@huawei.com
Fixes: 61f5d698cc ("mm: re-enable THP")
Signed-off-by: zhongjiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:13 -08:00
Linus Torvalds
2d28e01dca Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "2 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  hugetlbfs: fix races and page leaks during migration
  kasan: turn off asan-stack for clang-8 and earlier
2019-03-01 09:04:59 -08:00
Mike Kravetz
cb6acd01e2 hugetlbfs: fix races and page leaks during migration
hugetlb pages should only be migrated if they are 'active'.  The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.

When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked.  Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code.  This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.

To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.

Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
then migrates the pages associated with the file from one node to
another.  When the program exits, huge page counts are as follows:

  node0
  1024    free_hugepages
  1024    nr_hugepages

  node1
  0       free_hugepages
  1024    nr_hugepages

  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool

That is as expected.  2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem.  If the file is then removed, the counts become:

  node0
  1024    free_hugepages
  1024    nr_hugepages

  node1
  1024    free_hugepages
  1024    nr_hugepages

  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool

Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use.  The only way to 'fix' the filesystem
accounting is to unmount the filesystem

If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field.  At migration
time, this information is not preserved.  To fix, simply transfer
page_private from old to new page at migration time if necessary.

There is a related race with removing a huge page from a file and
migration.  When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page().  A page could be migrated
while in this state.  However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.

To fix that, check for this condition before migrating a huge page.  If
the condition is detected, return EBUSY for the page.

Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc5422230 ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
[mike.kravetz@oracle.com: v2]
  Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz@oracle.com: update comment and changelog]
  Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-01 09:02:33 -08:00
Dave Hansen
2794129e90 mm/memory-hotplug: Allow memory resources to be children
The mm/resource.c code is used to manage the physical address
space.  The current resource configuration can be viewed in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  The best way to repurpose this for volatile
use is to leave the existing resource in place, but add a "System
RAM" resource underneath it. This clearly communicates the
ownership relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f0000-000fffff : Reserved
  000f0000-000fffff : System ROM
00100000-9fffffff : System RAM
  01000000-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a0000000-afffffff : Persistent Memory (legacy)
  a0000000-a7ffffff : System RAM
b0000000-bffdffff : System RAM
bffe0000-bfffffff : Reserved
c0000000-febfffff : PCI Bus 0000:00

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-02-28 10:41:23 -08:00
Dave Hansen
b926b7f3ba mm/resource: Move HMM pr_debug() deeper into resource code
HMM consumes physical address space for its own use, even
though nothing is mapped or accessible there.  It uses a
special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
to uniquely identify these areas.

When HMM consumes address space, it makes a best guess about
what to consume.  However, it is possible that a future memory
or device hotplug can collide with the reserved area.  In the
case of these conflicts, there is an error message in
register_memory_resource().

Later patches in this series move register_memory_resource()
from using request_resource_conflict() to __request_region().
Unfortunately, __request_region() does not return the conflict
like the previous function did, which makes it impossible to
check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
resource.

Instead of warning in register_memory_resource(), move the
check into the core resource code itself (__request_region())
where the conflicting resource _is_ available.  This has the
added bonus of producing a warning in case of HMM conflicts
with devices *or* RAM address space, as opposed to the RAM-
only warnings that were there previously.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-02-28 10:41:23 -08:00
Jann Horn
0a1d52994d mm: enforce min addr even if capable() in expand_downwards()
security_mmap_addr() does a capability check with current_cred(), but
we can reach this code from contexts like a VFS write handler where
current_cred() must not be used.

This can be abused on systems without SMAP to make NULL pointer
dereferences exploitable again.

Fixes: 8869477a49 ("security: protect from stack expansion into low vm addresses")
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-27 17:27:02 -08:00
Peng Fan
1b046b445c percpu: km: no need to consider pcpu_group_offsets[0]
percpu-km is used on UP systems which only has one group,
so the group offset will be always 0, there is no need
to subtract pcpu_group_offsets[0] when assigning chunk->base_addr

Signed-off-by: Peng Fan <peng.fan@nxp.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-02-26 13:47:58 -08:00
Darrick J. Wong
29b00e6099 tmpfs: fix uninitialized return value in shmem_link
When we made the shmem_reserve_inode call in shmem_link conditional, we
forgot to update the declaration for ret so that it always has a known
value.  Dan Carpenter pointed out this deficiency in the original patch.

Fixes: 1062af920c ("tmpfs: fix link accounting when a tmpfile is linked in")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Matej Kupljen <matej.kupljen@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-25 11:49:22 -08:00
Linus Torvalds
53a41cb7ed Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses"
This reverts commit 9da3f2b740.

It was well-intentioned, but wrong.  Overriding the exception tables for
instructions for random reasons is just wrong, and that is what the new
code did.

It caused problems for tracing, and it caused problems for strncpy_from_user(),
because the new checks made perfectly valid use cases break, rather than
catch things that did bad things.

Unchecked user space accesses are a problem, but that's not a reason to
add invalid checks that then people have to work around with silly flags
(in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
odd way to say "this commit was wrong" and was sprinked into random
places to hide the wrongness).

The real fix to unchecked user space accesses is to get rid of the
special "let's not check __get_user() and __put_user() at all" logic.
Make __{get|put}_user() be just aliases to the regular {get|put}_user()
functions, and make it impossible to access user space without having
the proper checks in places.

The raison d'être of the special double-underscore versions used to be
that the range check was expensive, and if you did multiple user
accesses, you'd do the range check up front (like the signal frame
handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
made that optimization pointless, because the _real_ expense is the "set
CPU flag to allow user space access".

Do let's not break the valid cases to catch invalid cases that shouldn't
even exist.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-25 09:10:51 -08:00
Peng Fan
2de7852fe9 percpu: use nr_groups as check condition
group_cnt array is defined with NR_CPUS entries, but normally
nr_groups will not reach up to NR_CPUS. So there is no issue
to the current code.

Checking other parts of pcpu_build_alloc_info, use nr_groups as
check condition, so make it consistent to use 'group < nr_groups'
as for loop check. In case we do have nr_groups equals with NR_CPUS,
we could also avoid memory access out of bounds.

Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2019-02-23 14:36:53 -08:00
Michal Hocko
891cb2a72d mm, memory_hotplug: fix off-by-one in is_pageblock_removable
Rong Chen has reported the following boot crash:

    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e4 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    RIP: 0010:page_mapping+0x12/0x80
    Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da <48> 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48
    RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202
    RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a
    RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000
    RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13
    R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000
    R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001
    FS:  00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0
    Call Trace:
     __dump_page+0x14/0x2c0
     is_mem_section_removable+0x24c/0x2c0
     removable_show+0x87/0xa0
     dev_attr_show+0x25/0x60
     sysfs_kf_seq_show+0xba/0x110
     seq_read+0x196/0x3f0
     __vfs_read+0x34/0x180
     vfs_read+0xa0/0x150
     ksys_read+0x44/0xb0
     do_syscall_64+0x5e/0x4a0
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

and bisected it down to commit efad4e475c ("mm, memory_hotplug:
is_mem_section_removable do not pass the end of a zone").

The reason for the crash is that the mapping is garbage for poisoned
(uninitialized) page.  This shouldn't happen as all pages in the zone's
boundary should be initialized.

Later debugging revealed that the actual problem is an off-by-one when
evaluating the end_page.  'start_pfn + nr_pages' resp 'zone_end_pfn'
refers to a pfn after the range and as such it might belong to a
differen memory section.

This along with CONFIG_SPARSEMEM then makes the loop condition
completely bogus because a pointer arithmetic doesn't work for pages
from two different sections in that memory model.

Fix the issue by reworking is_pageblock_removable to be pfn based and
only use struct page where necessary.  This makes the code slightly
easier to follow and we will remove the problematic pointer arithmetic
completely.

Link: http://lkml.kernel.org/r/20190218181544.14616-1-mhocko@kernel.org
Fixes: efad4e475c ("mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: <rong.a.chen@intel.com>
Tested-by: <rong.a.chen@intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:01 -08:00
Daniel Vetter
6c8fcc096b mm: don't let userspace spam allocations warnings
memdump_user usually gets fed unchecked userspace input.  Blasting a
full backtrace into dmesg every time is a bit excessive - I'm not sure
on the kernel rule in general, but at least in drm we're trying not to
let unpriviledge userspace spam the logs freely.  Definitely not entire
warning backtraces.

It also means more filtering for our CI, because our testsuite exercises
these corner cases and so hits these a lot.

Link: http://lkml.kernel.org/r/20190220204058.11676-1-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:01 -08:00
Qian Cai
6373dca16c slub: fix a crash with SLUB_DEBUG + KASAN_SW_TAGS
In process_slab(), "p = get_freepointer()" could return a tagged
pointer, but "addr = page_address()" always return a native pointer.  As
the result, slab_index() is messed up here,

    return (p - addr) / s->size;

All other callers of slab_index() have the same situation where "addr"
is from page_address(), so just need to untag "p".

    # cat /sys/kernel/slab/hugetlbfs_inode_cache/alloc_calls

    Unable to handle kernel paging request at virtual address 2bff808aa4856d48
    Mem abort info:
      ESR = 0x96000007
      Exception class = DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
    Data abort info:
      ISV = 0, ISS = 0x00000007
      CM = 0, WnR = 0
    swapper pgtable: 64k pages, 48-bit VAs, pgdp = 0000000002498338
    [2bff808aa4856d48] pgd=00000097fcfd0003, pud=00000097fcfd0003, pmd=00000097fca30003, pte=00e8008b24850712
    Internal error: Oops: 96000007 [#1] SMP
    CPU: 3 PID: 79210 Comm: read_all Tainted: G             L    5.0.0-rc7+ #84
    Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.6 07/10/2018
    pstate: 00400089 (nzcv daIf +PAN -UAO)
    pc : get_map+0x78/0xec
    lr : get_map+0xa0/0xec
    sp : aeff808989e3f8e0
    x29: aeff808989e3f940 x28: ffff800826200000
    x27: ffff100012d47000 x26: 9700000000002500
    x25: 0000000000000001 x24: 52ff8008200131f8
    x23: 52ff8008200130a0 x22: 52ff800820013098
    x21: ffff800826200000 x20: ffff100013172ba0
    x19: 2bff808a8971bc00 x18: ffff1000148f5538
    x17: 000000000000001b x16: 00000000000000ff
    x15: ffff1000148f5000 x14: 00000000000000d2
    x13: 0000000000000001 x12: 0000000000000000
    x11: 0000000020000002 x10: 2bff808aa4856d48
    x9 : 0000020000000000 x8 : 68ff80082620ebb0
    x7 : 0000000000000000 x6 : ffff1000105da1dc
    x5 : 0000000000000000 x4 : 0000000000000000
    x3 : 0000000000000010 x2 : 2bff808a8971bc00
    x1 : ffff7fe002098800 x0 : ffff80082620ceb0
    Process read_all (pid: 79210, stack limit = 0x00000000f65b9361)
    Call trace:
     get_map+0x78/0xec
     process_slab+0x7c/0x47c
     list_locations+0xb0/0x3c8
     alloc_calls_show+0x34/0x40
     slab_attr_show+0x34/0x48
     sysfs_kf_seq_show+0x2e4/0x570
     kernfs_seq_show+0x12c/0x1a0
     seq_read+0x48c/0xf84
     kernfs_fop_read+0xd4/0x448
     __vfs_read+0x94/0x5d4
     vfs_read+0xcc/0x194
     ksys_read+0x6c/0xe8
     __arm64_sys_read+0x68/0xb0
     el0_svc_handler+0x230/0x3bc
     el0_svc+0x8/0xc
    Code: d3467d2a 9ac92329 8b0a0e6a f9800151 (c85f7d4b)
    ---[ end trace a383a9a44ff13176 ]---
    Kernel panic - not syncing: Fatal exception
    SMP: stopping secondary CPUs
    SMP: failed to stop secondary CPUs 1-7,32,40,127
    Kernel Offset: disabled
    CPU features: 0x002,20000c18
    Memory Limit: none
    ---[ end Kernel panic - not syncing: Fatal exception ]---

Link: http://lkml.kernel.org/r/20190220020251.82039-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
557ea25383 kasan, slab: remove redundant kasan_slab_alloc hooks
kasan_slab_alloc() calls in kmem_cache_alloc() and kmem_cache_alloc_node()
are redundant as they are already called via slab_alloc/slab_alloc_node()->
slab_post_alloc_hook()->kasan_slab_alloc().  Remove them.

Link: http://lkml.kernel.org/r/4ca1655cdcfc4379c49c50f7bf80f81c4ad01485.1550602886.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
51dedad06b kasan, slab: make freelist stored without tags
Similarly to "kasan, slub: move kasan_poison_slab hook before
page_address", move kasan_poison_slab() before alloc_slabmgmt(), which
calls page_address(), to make page_address() return value to be
non-tagged.  This, combined with calling kasan_reset_tag() for off-slab
slab management object, leads to freelist being stored non-tagged.

Link: http://lkml.kernel.org/r/dfb53b44a4d00de3879a05a9f04c1f55e584f7a1.1550602886.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
219667c23c kasan, slab: fix conflicts with CONFIG_HARDENED_USERCOPY
Similarly to commit 96fedce27e ("kasan: make tag based mode work with
CONFIG_HARDENED_USERCOPY"), we need to reset pointer tags in
__check_heap_object() in mm/slab.c before doing any pointer math.

Link: http://lkml.kernel.org/r/9a5c0f958db10e69df5ff9f2b997866b56b7effc.1550602886.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
dc15a8a254 kasan: prevent tracing of tags.c
Similarly to commit 0d0c8de878 ("kasan: mark file common so ftrace
doesn't trace it") add the -pg flag to mm/kasan/tags.c to prevent
conflicts with tracing.

Link: http://lkml.kernel.org/r/9c4c3ce5ccfb894c7fe66d91de7c1da2787b4da4.1550602886.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
3f41b60938 kasan: fix random seed generation for tag-based mode
There are two issues with assigning random percpu seeds right now:

1. We use for_each_possible_cpu() to iterate over cpus, but cpumask is
   not set up yet at the moment of kasan_init(), and thus we only set
   the seed for cpu #0.

2. A call to get_random_u32() always returns the same number and produces
   a message in dmesg, since the random subsystem is not yet initialized.

Fix 1 by calling kasan_init_tags() after cpumask is set up.

Fix 2 by using get_cycles() instead of get_random_u32(). This gives us
lower quality random numbers, but it's good enough, as KASAN is meant to
be used as a debugging tool and not a mitigation.

Link: http://lkml.kernel.org/r/1f815cc914b61f3516ed4cc9bfd9eeca9bd5d9de.1550677973.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Darrick J. Wong
1062af920c tmpfs: fix link accounting when a tmpfile is linked in
tmpfs has a peculiarity of accounting hard links as if they were
separate inodes: so that when the number of inodes is limited, as it is
by default, a user cannot soak up an unlimited amount of unreclaimable
dcache memory just by repeatedly linking a file.

But when v3.11 added O_TMPFILE, and the ability to use linkat() on the
fd, we missed accommodating this new case in tmpfs: "df -i" shows that
an extra "inode" remains accounted after the file is unlinked and the fd
closed and the actual inode evicted.  If a user repeatedly links
tmpfiles into a tmpfs, the limit will be hit (ENOSPC) even after they
are deleted.

Just skip the extra reservation from shmem_link() in this case: there's
a sense in which this first link of a tmpfile is then cheaper than a
hard link of another file, but the accounting works out, and there's
still good limiting, so no need to do anything more complicated.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182134370.7035@eggly.anvils
Fixes: f4e0c30c19 ("allow the temp files created by open() to be linked to")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Matej Kupljen <matej.kupljen@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Michal Hocko
6ea183d60c mm: handle lru_add_drain_all for UP properly
Since for_each_cpu(cpu, mask) added by commit 2d3854a37e
("cpumask: introduce new API, without changing anything") did not
evaluate the mask argument if NR_CPUS == 1 due to CONFIG_SMP=n,
lru_add_drain_all() is hitting WARN_ON() at __flush_work() added by
commit 4d43d395fe ("workqueue: Try to catch flush_work() without
INIT_WORK().") by unconditionally calling flush_work() [1].

Workaround this issue by using CONFIG_SMP=n specific lru_add_drain_all
implementation.  There is no real need to defer the implementation to
the workqueue as the draining is going to happen on the local cpu.  So
alias lru_add_drain_all to lru_add_drain which does all the necessary
work.

[akpm@linux-foundation.org: fix various build warnings]
[1] https://lkml.kernel.org/r/18a30387-6aa5-6123-e67c-57579ecc3f38@roeck-us.net
Link: http://lkml.kernel.org/r/20190213124334.GH4525@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Mel Gorman
94b3334cbe mm, page_alloc: fix a division by zero error when boosting watermarks v2
Yury Norov reported that an arm64 KVM instance could not boot since
after v5.0-rc1 and could addressed by reverting the patches

  1c30844d2d ("mm: reclaim small amounts of memory when an external
  73444bc4d8 ("mm, page_alloc: do not wake kswapd with zone lock held")

The problem is that a division by zero error is possible if boosting
occurs very early in boot if the system has very little memory.  This
patch avoids the division by zero error.

Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
Fixes: 1c30844d2d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Robin Murphy
311ade0eab mm/debug.c: fix __dump_page() for poisoned pages
Evaluating page_mapping() on a poisoned page ends up dereferencing junk
and making PF_POISONED_CHECK() considerably crashier than intended:

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
    Mem abort info:
      ESR = 0x96000005
      Exception class = DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
    Data abort info:
      ISV = 0, ISS = 0x00000005
      CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000c2f6ac38
    [0000000000000006] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 2 PID: 491 Comm: bash Not tainted 5.0.0-rc1+ #1
    Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
    pstate: 00000005 (nzcv daif -PAN -UAO)
    pc : page_mapping+0x18/0x118
    lr : __dump_page+0x1c/0x398
    Process bash (pid: 491, stack limit = 0x000000004ebd4ecd)
    Call trace:
     page_mapping+0x18/0x118
     __dump_page+0x1c/0x398
     dump_page+0xc/0x18
     remove_store+0xbc/0x120
     dev_attr_store+0x18/0x28
     sysfs_kf_write+0x40/0x50
     kernfs_fop_write+0x130/0x1d8
     __vfs_write+0x30/0x180
     vfs_write+0xb4/0x1a0
     ksys_write+0x60/0xd0
     __arm64_sys_write+0x18/0x20
     el0_svc_common+0x94/0xf8
     el0_svc_handler+0x68/0x70
     el0_svc+0x8/0xc
    Code: f9400401 d1000422 f240003f 9a801040 (f9400402)
    ---[ end trace cdb5eb5bf435cecb ]---

Fix that by not inspecting the mapping until we've determined that it's
likely to be valid.  Now the above condition still ends up stopping the
kernel, but in the correct manner:

    page:ffffffbf20000000 is uninitialized and poisoned
    raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
    raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    ------------[ cut here ]------------
    kernel BUG at ./include/linux/mm.h:1006!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 1 PID: 483 Comm: bash Not tainted 5.0.0-rc1+ #3
    Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
    pstate: 40000005 (nZcv daif -PAN -UAO)
    pc : remove_store+0xbc/0x120
    lr : remove_store+0xbc/0x120
    ...

Link: http://lkml.kernel.org/r/03b53ee9d7e76cda4b9b5e1e31eea080db033396.1550071778.git.robin.murphy@arm.com
Fixes: 1c6fb1d89e ("mm: print more information about mapping in __dump_page")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Qian Cai
338cfaad49 slub: fix SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS
Enabling SLUB_DEBUG's SLAB_CONSISTENCY_CHECKS with KASAN_SW_TAGS
triggers endless false positives during boot below due to
check_valid_pointer() checks tagged pointers which have no addresses
that is valid within slab pages:

  BUG radix_tree_node (Tainted: G    B            ): Freelist Pointer check fails
  -----------------------------------------------------------------------------

  INFO: Slab objects=69 used=69 fp=0x          (null) flags=0x7ffffffc000200
  INFO: Object @offset=15060037153926966016 fp=0x

  Redzone: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 18 6b 06 00 08 80 ff d0  .........k......
  Object : 18 6b 06 00 08 80 ff d0 00 00 00 00 00 00 00 00  .k..............
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  Redzone: bb bb bb bb bb bb bb bb                          ........
  Padding: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B             5.0.0-rc5+ #18
  Call trace:
    dump_backtrace+0x0/0x450
    show_stack+0x20/0x2c
    __dump_stack+0x20/0x28
    dump_stack+0xa0/0xfc
    print_trailer+0x1bc/0x1d0
    object_err+0x40/0x50
    alloc_debug_processing+0xf0/0x19c
    ___slab_alloc+0x554/0x704
    kmem_cache_alloc+0x2f8/0x440
    radix_tree_node_alloc+0x90/0x2fc
    idr_get_free+0x1e8/0x6d0
    idr_alloc_u32+0x11c/0x2a4
    idr_alloc+0x74/0xe0
    worker_pool_assign_id+0x5c/0xbc
    workqueue_init_early+0x49c/0xd50
    start_kernel+0x52c/0xac4
  FIX radix_tree_node: Marking all objects used

Link: http://lkml.kernel.org/r/20190209044128.3290-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
d36a63a943 kasan, slub: fix more conflicts with CONFIG_SLAB_FREELIST_HARDENED
When CONFIG_KASAN_SW_TAGS is enabled, ptr_addr might be tagged.  Normally,
this doesn't cause any issues, as both set_freepointer() and
get_freepointer() are called with a pointer with the same tag.  However,
there are some issues with CONFIG_SLUB_DEBUG code.  For example, when
__free_slub() iterates over objects in a cache, it passes untagged
pointers to check_object().  check_object() in turns calls
get_freepointer() with an untagged pointer, which causes the freepointer
to be restored incorrectly.

Add kasan_reset_tag to freelist_ptr(). Also add a detailed comment.

Link: http://lkml.kernel.org/r/bf858f26ef32eb7bd24c665755b3aee4bc58d0e4.1550103861.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
18e5066102 kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED
CONFIG_SLAB_FREELIST_HARDENED hashes freelist pointer with the address of
the object where the pointer gets stored.  With tag based KASAN we don't
account for that when building freelist, as we call set_freepointer() with
the first argument untagged.  This patch changes the code to properly
propagate tags throughout the loop.

Link: http://lkml.kernel.org/r/3df171559c52201376f246bf7ce3184fe21c1dc7.1549921721.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
a710122428 kasan, slub: move kasan_poison_slab hook before page_address
With tag based KASAN page_address() looks at the page flags to see whether
the resulting pointer needs to have a tag set.  Since we don't want to set
a tag when page_address() is called on SLAB pages, we call
page_kasan_tag_reset() in kasan_poison_slab().  However in allocate_slab()
page_address() is called before kasan_poison_slab().  Fix it by changing
the order.

[andreyknvl@google.com: fix compilation error when CONFIG_SLUB_DEBUG=n]
  Link: http://lkml.kernel.org/r/ac27cc0bbaeb414ed77bcd6671a877cf3546d56e.1550066133.git.andreyknvl@google.com
Link: http://lkml.kernel.org/r/cd895d627465a3f1c712647072d17f10883be2a1.1549921721.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
a2f775751d kmemleak: account for tagged pointers when calculating pointer range
kmemleak keeps two global variables, min_addr and max_addr, which store
the range of valid (encountered by kmemleak) pointer values, which it
later uses to speed up pointer lookup when scanning blocks.

With tagged pointers this range will get bigger than it needs to be.  This
patch makes kmemleak untag pointers before saving them to min_addr and
max_addr and when performing a lookup.

Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Qian Cai <cai@lca.pw>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
53128245b4 kasan, kmemleak: pass tagged pointers to kmemleak
Right now we call kmemleak hooks before assigning tags to pointers in
KASAN hooks.  As a result, when an objects gets allocated, kmemleak sees a
differently tagged pointer, compared to the one it sees when the object
gets freed.  Fix it by calling KASAN hooks before kmemleak's ones.

Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Qian Cai <cai@lca.pw>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Andrey Konovalov
e1db95befb kasan: fix assigning tags twice
When an object is kmalloc()'ed, two hooks are called: kasan_slab_alloc()
and kasan_kmalloc().  Right now we assign a tag twice, once in each of the
hooks.  Fix it by assigning a tag only in the former hook.

Link: http://lkml.kernel.org/r/ce8c6431da735aa7ec051fd6497153df690eb021.1549921721.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgeniy Stepanov <eugenis@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Ralph Campbell
050c17f239 numa: change get_mempolicy() to use nr_node_ids instead of MAX_NUMNODES
The system call, get_mempolicy() [1], passes an unsigned long *nodemask
pointer and an unsigned long maxnode argument which specifies the length
of the user's nodemask array in bits (which is rounded up).  The manual
page says that if the maxnode value is too small, get_mempolicy will
return EINVAL but there is no system call to return this minimum value.
To determine this value, some programs search /proc/<pid>/status for a
line starting with "Mems_allowed:" and use the number of digits in the
mask to determine the minimum value.  A recent change to the way this line
is formatted [2] causes these programs to compute a value less than
MAX_NUMNODES so get_mempolicy() returns EINVAL.

Change get_mempolicy(), the older compat version of get_mempolicy(), and
the copy_nodes_to_user() function to use nr_node_ids instead of
MAX_NUMNODES, thus preserving the defacto method of computing the minimum
size for the nodemask array and the maxnode argument.

[1] http://man7.org/linux/man-pages/man2/get_mempolicy.2.html
[2] https://lore.kernel.org/lkml/1545405631-6808-1-git-send-email-longman@redhat.com

Link: http://lkml.kernel.org/r/20190211180245.22295-1-rcampbell@nvidia.com
Fixes: 4fb8e5b89bcbbbb ("include/linux/nodemask.h: use nr_node_ids (not MAX_NUMNODES) in __nodemask_pr_numnodes()")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Waiman Long <longman@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:00:59 -08:00
Linus Torvalds
40e196a906 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix suspend and resume in mt76x0u USB driver, from Stanislaw
    Gruszka.

 2) Missing memory barriers in xsk, from Magnus Karlsson.

 3) rhashtable fixes in mac80211 from Herbert Xu.

 4) 32-bit MIPS eBPF JIT fixes from Paul Burton.

 5) Fix for_each_netdev_feature() on big endian, from Hauke Mehrtens.

 6) GSO validation fixes from Willem de Bruijn.

 7) Endianness fix for dwmac4 timestamp handling, from Alexandre Torgue.

 8) More strict checks in tcp_v4_err(), from Eric Dumazet.

 9) af_alg_release should NULL out the sk after the sock_put(), from Mao
    Wenan.

10) Missing unlock in mac80211 mesh error path, from Wei Yongjun.

11) Missing device put in hns driver, from Salil Mehta.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
  sky2: Increase D3 delay again
  vhost: correctly check the return value of translate_desc() in log_used()
  net: netcp: Fix ethss driver probe issue
  net: hns: Fixes the missing put_device in positive leg for roce reset
  net: stmmac: Fix a race in EEE enable callback
  qed: Fix iWARP syn packet mac address validation.
  qed: Fix iWARP buffer size provided for syn packet processing.
  r8152: Add support for MAC address pass through on RTL8153-BD
  mac80211: mesh: fix missing unlock on error in table_path_del()
  net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"
  net: crypto set sk to NULL when af_alg_release.
  net: Do not allocate page fragments that are not skb aligned
  mm: Use fixed constant in page_frag_alloc instead of size + 1
  tcp: tcp_v4_err() should be more careful
  tcp: clear icsk_backoff in tcp_write_queue_purge()
  net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
  qmi_wwan: apply SET_DTR quirk to Sierra WP7607
  net: stmmac: handle endianness in dwmac4_get_timestamp
  doc: Mention MSG_ZEROCOPY implementation for UDP
  mlxsw: __mlxsw_sp_port_headroom_set(): Fix a use of local variable
  ...
2019-02-19 16:13:19 -08:00
Juergen Gross
357b4da50a x86: respect memory size limiting via mem= parameter
When limiting memory size via kernel parameter "mem=" this should be
respected even in case of memory made accessible via a PCI card.

Today this kind of memory won't be made usable in initial memory
setup as the memory won't be visible in E820 map, but it might be
added when adding PCI devices due to corresponding ACPI table entries.

Not respecting "mem=" can be corrected by adding a global max_mem_size
variable set by parse_memopt() which will result in rejecting adding
memory areas resulting in a memory size above the allowed limit.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-02-18 06:50:34 +01:00
Alexander Duyck
8644772637 mm: Use fixed constant in page_frag_alloc instead of size + 1
This patch replaces the size + 1 value introduced with the recent fix for 1
byte allocs with a constant value.

The idea here is to reduce code overhead as the previous logic would have
to read size into a register, then increment it, and write it back to
whatever field was being used. By using a constant we can avoid those
memory reads and arithmetic operations in favor of just encoding the
maximum value into the operation itself.

Fixes: 2c2ade8174 ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs")
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17 15:48:43 -08:00
Ard Biesheuvel
8a5b403d71 arm64, mm, efi: Account for GICv3 LPI tables in static memblock reserve table
In the irqchip and EFI code, we have what basically amounts to a quirk
to work around a peculiarity in the GICv3 architecture, which permits
the system memory address of LPI tables to be programmable only once
after a CPU reset. This means kexec kernels must use the same memory
as the first kernel, and thus ensure that this memory has not been
given out for other purposes by the time the ITS init code runs, which
is not very early for secondary CPUs.

On systems with many CPUs, these reservations could overflow the
memblock reservation table, and this was addressed in commit:

  eff8962888 ("efi/arm: Defer persistent reservations until after paging_init()")

However, this turns out to have made things worse, since the allocation
of page tables and heap space for the resized memblock reservation table
itself may overwrite the regions we are attempting to reserve, which may
cause all kinds of corruption, also considering that the ITS will still
be poking bits into that memory in response to incoming MSIs.

So instead, let's grow the static memblock reservation table on such
systems so it can accommodate these reservations at an earlier time.
This will permit us to revert the above commit in a subsequent patch.

[ mingo: Minor cleanups. ]

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190215123333.21209-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-16 15:02:03 +01:00
Linus Torvalds
6e7bd3b549 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix MAC address setting in mac80211 pmsr code, from Johannes Berg.

 2) Probe SFP modules after being attached, from Russell King.

 3) Byte ordering bug in SMC rx_curs_confirmed code, from Ursula Braun.

 4) Revert some r8169 changes that are causing regressions, from Heiner
    Kallweit.

 5) Fix spurious connection timeouts in netfilter nat code, from Florian
    Westphal.

 6) SKB leak in tipc, from Hoang Le.

 7) Short packet checkum issue in mlx4, similar to a previous mlx5
    change, from Saeed Mahameed. The issue is that whilst padding bytes
    are usually zero, it is not guarateed and the hardware doesn't take
    the padding bytes into consideration when generating the checksum.

 8) Fix various races in cls_tcindex, from Cong Wang.

 9) Need to set stream ext to NULL before freeing in SCTP code, from Xin
    Long.

10) Fix locking in phy_is_started, from Heiner Kallweit.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits)
  net: ethernet: freescale: set FEC ethtool regs version
  net: hns: Fix object reference leaks in hns_dsaf_roce_reset()
  mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
  net: phy: fix potential race in the phylib state machine
  net: phy: don't use locking in phy_is_started
  selftests: fix timestamping Makefile
  net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend()
  net: fix possible overflow in __sk_mem_raise_allocated()
  dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit
  net: phy: fix interrupt handling in non-started states
  sctp: set stream ext to NULL after freeing it in sctp_stream_outq_migrate
  sctp: call gso_reset_checksum when computing checksum in sctp_gso_segment
  net/mlx5e: XDP, fix redirect resources availability check
  net/mlx5: Fix a compilation warning in events.c
  net/mlx5: No command allowed when command interface is not ready
  net/mlx5e: Fix NULL pointer derefernce in set channels error flow
  netfilter: nft_compat: use-after-free when deleting targets
  team: avoid complex list operations in team_nl_cmd_options_set()
  net_sched: fix two more memory leaks in cls_tcindex
  net_sched: fix a memory leak in cls_tcindex
  ...
2019-02-15 08:00:11 -08:00
Jann Horn
2c2ade8174 mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc->size, so that's
the bias used.

However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset < 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc->size+1`.

Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.

To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset < 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-14 12:12:17 -05:00
Qian Cai
2f1ee0913c Revert "mm: use early_pfn_to_nid in page_ext_init"
This reverts commit fe53ca5427 ("mm: use early_pfn_to_nid in
page_ext_init").

When booting a system with "page_owner=on",

start_kernel
  page_ext_init
    invoke_init_callbacks
      init_section_page_ext
        init_page_owner
          init_early_allocated_pages
            init_zones_in_node
              init_pages_in_zone
                lookup_page_ext
                  page_to_nid

The issue here is that page_to_nid() will not work since some page flags
have no node information until later in page_alloc_init_late() due to
DEFERRED_STRUCT_PAGE_INIT.  Hence, it could trigger an out-of-bounds
access with an invalid nid.

  UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
  index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            memblock_alloc_try_nid_raw

It did not handle it well in init_pages_in_zone() which ends up calling
page_to_nid().

  page:ffffea0004200000 is uninitialized and poisoned
  raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
  raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
  page_owner info is not active (free page?)
  kernel BUG at include/linux/mm.h:990!
  RIP: 0010:init_page_owner+0x486/0x520

This means that assumptions behind commit fe53ca5427 ("mm: use
early_pfn_to_nid in page_ext_init") are incomplete.  Therefore, revert
the commit for now.  A proper way to move the page_owner initialization
to sooner is to hook into memmap initialization.

Link: http://lkml.kernel.org/r/20190115202812.75820-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-12 16:33:18 -08:00
Yu Zhao
414fd080d1 mm/gup: fix gup_pmd_range() for dax
For dax pmd, pmd_trans_huge() returns false but pmd_huge() returns true
on x86.  So the function works as long as hugetlb is configured.
However, dax doesn't depend on hugetlb.

Link: http://lkml.kernel.org/r/20190111034033.601-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-12 16:33:18 -08:00
Dave Chinner
a9a238e83f Revert "mm: slowly shrink slabs with a relatively small number of objects"
This reverts commit 172b06c32b ("mm: slowly shrink slabs with a
relatively small number of objects").

This change changes the agressiveness of shrinker reclaim, causing small
cache and low priority reclaim to greatly increase scanning pressure on
small caches.  As a result, light memory pressure has a disproportionate
affect on small caches, and causes large caches to be reclaimed much
faster than previously.

As a result, it greatly perturbs the delicate balance of the VFS caches
(dentry/inode vs file page cache) such that the inode/dentry caches are
reclaimed much, much faster than the page cache and this drives us into
several other caching imbalance related problems.

As such, this is a bad change and needs to be reverted.

[ Needs some massaging to retain the later seekless shrinker
  modifications.]

Link: http://lkml.kernel.org/r/20190130041707.27750-3-david@fromorbit.com
Fixes: 172b06c32b ("mm: slowly shrink slabs with a relatively small number of objects")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Wolfgang Walter <linux@stwm.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Spock <dairinin@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-12 16:33:18 -08:00
Ira Weiny
ad8cfb9c42 mm/gup: Remove the 'write' parameter from gup_fast_permitted()
The 'write' parameter is unused in gup_fast_permitted() so remove it.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20190210223424.13934-1-ira.weiny@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11 08:20:40 +01:00
Davidlohr Bueso
70f8a3ca68 mm: make mm->pinned_vm an atomic64 counter
Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.

By making the counter atomic we no longer need to hold the mmap_sem and
can simply some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
David Hildenbrand
e0a352fabc mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
We had a race in the old balloon compaction code before b1123ea6d3
("mm: balloon: use general non-lru movable page feature") refactored it
that became visible after backporting 195a8c43e9 ("virtio-balloon:
deflate via a page list") without the refactoring.

The bug existed from commit d6d86c0a7f ("mm/balloon_compaction:
redesign ballooned pages management") till b1123ea6d3 ("mm: balloon:
use general non-lru movable page feature").  d6d86c0a7f
("mm/balloon_compaction: redesign ballooned pages management") was
backported to 3.12, so the broken kernels are stable kernels [3.12 -
4.7].

There was a subtle race between dropping the page lock of the newpage in
__unmap_and_move() and checking for __is_movable_balloon_page(newpage).

Just after dropping this page lock, virtio-balloon could go ahead and
deflate the newpage, effectively dequeueing it and clearing PageBalloon,
in turn making __is_movable_balloon_page(newpage) fail.

This resulted in dropping the reference of the newpage via
putback_lru_page(newpage) instead of put_page(newpage), leading to
page->lru getting modified and a !LRU page ending up in the LRU lists.
With 195a8c43e9 ("virtio-balloon: deflate via a page list")
backported, one would suddenly get corrupted lists in
release_pages_balloon():

- WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
- list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100

Nowadays this race is no longer possible, but it is hidden behind very
ugly handling of __ClearPageMovable() and __PageMovable().

__ClearPageMovable() will not make __PageMovable() fail, only
PageMovable().  So the new check (__PageMovable(newpage)) will still
hold even after newpage was dequeued by virtio-balloon.

If anybody would ever change that special handling, the BUG would be
introduced again.  So instead, make it explicit and use the information
of the original isolated page before migration.

This patch can be backported fairly easy to stable kernels (in contrast
to the refactoring).

Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com
Fixes: d6d86c0a7f ("mm/balloon_compaction: redesign ballooned pages management")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Vratislav Bendel <vbendel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vratislav Bendel <vbendel@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>	[3.12 - 4.7]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:24 -08:00
Michal Hocko
e3df4c6e48 mm, memory_hotplug: __offline_pages fix wrong locking
Jan has noticed that we do double unlock on some failure paths when
offlining a page range.  This is indeed the case when
test_pages_in_a_zone respp.  start_isolate_page_range fail.  This was an
omission when forward porting the debugging patch from an older kernel.

Fix the issue by dropping mem_hotplug_done from the failure condition
and keeping the single unlock in the catch all failure path.

Link: http://lkml.kernel.org/r/20190115120307.22768-1-mhocko@kernel.org
Fixes: 7960509329 ("mm, memory_hotplug: print reason for the offlining failure")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Naoya Horiguchi
6376360ecb mm: hwpoison: use do_send_sig_info() instead of force_sig()
Currently memory_failure() is racy against process's exiting, which
results in kernel crash by null pointer dereference.

The root cause is that memory_failure() uses force_sig() to forcibly
kill asynchronous (meaning not in the current context) processes.  As
discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
fixes, this is not a right thing to do.  OOM solves this issue by using
do_send_sig_info() as done in commit d2d393099d ("signal:
oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
patch is suggesting to do the same for hwpoison.  do_send_sig_info()
properly accesses to siglock with lock_task_sighand(), so is free from
the reported race.

I confirmed that the reported bug reproduces with inserting some delay
in kill_procs(), and it never reproduces with this patch.

Note that memory_failure() can send another type of signal using
force_sig_mceerr(), and the reported race shouldn't happen on it because
force_sig_mceerr() is called only for synchronous processes (i.e.
BUS_MCEERR_AR happens only when some process accesses to the corrupted
memory.)

Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Anders Roxell
0d0c8de878 kasan: mark file common so ftrace doesn't trace it
When option CONFIG_KASAN is enabled toghether with ftrace, function
ftrace_graph_caller() gets in to a recursion, via functions
kasan_check_read() and kasan_check_write().

 Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
 179             mcount_get_pc             x0    //     function's pc
 (gdb) bt
 #0  ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
 #1  0xffffff90101406c8 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151
 #2  0xffffff90106fd084 in kasan_check_write (p=0xffffffc06c170878, size=4) at ../mm/kasan/common.c:105
 #3  0xffffff90104a2464 in atomic_add_return (v=<optimized out>, i=<optimized out>) at ./include/generated/atomic-instrumented.h:71
 #4  atomic_inc_return (v=<optimized out>) at ./include/generated/atomic-fallback.h:284
 #5  trace_graph_entry (trace=0xffffffc03f5ff380) at ../kernel/trace/trace_functions_graph.c:441
 #6  0xffffff9010481774 in trace_graph_entry_watchdog (trace=<optimized out>) at ../kernel/trace/trace_selftest.c:741
 #7  0xffffff90104a185c in function_graph_enter (ret=<optimized out>, func=<optimized out>, frame_pointer=18446743799894897728, retp=<optimized out>) at ../kernel/trace/trace_functions_graph.c:196
 #8  0xffffff9010140628 in prepare_ftrace_return (self_addr=18446743592948977792, parent=0xffffffc03f5ff418, frame_pointer=18446743799894897728) at ../arch/arm64/kernel/ftrace.c:231
 #9  0xffffff90101406f4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182
 Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 (gdb)

Rework so that the kasan implementation isn't traced.

Link: http://lkml.kernel.org/r/20181212183447.15890-1-anders.roxell@linaro.org
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Shakeel Butt
cefc7ef3c8 mm, oom: fix use-after-free in oom_kill_process
Syzbot instance running on upstream kernel found a use-after-free bug in
oom_kill_process.  On further inspection it seems like the process
selected to be oom-killed has exited even before reaching
read_lock(&tasklist_lock) in oom_kill_process().  More specifically the
tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task()
and the put_task_struct within for_each_thread() frees the tsk and
for_each_thread() tries to access the tsk.  The easiest fix is to do
get/put across the for_each_thread() on the selected task.

Now the next question is should we continue with the oom-kill as the
previously selected task has exited? However before adding more
complexity and heuristics, let's answer why we even look at the children
of oom-kill selected task? The select_bad_process() has already selected
the worst process in the system/memcg.  Due to race, the selected
process might not be the worst at the kill time but does that matter?
The userspace can use the oom_score_adj interface to prefer children to
be killed before the parent.  I looked at the history but it seems like
this is there before git history.

Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
Fixes: 6b0c81b3be ("mm, oom: reduce dependency on tasklist_lock")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Oscar Salvador
eeb0efd071 mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
This is the same sort of error we saw in commit 17e2e7d7e1 ("mm,
page_alloc: fix has_unmovable_pages for HugePages").

Gigantic hugepages cross several memblocks, so it can be that the page
we get in scan_movable_pages() is a page-tail belonging to a
1G-hugepage.  If that happens, page_hstate()->size_to_hstate() will
return NULL, and we will blow up in hugepage_migration_supported().

The splat is as follows:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  #PF error: [normal kernel read fault]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 1 PID: 1350 Comm: bash Tainted: G            E     5.0.0-rc1-mm1-1-default+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  RIP: 0010:__offline_pages+0x6ae/0x900
  Call Trace:
   memory_subsys_offline+0x42/0x60
   device_offline+0x80/0xa0
   state_store+0xab/0xc0
   kernfs_fop_write+0x102/0x180
   __vfs_write+0x26/0x190
   vfs_write+0xad/0x1b0
   ksys_write+0x42/0x90
   do_syscall_64+0x5b/0x180
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E)

[akpm@linux-foundation.org: fix brace layout, per David.  Reduce indentation]
Link: http://lkml.kernel.org/r/20190122154407.18417-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Mikhail Zaslonko
24feb47c5f mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
If memory end is not aligned with the sparse memory section boundary,
the mapping of such a section is only partly initialized.  This may lead
to VM_BUG_ON due to uninitialized struct pages access from
test_pages_in_a_zone() function triggered by memory_hotplug sysfs
handlers.

Here are the the panic examples:
 CONFIG_DEBUG_VM_PGFLAGS=y
 kernel parameter mem=2050M
 --------------------------
 page:000003d082008000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
   test_pages_in_a_zone+0xde/0x160
   show_valid_zones+0x5c/0x190
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   test_pages_in_a_zone+0xde/0x160
 Kernel panic - not syncing: Fatal exception: panic_on_oops

Fix this by checking whether the pfn to check is within the zone.

[mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com]
Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org

[mhocko@suse.com: separated this change from
http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Michal Hocko
efad4e475c mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.

Mikhail Zaslonko has posted fixes for the two bugs quite some time ago
[1].  I have pushed back on those fixes because I believed that it is
much better to plug the problem at the initialization time rather than
play whack-a-mole all over the hotplug code and find all the places
which expect the full memory section to be initialized.

We have ended up with commit 2830bf6f05 ("mm, memory_hotplug:
initialize struct pages for the full memory section") merged and cause a
regression [2][3].  The reason is that there might be memory layouts
when two NUMA nodes share the same memory section so the merged fix is
simply incorrect.

In order to plug this hole we really have to be zone range aware in
those handlers.  I have split up the original patch into two.  One is
unchanged (patch 2) and I took a different approach for `removable'
crash.

[1] http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948
[3] http://lkml.kernel.org/r/20190125163938.GA20411@dhcp22.suse.cz

This patch (of 2):

Mikhail has reported the following VM_BUG_ON triggered when reading sysfs
removable state of a memory block:

 page:000003d08300c000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
   is_mem_section_removable+0xb4/0x190
   show_mem_removable+0x9a/0xd8
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   is_mem_section_removable+0xb4/0x190
 Kernel panic - not syncing: Fatal exception: panic_on_oops

The reason is that the memory block spans the zone boundary and we are
stumbling over an unitialized struct page.  Fix this by enforcing zone
range in is_mem_section_removable so that we never run away from a zone.

Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Debugged-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Tetsuo Handa
9bcdeb51bd oom, oom_reaper: do not enqueue same task twice
Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg is not 0.  It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing same task twice which
makes impossible to decrease the number of tasks in that memcg due to a
refcount leak.

This bug existed since the OOM reaper became invokable from
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,

  T1@P1     |T2@P1     |T3@P1     |OOM reaper
  ----------+----------+----------+------------
                                   # Processing an OOM victim in a different memcg domain.
                        try_charge()
                          mem_cgroup_out_of_memory()
                            mutex_lock(&oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(&oom_lock)
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(&oom_lock)
                            out_of_memory()
                              oom_kill_process(P1)
                                do_send_sig_info(SIGKILL, @P1)
                                mark_oom_victim(T1@P1)
                                wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                            mutex_unlock(&oom_lock)
                 out_of_memory()
                   mark_oom_victim(T2@P1)
                   wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                 mutex_unlock(&oom_lock)
      out_of_memory()
        mark_oom_victim(T1@P1)
        wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
      mutex_unlock(&oom_lock)
                                   # Completed processing an OOM victim in a different memcg domain.
                                   spin_lock(&oom_reaper_lock)
                                   # T1P1 is dequeued.
                                   spin_unlock(&oom_reaper_lock)

but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.

Fix this bug using an approach used by commit 855b018325 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
side effect of this patch, this patch also avoids enqueuing multiple
threads sharing memory via task_will_free_mem(current) path.

Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
Fixes: af8e15cc85 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: Jay Kamat <jgkamat@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Jan Kara
80409c65e2 mm: migrate: make buffer_migrate_page_norefs() actually succeed
Currently, buffer_migrate_page_norefs() was constantly failing because
buffer_migrate_lock_buffers() grabbed reference on each buffer.  In
fact, there's no reason for buffer_migrate_lock_buffers() to grab any
buffer references as the page is locked during all our operation and
thus nobody can reclaim buffers from the page.

So remove grabbing of buffer references which also makes
buffer_migrate_page_norefs() succeed.

Link: http://lkml.kernel.org/r/20190116131217.7226-1-jack@suse.cz
Fixes: 89cb0888ca "mm: migrate: provide buffer_migrate_page_norefs()"
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Andrea Arcangeli
1ac25013fb mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT
hugetlb needs the same fix as faultin_nopage (which was applied in
commit 96312e6128 ("mm/gup.c: teach get_user_pages_unlocked to handle
FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already
released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't
in the FOLL_NOWAIT case.

Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com
Fixes: ce53053ce3 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Oscar Salvador
1723058eab mm, memory_hotplug: don't bail out in do_migrate_range() prematurely
do_migrate_range() takes a memory range and tries to isolate the pages
to put them into a list.  This list will be later on used in
migrate_pages() to know the pages we need to migrate.

Currently, if we fail to isolate a single page, we put all already
isolated pages back to their LRU and we bail out from the function.
This is quite suboptimal, as this will force us to start over again
because scan_movable_pages will give us the same range.  If there is no
chance that we can isolate that page, we will loop here forever.

Issue debugged in [1] has proved that.  During the debugging of that
issue, it was noticed that if do_migrate_ranges() fails to isolate a
single page, we will just discard the work we have done so far and bail
out, which means that scan_movable_pages() will find again the same set
of pages.

Instead, we can just skip the error, keep isolating as much pages as
possible and then proceed with the call to migrate_pages().

This will allow us to do as much work as possible at once.

[1] https://lkml.org/lkml/2018/12/6/324

Michal said:

: I still think that this doesn't give us a whole picture.  Looping for
: ever is a bug.  Failing the isolation is quite possible and it should
: be a ephemeral condition (e.g.  a race with freeing the page or
: somebody else isolating the page for whatever reason).  And here comes
: the disadvantage of the current implementation.  We simply throw
: everything on the floor just because of a ephemeral condition.  The
: racy page_count check is quite dubious to prevent from that.

Link: http://lkml.kernel.org/r/20181211135312.27034-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:22 -08:00
Michal Hocko
4aa9fc2a43 Revert "mm, memory_hotplug: initialize struct pages for the full memory section"
This reverts commit 2830bf6f05.

The underlying assumption that one sparse section belongs into a single
numa node doesn't hold really. Robert Shteynfeld has reported a boot
failure. The boot log was not captured but his memory layout is as
follows:

  Early memory node ranges
    node   1: [mem 0x0000000000001000-0x0000000000090fff]
    node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
    node   1: [mem 0x0000000100000000-0x0000001423ffffff]
    node   0: [mem 0x0000001424000000-0x0000002023ffffff]

This means that node0 starts in the middle of a memory section which is
also in node1.  memmap_init_zone tries to initialize padding of a
section even when it is outside of the given pfn range because there are
code paths (e.g.  memory hotplug) which assume that the full worth of
memory section is always initialized.

In this particular case, though, such a range is already intialized and
most likely already managed by the page allocator.  Scribbling over
those pages corrupts the internal state and likely blows up when any of
those pages gets used.

Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com>
Fixes: 2830bf6f05 ("mm, memory_hotplug: initialize struct pages for the full memory section")
Cc: stable@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-28 10:35:22 -08:00
Linus Torvalds
6b8f915916 for-linus-20190125
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxLdgsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoVGD/4sGYQqfiXogQIJYbPH2RRPrJuLIIITjiAv
 lPXX1wx/tvz/ktwKJE2OiWTES0JjdH1HlC+7/0L/fLb8DXiKBmFuUHlwhureoL9Y
 o//BIQKuaje35kyHITTy2UAJOOqnNtJTaAP2AfkL+eOcj/V/G5rJIfLGs9QtuAR7
 sRJ+uhg1EbW/+uO0bULDmG6WUjxFu8mqcw3i6g0VVLVOnXB2EKcZTl3KPrdAXrUp
 XtmouERga6OfAUSJyZmTSV136mL+opRB2WFFVeIzjQfLmyItDGbSX/YPS8oJ2pow
 v7630F+CMrd4aKpqqtnAhfWpGqd0Xw7cYfZ9MKTJmZPmGzf9a1fQFpmgZosD4Dh3
 7MrhboU4TUt9PdXESA7CmE7LkTp99ghfj5/ysKrSV5h3HsH2RbLxJk91Rx3vmAWD
 u1xWRYL+GYLH6ZwOLvM1iqBrrLN3mUyrx98SaMgoXuqNzmQmgz9LPeA0Gt09FJbo
 uj+ebg4dRwuThjni4xQhl3zL2RQy7nlTDFDdKOz/XoiYk2NUVksss+sxGjNarHj0
 b5pCD4HOp57OreGExaOARpBRah5HSNdQtBRsIOnbyEq6f/e1LsIY23Z9nNF0deGO
 sZzgsbnsn+zg8bC6T/Gk4UY6XdCcgaS3SL04SVKAE3lO6A4C/Awo8DgD9Bl1zpC1
 HQlNkl5fBg==
 =iucY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190125' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A collection of fixes for this release. This contains:

   - Silence sparse rightfully complaining about non-static wbt
     functions (Bart)

   - Fixes for the zoned comments/ioctl documentation (Damien)

   - direct-io fix that's been lingering for a while (Ernesto)

   - cgroup writeback fix (Tejun)

   - Set of NVMe patches for nvme-rdma/tcp (Sagi, Hannes, Raju)

   - Block recursion tracking fix (Ming)

   - Fix debugfs command flag naming for a few flags (Jianchao)"

* tag 'for-linus-20190125' of git://git.kernel.dk/linux-block:
  block: Fix comment typo
  uapi: fix ioctl documentation
  blk-wbt: Declare local functions static
  blk-mq: fix the cmd_flag_name array
  nvme-multipath: drop optimization for static ANA group IDs
  nvmet-rdma: fix null dereference under heavy load
  nvme-rdma: rework queue maps handling
  nvme-tcp: fix timeout handler
  nvme-rdma: fix timeout handler
  writeback: synchronize sync(2) against cgroup writeback membership switches
  block: cover another queue enter recursion via BIO_QUEUE_ENTERED
  direct-io: allow direct writes to empty inodes
2019-01-26 12:42:41 -08:00
Linus Torvalds
30bac164ac Revert "Change mincore() to count "mapped" pages rather than "cached" pages"
This reverts commit 574823bfab.

It turns out that my hope that we could just remove the code that
exposes the cache residency status from mincore() was too optimistic.

There are various random users that want it, and one example would be
the Netflix database cluster maintenance. To quote Josh Snyder:

 "For Netflix, losing accurate information from the mincore syscall
  would lengthen database cluster maintenance operations from days to
  months. We rely on cross-process mincore to migrate the contents of a
  page cache from machine to machine, and across reboots.

  To do this, I wrote and maintain happycache [1], a page cache
  dumper/loader tool. It is quite similar in architecture to pgfincore,
  except that it is agnostic to workload. The gist of happycache's
  operation is "produce a dump of residence status for each page, do
  some operation, then reload exactly the same pages which were present
  before." happycache is entirely dependent on accurate reporting of the
  in-core status of file-backed pages, as accessed by another process.

  We primarily use happycache with Cassandra, which (like Postgres +
  pgfincore) relies heavily on OS page cache to reduce disk accesses.
  Because our workloads never experience a cold page cache, we are able
  to provision hardware for a peak utilization level that is far lower
  than the hypothetical "every query is a cache miss" peak.

  A database warmed by happycache can be ready for service in seconds
  (bounded only by the performance of the drives and the I/O subsystem),
  with no period of in-service degradation. By contrast, putting a
  database in service without a page cache entails a potentially
  unbounded period of degradation (at Netflix, the time to populate a
  single node's cache via natural cache misses varies by workload from
  hours to weeks). If a single node upgrade were to take weeks, then
  upgrading an entire cluster would take months. Since we want to apply
  security upgrades (and other things) on a somewhat tighter schedule,
  we would have to develop more complex solutions to provide the same
  functionality already provided by mincore.

  At the bottom line, happycache is designed to benignly exploit the
  same information leak documented in the paper [2]. I think it makes
  perfect sense to remove cross-process mincore functionality from
  unprivileged users, but not to remove it entirely"

We do have an alternate approach that limits the cache residency
reporting only to processes that have write permissions to the file, so
we can fix the original information leak issue that way.  It involves
_adding_ code rather than removing it, which is sad, but hey, at least
we haven't found any users that would find the restrictions
unacceptable.

So revert the optimistic first approach to make room for that alternate
fix instead.

Reported-by: Josh Snyder <joshs@netflix.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Daniel Gruss <daniel@gruss.cc>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-24 09:04:37 +13:00
Tejun Heo
7fc5854f8c writeback: synchronize sync(2) against cgroup writeback membership switches
sync_inodes_sb() can race against cgwb (cgroup writeback) membership
switches and fail to writeback some inodes.  For example, if an inode
switches to another wb while sync_inodes_sb() is in progress, the new
wb might not be visible to bdi_split_work_to_wbs() at all or the inode
might jump from a wb which hasn't issued writebacks yet to one which
already has.

This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
switch path against sync_inodes_sb() so that sync_inodes_sb() is
guaranteed to see all the target wbs and inodes can't jump wbs to
escape syncing.

v2: Fixed misplaced rwsem init.  Spotted by Jiufei.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiufei Xue <xuejiufei@gmail.com>
Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-22 14:39:38 -07:00
Sean Christopherson
ba42273131 mm/mmu_notifier: mm/rmap.c: Fix a mmu_notifier range bug in try_to_unmap_one
The conversion to use a structure for mmu_notifier_invalidate_range_*()
unintentionally changed the usage in try_to_unmap_one() to init the
'struct mmu_notifier_range' with vma->vm_start instead of @address,
i.e. it invalidates the wrong address range.  Revert to the correct
address range.

Manifests as KVM use-after-free WARNINGs and subsequent "BUG: Bad page
state in process X" errors when reclaiming from a KVM guest due to KVM
removing the wrong pages from its own mappings.

Reported-by: leozinho29_eu@hotmail.com
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Pankaj gupta <pagupta@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: ac46d4f3c4 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-10 02:58:21 -08:00
Mel Gorman
73444bc4d8 mm, page_alloc: do not wake kswapd with zone lock held
syzbot reported the following regression in the latest merge window and
it was confirmed by Qian Cai that a similar bug was visible from a
different context.

  ======================================================
  WARNING: possible circular locking dependency detected
  4.20.0+ #297 Not tainted
  ------------------------------------------------------
  syz-executor0/8529 is trying to acquire lock:
  000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
  __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120

  but task is already holding lock:
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
  include/linux/spinlock.h:329 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
  mm/page_alloc.c:2548 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
  mm/page_alloc.c:3021 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
  mm/page_alloc.c:3050 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
  mm/page_alloc.c:3072 [inline]
  000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
  get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491

It appears to be a false positive in that the only way the lock ordering
should be inverted is if kswapd is waking itself and the wakeup
allocates debugging objects which should already be allocated if it's
kswapd doing the waking.  Nevertheless, the possibility exists and so
it's best to avoid the problem.

This patch flags a zone as needing a kswapd using the, surprisingly,
unused zone flag field.  The flag is read without the lock held to do
the wakeup.  It's possible that the flag setting context is not the same
as the flag clearing context or for small races to occur.  However, each
race possibility is harmless and there is no visible degredation in
fragmentation treatment.

While zone->flag could have continued to be unused, there is potential
for moving some existing fields into the flags field instead.
Particularly read-mostly ones like zone->initialized and
zone->contiguous.

Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
Fixes: 1c30844d2d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Qian Cai <cai@lca.pw>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Mike Kravetz
ddeaab32a8 hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"
This reverts b43a999005

The reverted commit caused issues with migration and poisoning of anon
huge pages.  The LTP move_pages12 test will cause an "unable to handle
kernel NULL pointer" BUG would occur with stack similar to:

  RIP: 0010:down_write+0x1b/0x40
  Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

The purpose of the reverted patch was to fix some long existing races
with huge pmd sharing.  It used i_mmap_rwsem for this purpose with the
idea that this could also be used to address truncate/page fault races
with another patch.  Further analysis has determined that i_mmap_rwsem
can not be used to address all these hugetlbfs synchronization issues.
Therefore, revert this patch while working an another approach to the
underlying issues.

Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Mike Kravetz
e7c5809779 hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race"
This reverts c86aa7bbfd

The reverted commit caused ABBA deadlocks when file migration raced with
file eviction for specific hugetlbfs files.  This was discovered with a
modified version of the LTP move_pages12 test.

The purpose of the reverted patch was to close a long existing race
between hugetlbfs file truncation and page faults.  After more analysis
of the patch and impacted code, it was determined that i_mmap_rwsem can
not be used for all required synchronization.  Therefore, revert this
patch while working an another approach to the underlying issue.

Link: http://lkml.kernel.org/r/20190103235452.29335-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Jan Stancek
8ab88c7169 mm: page_mapped: don't assume compound page is huge or THP
LTP proc01 testcase has been observed to rarely trigger crashes
on arm64:
    page_mapped+0x78/0xb4
    stable_page_flags+0x27c/0x338
    kpageflags_read+0xfc/0x164
    proc_reg_read+0x7c/0xb8
    __vfs_read+0x58/0x178
    vfs_read+0x90/0x14c
    SyS_read+0x60/0xc0

The issue is that page_mapped() assumes that if compound page is not
huge, then it must be THP.  But if this is 'normal' compound page
(COMPOUND_PAGE_DTOR), then following loop can keep running (for
HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
mapped and triggers a panic:

        for (i = 0; i < hpage_nr_pages(page); i++) {
                if (atomic_read(&page[i]._mapcount) >= 0)
                        return true;
	}

I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
with a custom kernel module [1] which:
 - allocates compound page (PAGEC) of order 1
 - allocates 2 normal pages (COPY), which are initialized to 0xff (to
   satisfy _mapcount >= 0)
 - 2 PAGEC page structs are copied to address of first COPY page
 - second page of COPY is marked as not present
 - call to page_mapped(COPY) now triggers fault on access to 2nd COPY
   page at offset 0x30 (_mapcount)

[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c

Fix the loop to iterate for "1 << compound_order" pages.

Kirrill said "IIRC, sound subsystem can producuce custom mapped compound
pages".

Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.1543577156.git.jstancek@redhat.com
Fixes: e1534ae950 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Debugged-by: Laszlo Ersek <lersek@redhat.com>
Suggested-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Matthew Wilcox
1ed7293ac4 mm/memory.c: initialise mmu_notifier_range correctly
One of the paths in follow_pte_pmd() initialised the mmu_notifier_range
incorrectly.

Link: http://lkml.kernel.org/r/20190103002126.GM6310@bombadil.infradead.org
Fixes: ac46d4f3c4 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Andrey Konovalov
a3fe7cdf02 kasan: fix krealloc handling for tag-based mode
Right now tag-based KASAN can retag the memory that is reallocated via
krealloc and return a differently tagged pointer even if the same slab
object gets used and no reallocated technically happens.

There are a few issues with this approach.  One is that krealloc callers
can't rely on comparing the return value with the passed argument to
check whether reallocation happened.  Another is that if a caller knows
that no reallocation happened, that it can access object memory through
the old pointer, which leads to false positives.  Look at
nf_ct_ext_add() to see an example.

Fix this by keeping the same tag if the memory don't actually gets
reallocated during krealloc.

Link: http://lkml.kernel.org/r/bb2a71d17ed072bcc528cbee46fcbd71a6da3be4.1546540962.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Andrey Konovalov
96fedce27e kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY
With CONFIG_HARDENED_USERCOPY enabled __check_heap_object() compares and
then subtracts a potentially tagged pointer with a non-tagged address of
the page that this pointer belongs to, which leads to unexpected
behavior.

Untag the pointer in __check_heap_object() before doing any of these
operations.

Link: http://lkml.kernel.org/r/7e756a298d514c4482f52aea6151db34818d395d.1546540962.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Andrey Konovalov
eb214f2dda kasan, arm64: use ARCH_SLAB_MINALIGN instead of manual aligning
Instead of changing cache->align to be aligned to KASAN_SHADOW_SCALE_SIZE
in kasan_cache_create() we can reuse the ARCH_SLAB_MINALIGN macro.

Link: http://lkml.kernel.org/r/52ddd881916bcc153a9924c154daacde78522227.1546540962.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Michal Hocko
63f3655f95 mm, memcg: fix reclaim deadlock with writeback
Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

  task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

  task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

He adds
 "task1 is waiting for the PageWriteback bit of the page that task2 has
  collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
  LOCKED bit the page which tasks1 has locked"

More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg.  That in turn hits a
memory limit reclaim and the memcg reclaim for legacy controller is
waiting on the writeback but that is never going to finish because the
writeback itself is waiting for the page locked in the #PF path.  So
this is essentially ABBA deadlock:

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_pne
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

This accumulating of more pages to flush is used by several filesystems
to generate a more optimal IO patterns.

Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO
throttling available for the controller.  There is no easy way around
that unfortunately.  Therefore fix this specific issue by pre-allocating
the page table outside of the page lock.  We have that handy
infrastructure for that already so simply reuse the fault-around pattern
which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare.  I am not
aware of a better solution unfortunately.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
  Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fc ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Qian Cai
7bff3c0699 mm/usercopy.c: no check page span for stack objects
It is easy to trigger this with CONFIG_HARDENED_USERCOPY_PAGESPAN=y,

  usercopy: Kernel memory overwrite attempt detected to spans multiple pages (offset 0, size 23)!
  kernel BUG at mm/usercopy.c:102!

For example,

print_worker_info
char name[WQ_NAME_LEN] = { };
char desc[WORKER_DESC_LEN] = { };
  probe_kernel_read(name, wq->name, sizeof(name) - 1);
  probe_kernel_read(desc, worker->desc, sizeof(desc) - 1);
    __copy_from_user_inatomic
      check_object_size
        check_heap_object
          check_page_span

This is because on-stack variables could cross PAGE_SIZE boundary, and
failed this check,

if (likely(((unsigned long)ptr & (unsigned long)PAGE_MASK) ==
	   ((unsigned long)end & (unsigned long)PAGE_MASK)))

ptr = FFFF889007D7EFF8
end = FFFF889007D7F00E

Hence, fix it by checking if it is a stack object first.

[keescook@chromium.org: improve comments after reorder]
  Link: http://lkml.kernel.org/r/20190103165151.GA32845@beast
Link: http://lkml.kernel.org/r/20181231030254.99441-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Christoph Lameter
09c2e76ed7 slab: alien caches must not be initialized if the allocation of the alien cache failed
Callers of __alloc_alien() check for NULL.  We must do the same check in
__alloc_alien_cache to avoid NULL pointer dereferences on allocation
failures.

Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e98974906aa-000000@email.amazonses.com
Fixes: 49dfc304ba ("slab: use the lock on alien_cache, instead of the lock on array_cache")
Fixes: c8522a3a58 ("Slab: introduce alloc_alien")
Signed-off-by: Christoph Lameter <cl@linux.com>
Reported-by: syzbot+d6ed4ec679652b4fd4e4@syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Linus Torvalds
574823bfab Change mincore() to count "mapped" pages rather than "cached" pages
The semantics of what "in core" means for the mincore() system call are
somewhat unclear, but Linux has always (since 2.3.52, which is when
mincore() was initially done) treated it as "page is available in page
cache" rather than "page is mapped in the mapping".

The problem with that traditional semantic is that it exposes a lot of
system cache state that it really probably shouldn't, and that users
shouldn't really even care about.

So let's try to avoid that information leak by simply changing the
semantics to be that mincore() counts actual mapped pages, not pages
that might be cheaply mapped if they were faulted (note the "might be"
part of the old semantics: being in the cache doesn't actually guarantee
that you can access them without IO anyway, since things like network
filesystems may have to revalidate the cache before use).

In many ways the old semantics were somewhat insane even aside from the
information leak issue.  From the very beginning (and that beginning is
a long time ago: 2.3.52 was released in March 2000, I think), the code
had a comment saying

  Later we can get more picky about what "in core" means precisely.

and this is that "later".  Admittedly it is much later than is really
comfortable.

NOTE! This is a real semantic change, and it is for example known to
change the output of "fincore", since that program literally does a
mmmap without populating it, and then doing "mincore()" on that mapping
that doesn't actually have any pages in it.

I'm hoping that nobody actually has any workflow that cares, and the
info leak is real.

We may have to do something different if it turns out that people have
valid reasons to want the old semantics, and if we can limit the
information leak sanely.

Cc: Kevin Easton <kevin@guarana.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Masatake YAMATO <yamato@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-06 13:43:02 -08:00
Linus Torvalds
a65981109f Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - procfs updates

 - various misc bits

 - lib/ updates

 - epoll updates

 - autofs

 - fatfs

 - a few more MM bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits)
  mm/page_io.c: fix polled swap page in
  checkpatch: add Co-developed-by to signature tags
  docs: fix Co-Developed-by docs
  drivers/base/platform.c: kmemleak ignore a known leak
  fs: don't open code lru_to_page()
  fs/: remove caller signal_pending branch predictions
  mm/: remove caller signal_pending branch predictions
  arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
  kernel/sched/: remove caller signal_pending branch predictions
  kernel/locking/mutex.c: remove caller signal_pending branch predictions
  mm: select HAVE_MOVE_PMD on x86 for faster mremap
  mm: speed up mremap by 20x on large regions
  mm: treewide: remove unused address argument from pte_alloc functions
  initramfs: cleanup incomplete rootfs
  scripts/gdb: fix lx-version string output
  kernel/kcov.c: mark write_comp_data() as notrace
  kernel/sysctl: add panic_print into sysctl
  panic: add options to print system info when panic happens
  bfs: extra sanity checking and static inode bitmap
  exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
  ...
2019-01-05 09:16:18 -08:00
Jens Axboe
b685a7350a mm/page_io.c: fix polled swap page in
swap_readpage() wants to do polling to bring in pages if asked to, but
it doesn't mark the bio as being polled.  Additionally, the looping
around the blk_poll() check isn't correct - if we get a zero return, we
should call io_schedule(), we can't just assume that the bio has
completed.  The regular bio->bi_private check should be used for that.

Link: http://lkml.kernel.org/r/e15243a8-2cdf-c32c-ecee-f289377c8ef9@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:48 -08:00
Nikolay Borisov
f86196ea87 fs: don't open code lru_to_page()
Multiple filesystems open code lru_to_page().  Rectify this by moving
the macro from mm_inline (which is specific to lru stuff) to the more
generic mm.h header and start using the macro where appropriate.

No functional changes.

Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com>		[ceph]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:48 -08:00
Davidlohr Bueso
fa45f1162f mm/: remove caller signal_pending branch predictions
This is already done for us internally by the signal machinery.

Link: http://lkml.kernel.org/r/20181116002713.8474-5-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:48 -08:00
Joel Fernandes (Google)
2c91bd4a4e mm: speed up mremap by 20x on large regions
Android needs to mremap large regions of memory during memory management
related operations.  The mremap system call can be really slow if THP is
not enabled.  The bottleneck is move_page_tables, which is copying each
pte at a time, and can be really slow across a large map.  Turning on
THP may not be a viable option, and is not for us.  This patch speeds up
the performance for non-THP system by copying at the PMD level when
possible.

The speedup is an order of magnitude on x86 (~20x).  On a 1GB mremap,
the mremap completion times drops from 3.4-3.6 milliseconds to 144-160
microseconds.

Before:
Total mremap time for 1GB data: 3521942 nanoseconds.
Total mremap time for 1GB data: 3449229 nanoseconds.
Total mremap time for 1GB data: 3488230 nanoseconds.

After:
Total mremap time for 1GB data: 150279 nanoseconds.
Total mremap time for 1GB data: 144665 nanoseconds.
Total mremap time for 1GB data: 158708 nanoseconds.

If THP is enabled the optimization is mostly skipped except in certain
situations.

[joel@joelfernandes.org: fix 'move_normal_pmd' unused function warning]
  Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com
Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:48 -08:00
Joel Fernandes (Google)
4cf5892495 mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap".

This series speeds up the mremap(2) syscall by copying page tables at
the PMD level even for non-THP systems.  There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work.  Also we find that there is no point in passing the 'address' to
pte_alloc since its unused.  This patch therefore removes this argument
tree-wide resulting in a nice negative diff as well.  Also ensuring
along the way that the enabled architectures do not do anything funky
with the 'address' argument that goes unnoticed by the optimization.

Build and boot tested on x86-64.  Build tested on arm64.  The config
enablement patch for arm64 will be posted in the future after more
testing.

The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from  pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.

// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.

virtual patch

@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@

 fn(...
- , T2 E2
 )
 { ... }

@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)

@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)

@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

 fn(...
-,  E2
 )

@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@

(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)

Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:47 -08:00
Linus Torvalds
96d4f267e4 Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access.  But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model.  And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

 - csky still had the old "verify_area()" name as an alias.

 - the iter_iov code had magical hardcoded knowledge of the actual
   values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
   really used it)

 - microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something.  Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03 18:57:57 -08:00
Linus Torvalds
1ac5cd4978 block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE)
This mostly reverts commit 849a370016 ("block: avoid ordered task
state change for polled IO").  It was wrongly claiming that the ordering
wasn't necessary.  The memory barrier _is_ necessary.

If something is truly polling and not going to sleep, it's the whole
state setting that is unnecessary, not the memory barrier.  Whenever you
set your state to a sleeping state, you absolutely need the memory
barrier.

Note that sometimes the memory barrier can be elsewhere.  For example,
the ordering might be provided by an external lock, or by setting the
process state to sleeping before adding yourself to the wait queue list
that is used for waking up (where the wait queue lock itself will
guarantee that any wakeup will correctly see the sleeping state).

But none of those cases were true here.

NOTE! Some of the polling paths may indeed be able to drop the state
setting entirely, at which point the memory barrier also goes away.

(Also note that this doesn't revert the TASK_RUNNING cases: there is no
race between a wakeup and setting the process state to TASK_RUNNING,
since the end result doesn't depend on ordering).

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-02 10:46:03 -08:00
Linus Torvalds
3868772b99 A fairly normal cycle for documentation stuff. We have a new
document on perf security, more Italian translations, more
 improvements to the memory-management docs, improvements to the
 pathname lookup documentation, and the usual array of smaller
 fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAlwmSPkPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Y9ZoH/joPnMFykOxS0SmdfI7Z+F4EiJct/ZwF9bHx
 T673T0RC30IgnUXGmBl5OtktfWqVh9aGqHOGwgh65ybp2QvzemdP0k6Lu6RtwNk9
 6LfkpvuUb8FzaQmCHnSMzMSDmXtZUw3Z/mOjCBcQtfGAsUULNT08xl+Dr+gwWIWt
 H+gPEEP+MCXTOQO1jm2dHOHW8NGm6XOijMTpOxp/pkoEY5tUxkVB1T//8EeX7LVh
 c1QHzFrufE3bmmubCLtIuyVqZbm/V5l6rHREDQ46fnH/G9fM4gojzsrAL/Y2m4bt
 E4y0XJHycjLMRDimAnYhbPm1ryTFAX1lNzHP3M/EF6Heqx8YHAk=
 =vtwu
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.0' of git://git.lwn.net/linux

Pull documentation update from Jonathan Corbet:
 "A fairly normal cycle for documentation stuff. We have a new document
  on perf security, more Italian translations, more improvements to the
  memory-management docs, improvements to the pathname lookup
  documentation, and the usual array of smaller fixes.

  As is often the case, there are a few reaches outside of
  Documentation/ to adjust kerneldoc comments"

* tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
  docs: improve pathname-lookup document structure
  configfs: fix wrong name of struct in documentation
  docs/mm-api: link slab_common.c to "The Slab Cache" section
  slab: make kmem_cache_create{_usercopy} description proper kernel-doc
  doc:process: add links where missing
  docs/core-api: make mm-api.rst more structured
  x86, boot: documentation whitespace fixup
  Documentation: devres: note checking needs when converting
  doc🇮🇹 add some process/* translations
  doc🇮🇹 fixes in process/1.Intro
  Documentation: convert path-lookup from markdown to resturctured text
  Documentation/admin-guide: update admin-guide index.rst
  Documentation/admin-guide: introduce perf-security.rst file
  scripts/kernel-doc: Fix struct and struct field attribute processing
  Documentation: dev-tools: Fix typos in index.rst
  Correct gen_init_cpio tool's documentation
  Document /proc/pid PID reuse behavior
  Documentation: update path-lookup.md for parallel lookups
  Documentation: Use "while" instead of "whilst"
  dmaengine: Add mailing list address to the documentation
  ...
2018-12-29 11:21:49 -08:00
Linus Torvalds
55db91fbaa Merge branch 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu update from Dennis Zhou:
 "Michael Cree noted generic UP Alpha has been broken since v3.18. This
  is a small fix for locking in UP percpu code that fixes the issue"

* 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu: convert spin_lock_irq to spin_lock_irqsave.
2018-12-29 10:13:23 -08:00
Linus Torvalds
f346b0becb Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - large KASAN update to use arm's "software tag-based mode"

 - a few misc things

 - sh updates

 - ocfs2 updates

 - just about all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits)
  kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
  memcg, oom: notify on oom killer invocation from the charge path
  mm, swap: fix swapoff with KSM pages
  include/linux/gfp.h: fix typo
  mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  memory_hotplug: add missing newlines to debugging output
  mm: remove __hugepage_set_anon_rmap()
  include/linux/vmstat.h: remove unused page state adjustment macro
  mm/page_alloc.c: allow error injection
  mm: migrate: drop unused argument of migrate_page_move_mapping()
  blkdev: avoid migration stalls for blkdev pages
  mm: migrate: provide buffer_migrate_page_norefs()
  mm: migrate: move migrate_page_lock_buffers()
  mm: migrate: lock buffers before migrate_page_move_mapping()
  mm: migration: factor out code to compute expected number of page references
  mm, page_alloc: enable pcpu_drain with zone capability
  kmemleak: add config to select auto scan
  mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
  ...
2018-12-28 16:55:46 -08:00
Linus Torvalds
0e9da3fbf7 for-4.21/block-20181221
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7R8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjiID/97oDjMhNT7rwpuMbHw855h62j1hEN/m+N3
 FI0uxivYoYZLD+eJRnMcBwHlKjrCX8iJQAcv9ffI3ThtFW7dnZT3atUacaZVR/Dt
 IrxdymdBP3qsmuaId5NYBug7rJ+AiqFJKjEvCcSPu5X397J4I3SEbzhfvYLJ/aZX
 16o0HJlVVIrcbmq1IP4HwiIIOaKXvPaw04L4z4fpeynRSWG7EAi8NLSnhlR4Rxbb
 BTiMkCTsjRCFdyO6da4fvNQKWmPGPa3bJkYy3qR99cvJCeIbQjRyCloQlWNJRRgi
 3eJpCHVxqFmN0/+DNTJVQEEr4H8o0AVucrLVct1Jc4pessenkpoUniP8vELqwlng
 Z2VHLkhTfCEmvFlk82grrYdNvGATRsrbswt/PlP4T7rBfr1IpDk8kXDWF59EL2dy
 ly35Sk3wJGHBl8qa+vEPXOAnaWdqJXuVGpwB4ifOIatOls8mOxwfZjiRc7x05/fC
 1O4rR2IfLwRqwoYHs0AJ+h6ohOSn1mkGezl2Tch1VSFcJUOHmuYvraTaUi6hblpA
 SslaAoEhO39hRBL0HsvsMeqVWM9uzqvFkLDCfNPdiA81H1258CIbo4vF8z6czCIS
 eeXnTJxVhPVbZgb3a1a93SPwM6KIDZFoIijyd+NqjpU94thlnhYD0QEcKJIKH7os
 2p4aHs6ktw==
 =TRdW
 -----END PGP SIGNATURE-----

Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "This is the main pull request for block/storage for 4.21.

  Larger than usual, it was a busy round with lots of goodies queued up.
  Most notable is the removal of the old IO stack, which has been a long
  time coming. No new features for a while, everything coming in this
  week has all been fixes for things that were previously merged.

  This contains:

   - Use atomic counters instead of semaphores for mtip32xx (Arnd)

   - Cleanup of the mtip32xx request setup (Christoph)

   - Fix for circular locking dependency in loop (Jan, Tetsuo)

   - bcache (Coly, Guoju, Shenghui)
      * Optimizations for writeback caching
      * Various fixes and improvements

   - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
      * host and target support for NVMe over TCP
      * Error log page support
      * Support for separate read/write/poll queues
      * Much improved polling
      * discard OOM fallback
      * Tracepoint improvements

   - lightnvm (Hans, Hua, Igor, Matias, Javier)
      * Igor added packed metadata to pblk. Now drives without metadata
        per LBA can be used as well.
      * Fix from Geert on uninitialized value on chunk metadata reads.
      * Fixes from Hans and Javier to pblk recovery and write path.
      * Fix from Hua Su to fix a race condition in the pblk recovery
        code.
      * Scan optimization added to pblk recovery from Zhoujie.
      * Small geometry cleanup from me.

   - Conversion of the last few drivers that used the legacy path to
     blk-mq (me)

   - Removal of legacy IO path in SCSI (me, Christoph)

   - Removal of legacy IO stack and schedulers (me)

   - Support for much better polling, now without interrupts at all.
     blk-mq adds support for multiple queue maps, which enables us to
     have a map per type. This in turn enables nvme to have separate
     completion queues for polling, which can then be interrupt-less.
     Also means we're ready for async polled IO, which is hopefully
     coming in the next release.

   - Killing of (now) unused block exports (Christoph)

   - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

   - Support for zoned testing with null_blk (Masato)

   - sx8 conversion to per-host tag sets (Christoph)

   - IO priority improvements (Damien)

   - mq-deadline zoned fix (Damien)

   - Ref count blkcg series (Dennis)

   - Lots of blk-mq improvements and speedups (me)

   - sbitmap scalability improvements (me)

   - Make core inflight IO accounting per-cpu (Mikulas)

   - Export timeout setting in sysfs (Weiping)

   - Cleanup the direct issue path (Jianchao)

   - Export blk-wbt internals in block debugfs for easier debugging
     (Ming)

   - Lots of other fixes and improvements"

* tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
  kyber: use sbitmap add_wait_queue/list_del wait helpers
  sbitmap: add helpers for add/del wait queue handling
  block: save irq state in blkg_lookup_create()
  dm: don't reuse bio for flushes
  nvme-pci: trace SQ status on completions
  nvme-rdma: implement polling queue map
  nvme-fabrics: allow user to pass in nr_poll_queues
  nvme-fabrics: allow nvmf_connect_io_queue to poll
  nvme-core: optionally poll sync commands
  block: make request_to_qc_t public
  nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
  nvme-tcp: fix endianess annotations
  nvmet-tcp: fix endianess annotations
  nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
  nvme-pci: only set nr_maps to 2 if poll queues are supported
  nvmet: use a macro for default error location
  nvmet: fix comparison of a u16 with -1
  blk-mq: enable IO poll if .nr_queues of type poll > 0
  blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
  blk-mq: skip zero-queue maps in blk_mq_map_swqueue
  ...
2018-12-28 13:19:59 -08:00
Michal Hocko
7056d3a37d memcg, oom: notify on oom killer invocation from the charge path
Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
eventfd anymore.  The reason is that 29ef680ae7 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back to
the charge path.  While doing so the notification was left behind in
mem_cgroup_oom_synchronize.

Fix the issue by replicating the oom hierarchy locking and the
notification.

Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
Fixes: 29ef680ae7 ("memcg, oom: move out_of_memory back to the charge path")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Burt Holzman <burt@fnal.gov>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com
Cc: <stable@vger.kernel.org>	[4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:52 -08:00
Huang Ying
7af7a8e19f mm, swap: fix swapoff with KSM pages
KSM pages may be mapped to the multiple VMAs that cannot be reached from
one anon_vma.  So during swapin, a new copy of the page need to be
generated if a different anon_vma is needed, please refer to comments of
ksm_might_need_to_copy() for details.

During swapoff, unuse_vma() uses anon_vma (if available) to locate VMA and
virtual address mapped to the page, so not all mappings to a swapped out
KSM page could be found.  So in try_to_unuse(), even if the swap count of
a swap entry isn't zero, the page needs to be deleted from swap cache, so
that, in the next round a new page could be allocated and swapin for the
other mappings of the swapped out KSM page.

But this contradicts with the THP swap support.  Where the THP could be
deleted from swap cache only after the swap count of every swap entry in
the huge swap cluster backing the THP has reach 0.  So try_to_unuse() is
changed in commit e07098294a ("mm, THP, swap: support to reclaim swap
space for THP swapped out") to check that before delete a page from swap
cache, but this has broken KSM swapoff too.

Fortunately, KSM is for the normal pages only, so the original behavior
for KSM pages could be restored easily via checking PageTransCompound().
That is how this patch works.

The bug is introduced by e07098294a ("mm, THP, swap: support to reclaim
swap space for THP swapped out"), which is merged by v4.14-rc1.  So I
think we should backport the fix to from 4.14 on.  But Hugh thinks it may
be rare for the KSM pages being in the swap device when swapoff, so nobody
reports the bug so far.

Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
Fixes: e07098294a ("mm, THP, swap: support to reclaim swap space for THP swapped out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Hugh Dickins <hughd@google.com>
Tested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:52 -08:00
Dan Williams
063a7d1d36 mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
The kbuild robot reported the following on a development branch that used
memremap.h in a new path:

   In file included from arch/m68k/include/asm/pgtable_mm.h:148:0,
                     from arch/m68k/include/asm/pgtable.h:5,
                     from include/linux/memremap.h:7,
                     from drivers//dax/bus.c:3:
    arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset':
 >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct'
      return mm->pgd + pgd_index(address);
               ^~

The ->page_fault() callback is specific to HMM.  Move it to 'struct
hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in
include/linux/hmm.h.  Longer term refactoring this dependency out of HMM
is recommended, but in the meantime memremap.h remains generic.

Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 5042db43cc ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:52 -08:00
Mike Kravetz
c86aa7bbfd hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race.  One obvious omission in the
current code is removing a page newly added to the page cache.  This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations.  To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out.  There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv.  Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.

Instead of writing the required complicated code for this rare occurrence,
just eliminate the race.  i_mmap_rwsem is now held in read mode for the
duration of page fault processing.  Hold i_mmap_rwsem longer in truncation
and hold punch code to cover the call to remove_inode_hugepages.

With this modification, code in remove_inode_hugepages checking for races
becomes 'dead' as it can not longer happen.  Remove the dead code and
expand comments to explain reasoning.  Similarly, checks for races with
truncation in the page fault path can be simplified and removed.

[mike.kravetz@oracle.com: incorporat suggestions from Kirill]
  Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes: ebed4bfc8d ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:52 -08:00
Mike Kravetz
b43a999005 hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table.  Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file.  As part of truncation, it
unmaps everyone who has the file mapped.  If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd.  If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse.  This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:

- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with the
  ptep.

- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
  called.

[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c99 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Kirill Tkhai
451b9514a5 mm: remove __hugepage_set_anon_rmap()
This function is identical to __page_set_anon_rmap() since the time, when
it was introduced (8 years ago).  The patch removes the function, and
makes its users to use __page_set_anon_rmap() instead.

Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Benjamin Poirier
af3b854492 mm/page_alloc.c: allow error injection
Model call chain after should_failslab().  Likewise, we can now use a
kprobe to override the return value of should_fail_alloc_page() and inject
allocation failures into alloc_page*().

This will allow injecting allocation failures using the BCC tools even
without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a
fail_page_alloc= parameter, which incurs some overhead even when failures
are not being injected.  On the other hand, this patch adds an
unconditional call to should_fail_alloc_page() from page allocation
hotpath.  That overhead should be rather negligible with
CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.

[vbabka@suse.cz: changelog addition]
Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Jan Kara
ab41ee6879 mm: migrate: drop unused argument of migrate_page_move_mapping()
All callers of migrate_page_move_mapping() now pass NULL for 'head'
argument.  Drop it.

Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Jan Kara
89cb0888ca mm: migrate: provide buffer_migrate_page_norefs()
Provide a variant of buffer_migrate_page() that also checks whether there
are no unexpected references to buffer heads.  This function will then be
safe to use for block device pages.

[akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)]
Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Jan Kara
84ade7c15c mm: migrate: move migrate_page_lock_buffers()
buffer_migrate_page() is the only caller of migrate_page_lock_buffers()
move it close to it and also drop the now unused stub for !CONFIG_BLOCK.

Link: http://lkml.kernel.org/r/20181211172143.7358-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Jan Kara
cc4f11e69f mm: migrate: lock buffers before migrate_page_move_mapping()
Lock buffers before calling into migrate_page_move_mapping() so that that
function doesn't have to know about buffers (which is somewhat unexpected
anyway) and all the buffer head logic is in buffer_migrate_page().

Link: http://lkml.kernel.org/r/20181211172143.7358-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Jan Kara
0b3901b38d mm: migration: factor out code to compute expected number of page references
Patch series "mm: migrate: Fix page migration stalls for blkdev pages".

This patchset deals with page migration stalls that were reported by our
customer due to a block device page that had a bufferhead that was in the
bh LRU cache.

The patchset modifies the page migration code so that bufferheads are
completely handled inside buffer_migrate_page() and then provides a new
migration helper for pages with buffer heads that is safe to use even for
block device pages and that also deals with bh lrus.

This patch (of 6):

Factor out function to compute number of expected page references in
migrate_page_move_mapping().  Note that we move hpage_nr_pages() and
page_has_private() checks from under xas_lock_irq() however this is safe
since we hold page lock.

[jack@suse.cz: fix expected_page_refs()]
  Link: http://lkml.kernel.org/r/20181217131710.GB8611@quack2.suse.cz
Link: http://lkml.kernel.org/r/20181211172143.7358-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Wei Yang
d9367bd06f mm, page_alloc: enable pcpu_drain with zone capability
drain_all_pages is documented to drain per-cpu pages for a given zone (if
non-NULL).  The current implementation doesn't match the description
though.  It will drain all pcp pages for all zones that happen to have
cached pages on the same cpu as the given zone.  This will lead to
premature pcp cache draining for zones that are not of any interest to the
caller - e.g.  compaction, hwpoison or memory offline.

This forces the page allocator to take locks and potential lock contention
as a result.

There is no real reason for this sub-optimal implementation.  Replace
per-cpu work item with a dedicated structure which contains a pointer to
the zone and pass it over to the worker.  This will get the zone
information all the way down to the worker function and do the right job.

[akpm@linux-foundation.org: avoid 80-col tricks]
[mhocko@suse.com: refactor the whole changelog]
Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Sri Krishna chowdary
d53ce04227 kmemleak: add config to select auto scan
Kmemleak scan can be cpu intensive and can stall user tasks at times.  To
prevent this, add config DEBUG_KMEMLEAK_AUTO_SCAN to enable/disable auto
scan on boot up.  Also protect first_run with DEBUG_KMEMLEAK_AUTO_SCAN as
this is meant for only first automatic scan.

Link: http://lkml.kernel.org/r/1540231723-7087-1-git-send-email-prpatel@nvidia.com
Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com>
Signed-off-by: Sachin Nikam <snikam@nvidia.com>
Signed-off-by: Prateek <prpatel@nvidia.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Waiman Long
3c0c12cc8f mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
When CONFIG_KASAN is enabled on large memory SMP systems, the deferrred
pages initialization can take a long time.  Below were the reported init
times on a 8-socket 96-core 4TB IvyBridge system.

  1) Non-debug kernel without CONFIG_KASAN
     [    8.764222] node 1 initialised, 132086516 pages in 7027ms

  2) Debug kernel with CONFIG_KASAN
     [  146.288115] node 1 initialised, 132075466 pages in 143052ms

So the page init time in a debug kernel was 20X of the non-debug kernel.
The long init time can be problematic as the page initialization is done
with interrupt disabled.  In this particular case, it caused the
appearance of following warning messages as well as NMI backtraces of all
the cores that were doing the initialization.

[   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
[   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
[   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
[   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
[   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
[   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
[   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
[   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)

The long init time was mainly caused by the call to kasan_free_pages() to
poison the newly initialized pages.  On a 4TB system, we are talking about
almost 500GB of memory probably on the same node.

In reality, we may not need to poison the newly initialized pages before
they are ever allocated.  So KASAN poisoning of freed pages before the
completion of deferred memory initialization is now disabled.  Those pages
will be properly poisoned when they are allocated or freed after deferred
pages initialization is done.

With this change, the new page initialization time became:

[   21.948010] node 1 initialised, 132075466 pages in 18702ms

This was still about double the non-debug kernel time, but was much
better than before.

Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Pingfan Liu
125b860b25 mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES
Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code.
If someone adds extra migrate type, then he may forget to enlarge the
NR_PAGEBLOCK_BITS.  Hence it requires some way to fix.

NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, while these macro spread on
two different .h file with reverse dependency, it is a little hard to
refer to MIGRATE_TYPES in pageblock-flag.h.  This patch tries to remind
such relation in compiling-time.

Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Kirill Tkhai
fcf9a0ef8d ksm: react on changing "sleep_millisecs" parameter faster
ksm thread unconditionally sleeps in ksm_scan_thread() after each
iteration:

	schedule_timeout_interruptible(
		msecs_to_jiffies(ksm_thread_sleep_millisecs))

The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.

In case of user writes a big value by a mistake, and the thread enters
into schedule_timeout_interruptible(), it's not possible to cancel the
sleep by writing a new smaler value; the thread is just sleeping till
timeout expires.

The patch fixes the problem by waking the thread each time after the value
is updated.

This also may be useful for debug purposes; and also for userspace
daemons, which change sleep_millisecs value in dependence of system load.

Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Michal Hocko
e0975b2aae mm, fault_around: do not take a reference to a locked page
filemap_map_pages takes a speculative reference to each page in the range
before it tries to lock that page.  While this is correct it also can
influence page migration which will bail out when seeing an elevated
reference count.  The faultaround code would bail on seeing a locked page
so we can pro-actively check the PageLocked bit before
page_cache_get_speculative and prevent from pointless reference count
churn.

Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Michal Hocko
bb8965bd82 mm, memory_hotplug: deobfuscate migration part of offlining
Memory migration might fail during offlining and we keep retrying in that
case.  This is currently obfuscated by goto retry loop.  The code is hard
to follow and as a result it is even suboptimal becase each retry round
scans the full range from start_pfn even though we have successfully
scanned/migrated [start_pfn, pfn] range already.  This is all only because
check_pages_isolated failure has to rescan the full range again.

De-obfuscate the migration retry loop by promoting it to a real for loop.
In fact remove the goto altogether by making it a proper double loop
(yeah, gotos are nasty in this specific case).  In the end we will get a
slightly more optimal code which is better readable.

[akpm@linux-foundation.org: reflow comments to 80 cols]
Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Michal Hocko
a85009c377 mm, memory_hotplug: try to migrate full pfn range
Patch series "few memory offlining enhancements".

I have been chasing memory offlining not making progress recently.  On the
way I have noticed few weird decisions in the code.  The migration itself
is restricted without a reasonable justification and the retry loop around
the migration is quite messy.  This is addressed by patch 1 and patch 2.

Patch 3 is targeting on the faultaround code which has been a hot
candidate for the initial issue reported upstream [2] and that I am
debugging internally.  It turned out to be not the main contributor in the
end but I believe we should address it regardless.  See the patch
description for more details.

[1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv

This patch (of 3):

do_migrate_range has been limiting the number of pages to migrate to 256
for some reason which is not documented.  Even if the limit made some
sense back then when it was introduced it doesn't really serve a good
purpose these days.  If the range contains huge pages then we break out of
the loop too early and go through LRU and pcp caches draining and
scan_movable_pages is quite suboptimal.

The only reason to limit the number of pages I can think of is to reduce
the potential time to react on the fatal signal.  But even then the number
of pages is a questionable metric because even a single page migration
might block in a non-killable state (e.g.  __unmap_and_move).

Remove the limit and offline the full requested range (this is one
memblock worth of pages with the current code).  Should we ever get a
report that offlining takes too long to react on fatal signal then we
should rather fix the core migration to use killable waits and bailout
on a signal.

Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Michal Hocko
7635d9cbe8 mm, thp, proc: report THP eligibility for each vma
Userspace falls short when trying to find out whether a specific memory
range is eligible for THP.  There are usecases that would like to know
that
http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.

The only way to deduce this now is to query for hg resp.  nh flags and
confronting the state with the global setting.  Except that there is also
PR_SET_THP_DISABLE that might change the picture.  So the final logic is
not trivial.  Moreover the eligibility of the vma depends on the type of
VMA as well.  In the past we have supported only anononymous memory VMAs
but things have changed and shmem based vmas are supported as well these
days and the query logic gets even more complicated because the
eligibility depends on the mount option and another global configuration
knob.

Simplify the current state and report the THP eligibility in
/proc/<pid>/smaps for each existing vma.  Reuse
transparent_hugepage_enabled for this purpose.  The original
implementation of this function assumes that the caller knows that the vma
itself is supported for THP so make the core checks into
__transparent_hugepage_enabled and use it for existing callers.
__show_smap just use the new transparent_hugepage_enabled which also
checks the vma support status (please note that this one has to be out of
line due to include dependency issues).

[mhocko@kernel.org: fix oops with NULL ->f_mapping]
  Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Oppenheimer <bepvte@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Jérôme Glisse
ac46d4f3c4 mm/mmu_notifier: use structure for invalidate_range_start/end calls v2
To avoid having to change many call sites everytime we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end cakks.  No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
From: Jérôme Glisse <jglisse@redhat.com>
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Jérôme Glisse
5d6527a784 mm/mmu_notifier: use structure for invalidate_range_start/end callback
Patch series "mmu notifier contextual informations", v2.

This patchset adds contextual information, why an invalidation is
happening, to mmu notifier callback.  This is necessary for user of mmu
notifier that wish to maintains their own data structure without having to
add new fields to struct vm_area_struct (vma).

For instance device can have they own page table that mirror the process
address space.  When a vma is unmap (munmap() syscall) the device driver
can free the device page table for the range.

Today we do not have any information on why a mmu notifier call back is
happening and thus device driver have to assume that it is always an
munmap().  This is inefficient at it means that it needs to re-allocate
device page table on next page fault and rebuild the whole device driver
data structure for the range.

Other use case beside munmap() also exist, for instance it is pointless
for device driver to invalidate the device page table when the
invalidation is for the soft dirtyness tracking.  Or device driver can
optimize away mprotect() that change the page table permission access for
the range.

This patchset enables all this optimizations for device drivers.  I do not
include any of those in this series but another patchset I am posting will
leverage this.

The patchset is pretty simple from a code point of view.  The first two
patches consolidate all mmu notifier arguments into a struct so that it is
easier to add/change arguments.  The last patch adds the contextual
information (munmap, protection, soft dirty, clear, ...).

This patch (of 3):

To avoid having to change many callback definition everytime we want to
add a parameter use a structure to group all parameters for the
mmu_notifier invalidate_range_start/end callback.  No functional changes
with this patch.

[akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Michal Hocko
b15c87263a hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
We have received a bug report that an injected MCE about faulty memory
prevents memory offline to succeed on 4.4 base kernel.  The underlying
reason was that the HWPoison page has an elevated reference count and the
migration keeps failing.  There are two problems with that.  First of all
it is dubious to migrate the poisoned page because we know that accessing
that memory is possible to fail.  Secondly it doesn't make any sense to
migrate a potentially broken content and preserve the memory corruption
over to a new location.

Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simply testcase

===

int main(void)
{
        int ret;
        int i;
        int fd;
        char *array = malloc(4096);
        char *array_locked = malloc(4096);

        fd = open("/tmp/data", O_RDONLY);
        read(fd, array, 4095);

        for (i = 0; i < 4096; i++)
                array_locked[i] = 'd';

        ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
        if (ret)
                perror("mlock");

        sleep (20);

        ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
        if (ret)
                perror("madvise");

        for (i = 0; i < 4096; i++)
                array_locked[i] = 'd';

        return 0;
}
===

+ offline this memory.

In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU
list
kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80

And that latter confuses the hotremove path because an LRU page is
attempted to be migrated and that fails due to an elevated reference
count.  It is quite possible that the reuse of the HWPoisoned page is some
kind of fixed race condition but I am not really sure about that.

With the upstream kernel the failure is slightly different.  The page
doesn't seem to have LRU bit set but isolate_movable_page simply fails and
do_migrate_range simply puts all the isolated pages back to LRU and
therefore no progress is made and scan_movable_pages finds same set of
pages over and over again.

Fix both cases by explicitly checking HWPoisoned pages before we even try
to get reference on the page, try to unmap it if it is still mapped.  As
explained by Naoya:

: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.

Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages which shouldn't happen but I couldn't convince myself about
that.  Naoya has noted the following:

: Theoretically no such gurantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after hwpoison_user_mappings block) also impli=
: es
: that the target page can still have PageLRU flag.
:
:         /*
:          * Torn down by someone else?
:          */
:         if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) {
:                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
:                 res =3D -EBUSY;
:                 goto out;
:         }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.

Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.com>
Debugged-by: Oscar Salvador <osalvador@suse.com>
Tested-by: Oscar Salvador <osalvador@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Oscar Salvador
9f1eb38e0e mm, kmemleak: little optimization while scanning
kmemleak_scan() goes through all online nodes and tries to scan all used
pages.

We can do better and use pfn_to_online_page(), so in case we have
CONFIG_MEMORY_HOTPLUG, offlined pages will be skiped automatically.  For
boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page()
will fallback to pfn_valid().

Another little optimization is to check if the page belongs to the node we
are currently checking, so in case we have nodes interleaved we will not
check the same pfn multiple times.

I ran some tests:

Add some memory to node1 and node2 making it interleaved:

(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
(qemu) object_add memory-backend-ram,id=ram1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2
(qemu) object_add memory-backend-ram,id=ram2,size=1G
(qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1

Then, we offline that memory:
 # for i in {32..39} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
 # for i in {48..55} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;don
 # for i in {40..47} ; do echo "offline" > /sys/devices/system/node/node2/memory$i/state;done

And we run kmemleak_scan:

 # echo "scan" > /sys/kernel/debug/kmemleak

before the patch:

kmemleak: time spend: 41596 us

after the patch:

kmemleak: time spend: 34899 us

[akpm@linux-foundation.org: remove stray newline, per Oscar]
Link: http://lkml.kernel.org/r/20181206131918.25099-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Kirill Tkhai
c16eb000ca mm/filemap.c: remove useless check in pagecache_get_page()
page always is not NULL, so we may remove this useless check.

Link: http://lkml.kernel.org/r/154419752044.18559.2452963074922917720.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Oscar Salvador
bbe5d9939e mm/page_alloc.c: drop uneeded __meminit and __meminitdata
Since commit 03e85f9d5f ("mm/page_alloc: Introduce
free_area_init_core_hotplug"), some functions changed to only be called
during system initialization.  Concretly, free_area_init_node() and the
functions that hang from it.

Also, some variables are no longer used after the system has gone
through initialization.  So this could be considered as a late clean-up
for that patch.

This patch changes the functions from __meminit to __init, and the
variables from __meminitdata to __initdata.

In return, we get some KBs back:

Before:
  Freeing unused kernel image memory: 2472K

After:
  Freeing unused kernel image memory: 2480K

Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Brian Foster
3fa750dcf2 mm/page-writeback.c: don't break integrity writeback on ->writepage() error
write_cache_pages() is used in both background and integrity writeback
scenarios by various filesystems.  Background writeback is mostly
concerned with cleaning a certain number of dirty pages based on various
mm heuristics.  It may not write the full set of dirty pages or wait for
I/O to complete.  Integrity writeback is responsible for persisting a set
of dirty pages before the writeback job completes.  For example, an
fsync() call must perform integrity writeback to ensure data is on disk
before the call returns.

write_cache_pages() unconditionally breaks out of its processing loop in
the event of a ->writepage() error.  This is fine for background
writeback, which had no strict requirements and will eventually come
around again.  This can cause problems for integrity writeback on
filesystems that might need to clean up state associated with failed page
writeouts.  For example, XFS performs internal delayed allocation
accounting before returning a ->writepage() error, where applicable.  If
the current writeback happens to be associated with an unmount and
write_cache_pages() completes the writeback prematurely due to error, the
filesystem is unmounted in an inconsistent state if dirty+delalloc pages
still exist.

To handle this problem, update write_cache_pages() to always process the
full set of pages for integrity writeback regardless of ->writepage()
errors.  Save the first encountered error and return it to the caller once
complete.  This facilitates XFS (or any other fs that expects integrity
writeback to process the entire set of dirty pages) to clean up its
internal state completely in the event of persistent mapping errors.
Background writeback continues to exit on the first error encountered.

[akpm@linux-foundation.org: fix typo in comment]
Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
YueHaibing
0ecea993d0 mm/hmm.c: remove set but not used variable 'devmem'
Fixes gcc '-Wunused-but-set-variable' warning:

mm/hmm.c: In function 'hmm_devmem_ref_kill':
mm/hmm.c:995:21: warning:
 variable 'devmem' set but not used [-Wunused-but-set-variable]

It not used any more since 35d39f953d4e ("mm, hmm: replace
hmm_devmem_pages_create() with devm_memremap_pages()")

Link: http://lkml.kernel.org/r/1543629971-128377-1-git-send-email-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Wei Yang
fa004ab736 mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection
During online_pages phase, pgdat->nr_zones will be updated in case this
zone is empty.

Currently the online_pages phase is protected by the global locks
(device_device_hotplug_lock and mem_hotplug_lock), which ensures there is
no contention during the update of nr_zones.

These global locks introduces scalability issues (especially the second
one), which slow down code relying on get_online_mems().  This is also a
preparation for not having to rely on get_online_mems() but instead some
more fine grained locks.

The patch moves init_currently_empty_zone under both zone_span_writelock
and pgdat_resize_lock because both the pgdat state is changed (nr_zones)
and the zone's start_pfn.  Also this patch changes the documentation of
node_size_lock to include the protection of nr_zones.

Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Wei Yang
4e0d2e7ef1 mm, sparse: pass nid instead of pgdat to sparse_add_one_section()
Since the information needed in sparse_add_one_section() is node id to
allocate proper memory, it is not necessary to pass its pgdat.

This patch changes the prototype of sparse_add_one_section() to pass node
id directly.  This is intended to reduce misleading that
sparse_add_one_section() would touch pgdat.

Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Wei Yang
83af658898 mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section()
pgdat_resize_lock is used to protect pgdat's memory region information
like: node_start_pfn, node_present_pages, etc.  While in function
sparse_add/remove_one_section(), pgdat_resize_lock is used to protect
initialization/release of one mem_section.  This looks not proper.

These code paths are currently protected by mem_hotplug_lock currently but
should there ever be any reason for locking at the sparse layer a
dedicated lock should be introduced.

Following is the current call trace of sparse_add/remove_one_section()

    mem_hotplug_begin()
    arch_add_memory()
       add_pages()
           __add_pages()
               __add_section()
                   sparse_add_one_section()
    mem_hotplug_done()

    mem_hotplug_begin()
    arch_remove_memory()
        __remove_pages()
            __remove_section()
                sparse_remove_one_section()
    mem_hotplug_done()

The comment above the pgdat_resize_lock also mentions "Holding this will
also guarantee that any pfn_valid() stays that way.", which is true with
the current implementation and false after this patch.  But current
implementation doesn't meet this comment.  There isn't any pfn walkers to
take the lock so this looks like a relict from the past.  This patch also
removes this comment.

[richard.weiyang@gmail.com: v4]
  Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com
[mhocko@suse.com: changelog suggestion]
Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Qian Cai
fed84c7852 mm/memblock.c: skip kmemleak for kasan_init()
Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
Huawei TaiShan 2280 aarch64 servers).

After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
log buffer went from something like 280 to 260000 which caused kmemleak
disabled and crash dump memory reservation failed.  The multitude of
kmemleak_alloc() calls is from nested loops while KASAN is setting up full
memory mappings, so let early kmemleak allocations skip those
memblock_alloc_internal() calls came from kasan_init() given that those
early KASAN memory mappings should not reference to other memory.  Hence,
no kmemleak false positives.

kasan_init
  kasan_map_populate [1]
    kasan_pgd_populate [2]
      kasan_pud_populate [3]
        kasan_pmd_populate [4]
          kasan_pte_populate [5]
            kasan_alloc_zeroed_page
              memblock_alloc_try_nid
                memblock_alloc_internal
                  kmemleak_alloc

[1] for_each_memblock(memory, reg)
[2] while (pgdp++, addr = next, addr != end)
[3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
[4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
[5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))

Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us
Signed-off-by: Qian Cai <cai@gmx.us>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Oscar Salvador
2c2a5af6fe mm, memory_hotplug: add nid parameter to arch_remove_memory
Patch series "Do not touch pages in hot-remove path", v2.

This patchset aims for two things:

 1) A better definition about offline and hot-remove stage
 2) Solving bugs where we can access non-initialized pages
    during hot-remove operations [2] [3].

This is achieved by moving all page/zone handling to the offline
stage, so we do not need to access pages when hot-removing memory.

[1] https://patchwork.kernel.org/cover/10691415/
[2] https://patchwork.kernel.org/patch/10547445/
[3] https://www.spinics.net/lists/linux-mm/msg161316.html

This patch (of 5):

This is a preparation for the following-up patches.  The idea of passing
the nid is that it will allow us to get rid of the zone parameter
afterwards.

Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Wei Yang
23b68cfaae mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init()
When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of
each node's highest zone is initialized before defer stage.

static_init_pgcnt is used to store the number of pages like this:

    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                              pgdat->node_spanned_pages);

because we don't want to overflow zone's range.

But this is not necessary, since defer_init() is called like this:

  memmap_init_zone()
    for pfn in [start_pfn, end_pfn)
      defer_init(pfn, end_pfn)

In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
stop before calling defer_init().

BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct,
since nr_initialised is zone based instead of node based.  Even
node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone
would have pages less than PAGES_PER_SECTION.

Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Hugh Dickins
9a1ea439b1 mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.

And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page.  With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).

This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.

David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2

Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db9165 ("sched/wait: Break up long wake
list walk") 11a19c7b09 ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")

Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2

We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it.  That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.

__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced.  Should we perhaps disable preemption across this?

shrink_page_list()'s __ClearPageLocked(): that was a surprise!  This
survived a lot of testing before that showed up.  PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used.  Follow the suggestion from
Michal Hocko, just revert a978d6f521 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.

It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held?  Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.

[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
yuzhoujian
f0c867d958 mm, oom: add oom victim's memcg to the oom context information
The current oom report doesn't display victim's memcg context during the
global OOM situation.  While this information is not strictly needed, it
can be really helpful for containerized environments to locate which
container has lost a process.  Now that we have a single line for the oom
context, we can trivially add both the oom memcg (this can be either
global_oom or a specific memcg which hits its hard limits) and task_memcg
which is the victim's memcg.

Below is the single line output in the oom report after this patch.

- global oom context information:

oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>

- memcg oom context information:

oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>

[penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()]
  Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp
Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com
Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
yuzhoujian
ef8444ea01 mm, oom: reorganize the oom report in dump_header
OOM report contains several sections.  The first one is the allocation
context that has triggered the OOM.  Then we have cpuset context followed
by the stack trace of the OOM path.  The tird one is the OOM memory
information.  Followed by the current memory state of all system tasks.
At last, we will show oom eligible tasks and the information about the
chosen oom victim.

One thing that makes parsing more awkward than necessary is that we do not
have a single and easily parsable line about the oom context.  This patch
is reorganizing the oom report to

1) who invoked oom and what was the allocation request

[  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

2) OOM stack trace

[  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
[  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
[  515.906821] Call Trace:
[  515.908062]  dump_stack+0x5a/0x73
[  515.909311]  dump_header+0x55/0x28c
[  515.914260]  oom_kill_process+0x2d8/0x300
[  515.916708]  out_of_memory+0x145/0x4a0
[  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
[  515.919157]  __alloc_pages_nodemask+0x277/0x290
[  515.920367]  filemap_fault+0x3d0/0x6c0
[  515.921529]  ? filemap_map_pages+0x2b8/0x420
[  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
[  515.923884]  __do_fault+0x20/0x80
[  515.925032]  __handle_mm_fault+0xbc0/0xe80
[  515.926195]  handle_mm_fault+0xfa/0x210
[  515.927357]  __do_page_fault+0x233/0x4c0
[  515.928506]  do_page_fault+0x32/0x140
[  515.929646]  ? page_fault+0x8/0x30
[  515.930770]  page_fault+0x1e/0x30

3) OOM memory information

[  515.958093] Mem-Info:
[  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
 active_file:4402672 inactive_file:483963 isolated_file:1344
 unevictable:0 dirty:4886753 writeback:0 unstable:0
 slab_reclaimable:148442 slab_unreclaimable:18741
 mapped:1347 shmem:1347 pagetables:58669 bounce:0
 free:88663 free_pcp:0 free_cma:0
...

4) current memory state of all system tasks

[  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
[  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
[  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
[  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
[  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
[  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
[  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
[  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
[  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
[  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
[  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
[  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
[  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
[  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic

5) oom context (contrains and the chosen victim).

oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0

An admin can easily get the full oom context at a single line which
makes parsing much easier.

Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Alexey Dobriyan
e5cb113f2d mm: make free_reserved_area() return "const char *"
and propagate through down the call stack.

Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Alexey Dobriyan
9a2f45ff32 mm/debug.c: make "migrate_reason_names[]" const char *
Those strings are immutable as well.

Link: http://lkml.kernel.org/r/20181124090508.GB10877@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Alexey Dobriyan
c999fbd3dc mm/mmzone.c: make "migratetype_names" const char *
Those strings are immutable in fact.

Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Mel Gorman
1c30844d2d mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system.  This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation causing
events occurs.  The boosting will stall allocations that would decrease
free memory below the boosted low watermark and kswapd is woken if the
calling context allows to reclaim an amount of memory relative to the size
of the high watermark and the watermark_boost_factor until the boost is
cleared.  When kswapd finishes, it wakes kcompactd at the pageblock order
to clean some of the pageblocks that may have been affected by the
fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance.  Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)

Note that external fragmentation causing events are massively reduced by
this path whether in comparison to the previous kernel or the vanilla
kernel.  The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  291392
4.20-rc3+patch:                     191187 (34% reduction)
4.20-rc3+patch1-4:                   13464 (95% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)

As before, massive reduction in external fragmentation events, some jitter
on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  215698
4.20-rc3+patch:                     200210 (7% reduction)
4.20-rc3+patch1-4:                   14263 (93% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)

There is a 93% reduction in fragmentation causing events, there is a big
reduction in the huge page fault latency and allocation success rate is
higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch:                    147463 (11% reduction)
4.20-rc3+patch1-4:                  11095 (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)

There is a large reduction in fragmentation events with some jitter around
the latencies and success rates.  As before, the high THP allocation
success rate does mean the system is under a lot of pressure.  However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.

Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Mel Gorman
0a79cdad5e mm: use alloc_flags to record if kswapd can wake
This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
into alloc_flags.  This is a preparation patch only that avoids having to
pass gfp_mask through a long callchain in a future patch.

Note that the setting in the fast path happens in alloc_flags_nofragment()
and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
That's true in this patch but is not true later so it's done now for
easier review to show where the flag needs to be recorded.

No functional change.

[mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
  Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Mel Gorman
a921444382 mm: move zone watermark accesses behind an accessor
This is a preparation patch only, no functional change.

Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Mel Gorman
6bb154504f mm, page_alloc: spread allocations across zones before introducing fragmentation
Patch series "Fragmentation avoidance improvements", v5.

It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, a definition
of external fragmentation events (including serious) ones and a series
that reduces the level of those fragmentation events.

The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameterr create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% physical memory so that the slabs and page cache
   pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with less
   than 8 cores. The benchmark records whether huge pages were allocated
   and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup

Overall the series reduces external fragmentation causing events by over 94%
on 1 and 2 socket machines, which in turn impacts high-order allocation
success rates over the long term. There are differences in latencies and
high-order allocation success rates. Latencies are a mixed bag as they
are vulnerable to exact system state and whether allocations succeeded
so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory
	instead of fragmenting a higher zone. It's special cased to
	handle a Normal->DMA32 fallback with the reasons explained
	in the changelog.

Patch 2-4 boosts watermarks temporarily when an external fragmentation
	event occurs. kswapd wakes to reclaim a small amount of old memory
	and then wakes kcompactd on completion to recover the system
	slightly. This introduces some overhead in the slowpath. The level
	of boosting can be tuned or disabled depending on the tolerance
	for fragmentation vs allocation latency.

Patch 5 stalls some movable allocation requests to let kswapd from patch 4
	make some progress. The duration of the stalls is very low but it
	is possible to tune the system to avoid fragmentation events if
	larger stalls can be tolerated.

The bulk of the improvement in fragmentation avoidance is from patches
1-4 but patch 5 can deal with a rare corner case and provides the option
of tuning a system for THP allocation success rates in exchange for
some stalls to control fragmentation.

This patch (of 5):

The page allocator zone lists are iterated based on the watermarks of each
zone which does not take anti-fragmentation into account.  On x86, node 0
may have multiple zones while other nodes have one zone.  A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory.  This patch special cases the
allocator fast path such that it'll try an allocation from a lower local
zone before fragmenting a higher zone.  In this case, stealing of
pageblocks or orders larger than a pageblock are still allowed in the fast
path as they are uninteresting from a fragmentation point of view.

This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations.  It's implemented in mmtests as the following
configurations

configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% physical memory so that the slabs and page cache
   pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll refault old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with less
   than 8 cores. The benchmark records whether huge pages were allocated
   and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup the test files.

Note that due to the use of IO and page cache that this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive.  Also note that while this is one mix that generates
fragmentation that it's not the only mix that generates fragmentation.
Differences in workload that are more slab-intensive or whether SLUB is
used with high-order pages may yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future.
Hence, the primary metric here is the number of external fragmentation
events that occur with order < 9.  The secondary metric is allocation
latency and huge page allocation success rates but note that differences
in latencies and what the success rate also can affect the number of
external fragmentation event which is why it's a secondary metric.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)

                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)

Fault latencies are slightly reduced while allocation success rates remain
at zero as this configuration does not make any special effort to allocate
THP and fio is heavily active at the time and either filling memory or
keeping pages resident.  However, a 49% reduction of serious fragmentation
events reduces the changes of external fragmentation being a problem in
the future.

Vlastimil asked during review for a breakdown of the allocation types
that are falling back.

vanilla
   3816 MIGRATE_UNMOVABLE
 800845 MIGRATE_MOVABLE
     33 MIGRATE_UNRECLAIMABLE

patch
    735 MIGRATE_UNMOVABLE
 408135 MIGRATE_MOVABLE
     42 MIGRATE_UNRECLAIMABLE

The majority of the fallbacks are due to movable allocations and this is
consistent for the workload throughout the series so will not be presented
again as the primary source of fallbacks are movable allocations.

Movable fallbacks are sometimes considered "ok" to fallback because they
can be migrated.  The problem is that they can fill an
unmovable/reclaimable pageblock causing those allocations to fallback
later and polluting pageblocks with pages that cannot move.  If there is a
movable fallback, it is pretty much guaranteed to affect an
unmovable/reclaimable pageblock and while it might not be enough to
actually cause a unmovable/reclaimable fallback in the future, we cannot
know that in advance so the patch takes the only option available to it.
Hence, it's important to control them.  This point is also consistent
throughout the series and will not be repeated.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  291392
4.20-rc3+patch:                     191187 (34% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)

Fragmentation events were reduced quite a bit although this is known
to be a little variable. The latencies and allocation success rates
are similar but they were already quite high.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  215698
4.20-rc3+patch:                     200210 (7% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)

                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)

The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit ac5b2c1891
("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
allocations can now spill over to remote nodes instead of fragmenting
local memory.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch:                    147463 (11% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*

thpfioscale Percentage Faults Huge
                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)

There was a slight reduction in external fragmentation events although the
latencies were higher.  The allocation success rate is high enough that
the system is struggling and there is quite a lot of parallel reclaim and
compaction activity.  There is also a certain degree of luck on whether
processes start on node 0 or not for this patch but the relevance is
reduced later in the series.

Overall, the patch reduces the number of external fragmentation causing
events so the success of THP over long periods of time would be improved
for this adverse workload.

Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
David Hildenbrand
f29d8e9c01 mm/memory_hotplug: drop "online" parameter from add_memory_resource()
Userspace should always be in charge of how to online memory and if memory
should be onlined automatically in the kernel.  Let's drop the parameter
to overwrite this - XEN passes memhp_auto_online, just like add_memory(),
so we can directly use that instead internally.

Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Mike Rapoport
4d72868c8f memblock: replace usage of __memblock_free_early() with memblock_free()
__memblock_free_early() is only used by the convenience wrappers, so
essentially we wrap a call to memblock_free() twice.  Replace calls of
__memblock_free_early() with calls to memblock_free() and drop the former.

Link: http://lkml.kernel.org/r/20181125102940.GE28634@rapoport-lnx
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Wentao Wang <witallwang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Wentao Wang
d31cfe7bff mm/page_alloc.c: deduplicate __memblock_free_early() and memblock_free()
Link: http://lkml.kernel.org/r/C8ECE1B7A767434691FEEFA3A01765D72AFB8E78@MX203CL03.corp.emc.com
Signed-off-by: Wentao Wang <witallwang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Aaron Lu
742aa7fb52 mm/page_alloc.c: use a single function to free page
There are multiple places of freeing a page, they all do the same things
so a common function can be used to reduce code duplicate.

It also avoids bug fixed in one function but left in another.

Link: http://lkml.kernel.org/r/20181119134834.17765-3-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pankaj gupta <pagupta@redhat.com>
Cc: Pawel Staszewski <pstaszewski@itcare.pl>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Aaron Lu
65895b67ad mm/page_alloc.c: free order-0 pages through PCP in page_frag_free()
page_frag_free() calls __free_pages_ok() to free the page back to Buddy.
This is OK for high order page, but for order-0 pages, it misses the
optimization opportunity of using Per-Cpu-Pages and can cause zone lock
contention when called frequently.

Pawel Staszewski recently shared his result of 'how Linux kernel handles
normal traffic'[1] and from perf data, Jesper Dangaard Brouer found the
lock contention comes from page allocator:

  mlx5e_poll_tx_cq
  |
   --16.34%--napi_consume_skb
             |
             |--12.65%--__free_pages_ok
             |          |
             |           --11.86%--free_one_page
             |                     |
             |                     |--10.10%--queued_spin_lock_slowpath
             |                     |
             |                      --0.65%--_raw_spin_lock
             |
             |--1.55%--page_frag_free
             |
              --1.44%--skb_release_data

Jesper explained how it happened: mlx5 driver RX-page recycle mechanism is
not effective in this workload and pages have to go through the page
allocator.  The lock contention happens during mlx5 DMA TX completion
cycle.  And the page allocator cannot keep up at these speeds.[2]

I thought that __free_pages_ok() are mostly freeing high order pages and
thought this is an lock contention for high order pages but Jesper
explained in detail that __free_pages_ok() here are actually freeing
order-0 pages because mlx5 is using order-0 pages to satisfy its page pool
allocation request.[3]

The free path as pointed out by Jesper is:
skb_free_head()
  -> skb_free_frag()
    -> page_frag_free()
And the pages being freed on this path are order-0 pages.

Fix this by doing similar things as in __page_frag_cache_drain() - send
the being freed page to PCP if it's an order-0 page, or directly to Buddy
if it is a high order page.

With this change, Paweł hasn't noticed lock contention yet in his
workload and Jesper has noticed a 7% performance improvement using a micro
benchmark and lock contention is gone.  Ilias' test on a 'low' speed 1Gbit
interface on an cortex-a53 shows ~11% performance boost testing with
64byte packets and __free_pages_ok() disappeared from perf top.

[1]: https://www.spinics.net/lists/netdev/msg531362.html
[2]: https://www.spinics.net/lists/netdev/msg531421.html
[3]: https://www.spinics.net/lists/netdev/msg531556.html

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/20181120014544.GB10657@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Dan Williams
02917e9f86 mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
At Maintainer Summit, Greg brought up a topic I proposed around
EXPORT_SYMBOL_GPL usage.  The motivation was considerations for when
EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
step of reclassifying an existing export.  Specifically, I wanted to make
the case that although the line is fuzzy and hard to specify in abstract
terms, it is nonetheless clear that devm_memremap_pages() and HMM
(Heterogeneous Memory Management) have crossed it.  The
devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
beginning, and HMM as a derivative of that functionality should have
naturally picked up that designation as well.

Contrary to typical rules, the HMM infrastructure was merged upstream with
zero in-tree consumers.  There was a promise at the time that those users
would be merged "soon", but it has been over a year with no drivers
arriving.  While the Nouveau driver is about to belatedly make good on
that promise it is clear that HMM was targeted first and foremost at an
out-of-tree consumer.

HMM is derived from devm_memremap_pages(), a facility Christoph and I
spearheaded to support persistent memory.  It combines a device lifetime
model with a dynamically created 'struct page' / memmap array for any
physical address range.  It enables coordination and control of the many
code paths in the kernel built to interact with memory via 'struct page'
objects.  With HMM the integration goes even deeper by allowing device
drivers to hook and manipulate page fault and page free events.

One interpretation of when EXPORT_SYMBOL is suitable is when it is
exporting stable and generic leaf functionality.  The
devm_memremap_pages() facility continues to see expanding use cases,
peer-to-peer DMA being the most recent, with no clear end date when it
will stop attracting reworks and semantic changes.  It is not suitable to
export devm_memremap_pages() as a stable 3rd party driver API due to the
fact that it is still changing and manipulates core behavior.  Moreover,
it is not in the best interest of the long term development of the core
memory management subsystem to permit any external driver to effectively
define its own system-wide memory management policies with no
encouragement to engage with upstream.

I am also concerned that HMM was designed in a way to minimize further
engagement with the core-MM.  That, with these hooks in place,
device-drivers are free to implement their own policies without much
consideration for whether and how the core-MM could grow to meet that
need.  Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
core-MM should be allowed the opportunity and stimulus to change and
address these new use cases as first class functionality.

Original changelog:

hmm_devmem_add(), and hmm_devmem_add_resource() duplicated
devm_memremap_pages() and are now simple now wrappers around the core
facility to inject a dev_pagemap instance into the global pgmap_radix and
hook page-idle events.  The devm_memremap_pages() interface is base
infrastructure for HMM.  HMM has more and deeper ties into the kernel
memory management implementation than base ZONE_DEVICE which is itself a
EXPORT_SYMBOL_GPL facility.

Originally, the HMM page structure creation routines copied the
devm_memremap_pages() code and reused ZONE_DEVICE.  A cleanup to unify the
implementations was discussed during the initial review:
http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
this cleanup to move forward.

In addition to the integration with devm_memremap_pages() HMM depends on
other GPL-only symbols:

    mmu_notifier_unregister_no_release
    percpu_ref
    region_intersects
    __class_create

It goes further to consume / indirectly expose functionality that is not
exported to any other driver:

    alloc_pages_vma
    walk_page_range

HMM is derived from devm_memremap_pages(), and extends deep core-kernel
fundamentals. Similar to devm_memremap_pages(), mark its entry points
EXPORT_SYMBOL_GPL().

[logang@deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()]
  Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com
Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>,
Cc: Michal Hocko <mhocko@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
Dan Williams
bbecd94e6c mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages()
Commit e8d5134833 ("memremap: change devm_memremap_pages interface to
use struct dev_pagemap") refactored devm_memremap_pages() to allow a
dev_pagemap instance to be supplied.  Passing in a dev_pagemap interface
simplifies the design of pgmap type drivers in that they can rely on
container_of() to lookup any private data associated with the given
dev_pagemap instance.

In addition to the cleanups this also gives hmm users multi-order-radix
improvements that arrived with commit ab1b597ee0 "mm,
devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups"

As part of the conversion to the devm_memremap_pages() method of
handling the percpu_ref relative to when pages are put, the percpu_ref
completion needs to move to hmm_devmem_ref_exit().  See 7138970383
("mm, zone_device: Replace {get, put}_zone_device_page...") for details.

Link: http://lkml.kernel.org/r/154275560053.76910.10870962637383152392.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Dan Williams
58ef15b765 mm, hmm: use devm semantics for hmm_devmem_{add, remove}
devm semantics arrange for resources to be torn down when
device-driver-probe fails or when device-driver-release completes.
Similar to devm_memremap_pages() there is no need to support an explicit
remove operation when the users properly adhere to devm semantics.

Note that devm_kzalloc() automatically handles allocating node-local
memory.

Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Huang Shijie
7ead334215 mm/page_alloc.c: change the order of MIGRATE_RECLAIMABLE/MIGRATE_MOVABLE in fallbacks
In the enum migratetype definition, MIGRATE_MOVABLE is before
MIGRATE_RECLAIMABLE.  Change the order of them to match the enumeration's
order.

Link: http://lkml.kernel.org/r/20181121085821.3442-1-sjhuang@iluvatar.ai
Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Aaron Lu
66f71da9dd mm/swap: use nr_node_ids for avail_lists in swap_info_struct
Since a2468cc9bf ("swap: choose swap device according to numa node"),
avail_lists field of swap_info_struct is changed to an array with
MAX_NUMNODES elements.  This made swap_info_struct size increased to 40KiB
and needs an order-4 page to hold it.

This is not optimal in that:
1 Most systems have way less than MAX_NUMNODES(1024) nodes so it
  is a waste of memory;
2 It could cause swapon failure if the swap device is swapped on
  after system has been running for a while, due to no order-4
  page is available as pointed out by Vasily Averin.

Solve the above two issues by using nr_node_ids(which is the actual
possible node number the running system has) for avail_lists instead of
MAX_NUMNODES.

nr_node_ids is unknown at compile time so can't be directly used when
declaring this array.  What I did here is to declare avail_lists as zero
element array and allocate space for it when allocating space for
swap_info_struct.  The reason why keep using array but not pointer is
plist_for_each_entry needs the field to be part of the struct, so pointer
will not work.

This patch is on top of Vasily Averin's fix commit.  I think the use of
kvzalloc for swap_info_struct is still needed in case nr_node_ids is
really big on some systems.

Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Wei Yang
8b09549c2b vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n
Commit fa5e084e43 ("vmscan: do not unconditionally treat zones that
fail zone_reclaim() as full") changed the return value of
node_reclaim().  The original return value 0 means NODE_RECLAIM_SOME
after this commit.

While the return value of node_reclaim() when CONFIG_NUMA is n is not
changed.  This will leads to call zone_watermark_ok() again.

This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
Since node_reclaim() is only called in page_alloc.c, move it to
mm/internal.h.

Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Arun KS
476567e873 mm: remove managed_page_count_lock spinlock
Now that totalram_pages and managed_pages are atomic varibles, no need of
managed_page_count spinlock.  The lock had really a weak consistency
guarantee.  It hasn't been used for anything but the update but no reader
actually cares about all the values being updated to be in sync.

Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Arun KS
ca79b0c211 mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline function.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Arun KS
9705bea5f8 mm: convert zone->managed_pages to atomic variable
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

This patch converts zone->managed_pages.  Subsequent patches will convert
totalram_panges, totalhigh_pages and eventually managed_page_count_lock
will be removed.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes
better to remove the lock and convert variables to atomic, with preventing
poteintial store-to-read tearing as a bonus.

Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Arun KS
3d6357de8a mm: reference totalram_pages and managed_pages once per function
Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.

This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.

totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.

Main motivation was that managed_page_count_lock handling was complicating
things.  It was discussed in length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better
to remove the lock and convert variables to atomic.  With the change,
preventing poteintial store-to-read tearing comes as a bonus.

This patch (of 4):

This is in preparation to a later patch which converts totalram_pages and
zone->managed_pages to atomic variables.  Please note that re-reading the
value might lead to a different value and as such it could lead to
unexpected behavior.  There are no known bugs as a result of the current
code but it is better to prevent from them in principle.

Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Wei Yang
fecd4a50ba mm: remove reset of pcp->counter in pageset_init()
per_cpu_pageset is cleared by memset, it is not necessary to reset it
again.

Link: http://lkml.kernel.org/r/20181021023920.5501-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Michal Hocko
46a3679b81 mm, memory_hotplug: do not clear numa_node association after hot_remove
Per-cpu numa_node provides a default node for each possible cpu.  The
association gets initialized during the boot when the architecture
specific code explores cpu->NUMA affinity.  When the whole NUMA node is
removed though we are clearing this association

try_offline_node
  check_and_unmap_cpu_on_node
    unmap_cpu_on_node
      numa_clear_node
        numa_set_node(cpu, NUMA_NO_NODE)

This means that whoever calls cpu_to_node for a cpu associated with such a
node will get NUMA_NO_NODE.  This is problematic for two reasons.  First
it is fragile because __alloc_pages_node would simply blow up on an
out-of-bound access.  We have encountered this when loading kvm module

  BUG: unable to handle kernel paging request at 00000000000021c0
  IP: __alloc_pages_nodemask+0x93/0xb70
  PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
  Oops: 0000 [#1] SMP
  [...]
  CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W          4.4.156-94.64-default #1
  RIP: __alloc_pages_nodemask+0x93/0xb70
  RSP: 0018:ffff887354493b40  EFLAGS: 00010202
  RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
  RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
  R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
  R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
  FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
    hardware_setup+0x781/0x849 [kvm_intel]
    kvm_arch_hardware_setup+0x28/0x190 [kvm]
    kvm_init+0x7c/0x2d0 [kvm]
    vmx_init+0x1e/0x32c [kvm_intel]
    do_one_initcall+0xca/0x1f0
    do_init_module+0x5a/0x1d7
    load_module+0x1393/0x1c90
    SYSC_finit_module+0x70/0xa0
    entry_SYSCALL_64_fastpath+0x1e/0xb7
  DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

on an older kernel but the code is basically the same in the current Linus
tree as well.  alloc_vmcs_cpu could use alloc_pages_nodemask which would
recognize NUMA_NO_NODE and use alloc_pages_node which would translate it
to numa_mem_id but that is wrong as well because it would use a cpu
affinity of the local CPU which might be quite far from the original node.
It is also reasonable to expect that cpu_to_node will provide a sane
value and there might be many more callers like that.

The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined.  We do not
want to lose that link as there is no arch independent way to get it from
the early boot time AFAICS.

Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues.  The NODE_DATA(nid) is not deallocated so
it will stay in place and if anybody wants to allocate from that node then
a fallback node will be used.

Thanks to Vlastimil Babka for his live system debugging skills that helped
debugging the issue.

Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
Fixes: e13fe8695c ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Yangtao Li
9cabf929e7 mm/mmap.c: remove verify_mm_writelocked()
We should get rid of this function.  It no longer serves its purpose.
This is a historical artifact from 2005 where do_brk was called outside of
the core mm.  We do have a proper abstraction in vm_brk_flags and that one
does the locking properly so there is no need to use this function.

Link: http://lkml.kernel.org/r/20181108174856.10811-1-tiny.windzz@gmail.com
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:47 -08:00
Timofey Titovets
59e1a2f4bf ksm: replace jhash2 with xxhash
Replace jhash2 with xxhash.

Perf numbers:
Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
ksm: crc32c   hash() 12081 MB/s
ksm: xxh64    hash()  8770 MB/s
ksm: xxh32    hash()  4529 MB/s
ksm: jhash2   hash()  1569 MB/s

Sioh Lee did some testing:

crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns

As jhash2 always will be slower (for data size like PAGE_SIZE).  Don't use
it in ksm at all.

Use only xxhash for now, because for using crc32c, cryptoapi must be
initialized first - that requires some tricky solution to work well in all
situations.

Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: leesioh <solee@os.korea.ac.kr>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Michal Hocko
d381c54760 mm: only report isolation failures when offlining memory
Heiko has complained that his log is swamped by warnings from
has_unmovable_pages

[   20.536664] page dumped because: has_unmovable_pages
[   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
[   20.536794] flags: 0x3fffe0000010200(slab|head)
[   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
[   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
[   20.536797] page dumped because: has_unmovable_pages
[   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
[   20.536815] flags: 0x7fffe0000000000()
[   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
[   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000

which are not triggered by the memory hotplug but rather CMA allocator.
The original idea behind dumping the page state for all call paths was
that these messages will be helpful debugging failures.  From the above it
seems that this is not the case for the CMA path because we are lacking
much more context.  E.g the second reported page might be a CMA allocated
page.  It is still interesting to see a slab page in the CMA area but it
is hard to tell whether this is bug from the above output alone.

Address this issue by dumping the page state only on request.  Both
start_isolate_page_range and has_unmovable_pages already have an argument
to ignore hwpoison pages so make this argument more generic and turn it
into flags and allow callers to combine non-default modes into a mask.
While we are at it, has_unmovable_pages call from
is_pageblock_removable_nolock (sysfs removable file) is questionable to
report the failure so drop it from there as well.

Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Michal Hocko
2932c8b050 mm, memory_hotplug: be more verbose for memory offline failures
There is only very limited information printed when the memory offlining
fails:

[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff

This tells us that the failure is triggered by the userspace intervention
but it doesn't tell us much more about the underlying reason.  It might be
that the page migration failes repeatedly and the userspace timeout
expires and send a signal or it might be some of the earlier steps
(isolation, memory notifier) takes too long.

If the migration failes then it would be really helpful to see which page
that and its state.  The same applies to the isolation phase.  If we fail
to isolate a page from the allocator then knowing the state of the page
would be helpful as well.

Dump the page state that fails to get isolated or migrated.  This will
tell us more about the failure and what to focus on during debugging.

[akpm@linux-foundation.org: add missing printk arg]
[mhocko@suse.com: tweak dump_page() `reason' text]
  Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Michal Hocko
7960509329 mm, memory_hotplug: print reason for the offlining failure
The memory offlining failure reporting is inconsistent and insufficient.
Some error paths simply do not report the failure to the log at all.  When
we do report there are no details about the reason of the failure and
there are several of them which makes memory offlining failures hard to
debug.

Make sure that the
	memory offlining [mem %#010llx-%#010llx] failed
message is printed for all failures and also provide a short textual
reason for the failure e.g.

[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff

this tells us that the offlining has failed because of a signal pending
aka user intervention.

[akpm@linux-foundation.org: tweak messages a bit]
Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Michal Hocko
6cc2baf600 mm, memory_hotplug: drop pointless block alignment checks from __offline_pages
This function is never called from a context which would provide
misaligned pfn range so drop the pointless check.

Link: http://lkml.kernel.org/r/20181107101830.17405-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Michal Hocko
e0392cf7c5 mm: lower the printk loglevel for __dump_page messages
__dump_page messages use KERN_EMERG resp.  KERN_ALERT loglevel (this is
the case since 2004).  Most callers of this function are really detecting
a critical page state and BUG right after.  On the other hand the function
is called also from contexts which just want to inform about the page
state and those would rather not disrupt logs that much (e.g.  some
systems route these messages to the normal console).

Reduce the loglevel to KERN_WARNING to make dump_page easier to reuse for
other contexts while those messages will still make it to the kernel log
in most setups.  Even if the loglevel setup filters warnings away those
paths that are really critical already print the more targeted error or
panic and that should make it to the kernel log.

[mhocko@kernel.org: fix __dump_page()]
  Link: http://lkml.kernel.org/r/20181212142540.GA7378@dhcp22.suse.cz
[akpm@linux-foundation.org: s/KERN_WARN/KERN_WARNING/, per Michal]
Link: http://lkml.kernel.org/r/20181107101830.17405-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Michal Hocko
1c6fb1d89e mm: print more information about mapping in __dump_page
I have been promissing to improve memory offlining failures debugging for
quite some time.  As things stand now we get only very limited information
in the kernel log when the offlining fails.  It is usually only

[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed

with no further details.  We do not know what exactly fails and for what
reason.  Whenever I was forced to debug such a failure I've always had to
do a debugging patch to tell me more.  We can enable some tracepoints but
it would be much better to get a better picture without using them.

This patch series does 2 things.  The first one is to make dump_page more
usable by printing more information about the mapping patch 1.  Then it
reduces the log level from emerg to warning so that this function is
usable from less critical context patch 2.  Then I have added more
detailed information about the offlining failure patch 4 and finally add
dump_page to isolation and offlining migration paths.  Patch 3 is a
trivial cleanup.

This patch (of 6):

__dump_page prints the mapping pointer but that is quite unhelpful for
many reports because the pointer itself only helps to distinguish anon/ksm
mappings from other ones (because of lowest bits set).  Sometimes it would
be much more helpful to know what kind of mapping that is actually and if
we know this is a file mapping then also try to resolve the dentry name.

[dan.carpenter@oracle.com: fix a width vs precision bug in printk]
  Link: http://lkml.kernel.org/r/20181123072135.gqvblm2vdujbvfjs@kili.mountain
[mhocko@kernel.org: use %dp to print dentry]
  Link: http://lkml.kernel.org/r/20181125080834.GB12455@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181107101830.17405-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Gao Xiang
20ff1c9505 mm/readahead.c: simplify get_next_ra_size()
It's a trivial simplification for get_next_ra_size() and clear enough for
humans to understand.

It also fixes potential overflow if ra->size(< ra_pages) is too large.

Link: http://lkml.kernel.org/r/1540707206-19649-1-git-send-email-hsiangkao@aol.com
Signed-off-by: Gao Xiang <hsiangkao@aol.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Sean Christopherson
6a90a83f1d mm/mmu_notifier.c: remove mmu_notifier_synchronize()
Contrary to its name, mmu_notifier_synchronize() does not synchronize the
notifier's SRCU instance, but rather waits for RCU callbacks to finish.
i.e.  it invokes rcu_barrier().  The RCU documentation is quite clear on
this matter, explicitly calling out that rcu_barrier() does not imply
synchronize_rcu().

As there are no callers of mmu_notifier_synchronize() and it's unclear
whether any user of mmu_notifier_call_srcu() will ever want to barrier on
their callbacks, simply remove the function.

Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Balbir Singh
5eb570a8d9 mm/hotplug: optimize clear_hwpoisoned_pages()
In hot remove, we try to clear poisoned pages, but a small optimization to
check if num_poisoned_pages is 0 helps remove the iteration through
nr_pages.

[akpm@linux-foundation.org: tweak comment text]
Link: http://lkml.kernel.org/r/20181102120001.4526-1-bsingharora@gmail.com
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Miles Chen
c8f61cfc87 mm/page_owner: clamp read count to PAGE_SIZE
The (root-only) page owner read might allocate a large size of memory with
a large read count.  Allocation fails can easily occur when doing high
order allocations.

Clamp buffer size to PAGE_SIZE to avoid arbitrary size allocation
and avoid allocation fails due to high order allocation.

[akpm@linux-foundation.org: use min_t()]
Link: http://lkml.kernel.org/r/1541091607-27402-1-git-send-email-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Wei Yang
88349a2837 mm/slub.c: record final state of slub action in deactivate_slab()
If __cmpxchg_double_slab() fails and (l != m), current code records
transition states of slub action.

Update the action after __cmpxchg_double_slab() success to record the
final state.

[akpm@linux-foundation.org: more whitespace cleanup]
Link: http://lkml.kernel.org/r/20181107013119.3816-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Wei Yang
6159d0f5c0 mm/slub.c: page is always non-NULL in node_match()
node_match() is a static function and is only invoked in slub.c.

In all three places, `page' is ensured to be valid.

Link: http://lkml.kernel.org/r/20181106150245.1668-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Wei Yang
1265ef2de4 mm/slub.c: remove validation on cpu_slab in __flush_cpu_slab()
cpu_slab is a per cpu variable which is allocated in all or none.  If a
cpu_slab failed to be allocated, the slub is not usable.

We could use cpu_slab without validation in __flush_cpu_slab().

Link: http://lkml.kernel.org/r/20181103141218.22844-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Yangtao Li
221d7da66c mm, slab: remove unnecessary unlikely()
WARN_ON() already contains an unlikely(), so it's not necessary to use
unlikely.

Also change WARN_ON() back to WARN_ON_ONCE() to avoid potentially
spamming dmesg with user-triggerable large allocations.

[akpm@linux-foundation.org: s/WARN_ON/WARN_ON_ONCE/, per Vlastimil]
Link: http://lkml.kernel.org/r/20181104125028.3572-1-tiny.windzz@gmail.com
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Andrey Konovalov
e886bf9d9a kasan: add SPDX-License-Identifier mark to source files
This patch adds a "SPDX-License-Identifier: GPL-2.0" mark to all source
files under mm/kasan.

Link: http://lkml.kernel.org/r/bce2d1e618afa5142e81961ab8fa4b4165337380.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:44 -08:00
Andrey Konovalov
66afc7f1e0 kasan: add __must_check annotations to kasan hooks
This patch adds __must_check annotations to kasan hooks that return a
pointer to make sure that a tagged pointer always gets propagated.

Link: http://lkml.kernel.org/r/03b269c5e453945f724bfca3159d4e1333a8fb1c.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:44 -08:00
Andrey Konovalov
2813b9c029 kasan, mm, arm64: tag non slab memory allocated via pagealloc
Tag-based KASAN doesn't check memory accesses through pointers tagged with
0xff.  When page_address is used to get pointer to memory that corresponds
to some page, the tag of the resulting pointer gets set to 0xff, even
though the allocated memory might have been tagged differently.

For slab pages it's impossible to recover the correct tag to return from
page_address, since the page might contain multiple slab objects tagged
with different values, and we can't know in advance which one of them is
going to get accessed.  For non slab pages however, we can recover the tag
in page_address, since the whole page was marked with the same tag.

This patch adds tagging to non slab memory allocated with pagealloc.  To
set the tag of the pointer returned from page_address, the tag gets stored
to page->flags when the memory gets allocated.

Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:44 -08:00
Andrey Konovalov
7f94ffbc4c kasan: add hooks implementation for tag-based mode
This commit adds tag-based KASAN specific hooks implementation and
adjusts common generic and tag-based KASAN ones.

1. When a new slab cache is created, tag-based KASAN rounds up the size of
   the objects in this cache to KASAN_SHADOW_SCALE_SIZE (== 16).

2. On each kmalloc tag-based KASAN generates a random tag, sets the shadow
   memory, that corresponds to this object to this tag, and embeds this
   tag value into the top byte of the returned pointer.

3. On each kfree tag-based KASAN poisons the shadow memory with a random
   tag to allow detection of use-after-free bugs.

The rest of the logic of the hook implementation is very much similar to
the one provided by generic KASAN. Tag-based KASAN saves allocation and
free stack metadata to the slab object the same way generic KASAN does.

Link: http://lkml.kernel.org/r/bda78069e3b8422039794050ddcb2d53d053ed41.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:44 -08:00
Andrey Konovalov
5b7c414822 mm: move obj_to_index to include/linux/slab_def.h
While with SLUB we can actually preassign tags for caches with contructors
and store them in pointers in the freelist, SLAB doesn't allow that since
the freelist is stored as an array of indexes, so there are no pointers to
store the tags.

Instead we compute the tag twice, once when a slab is created before
calling the constructor and then again each time when an object is
allocated with kmalloc.  Tag is computed simply by taking the lowest byte
of the index that corresponds to the object.  However in kasan_kmalloc we
only have access to the objects pointer, so we need a way to find out
which index this object corresponds to.

This patch moves obj_to_index from slab.c to include/linux/slab_def.h to
be reused by KASAN.

Link: http://lkml.kernel.org/r/c02cd9e574cfd93858e43ac94b05e38f891fef64.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:44 -08:00
Andrey Konovalov
121e8f81d3 kasan: add bug reporting routines for tag-based mode
This commit adds rountines, that print tag-based KASAN error reports.
Those are quite similar to generic KASAN, the difference is:

1. The way tag-based KASAN finds the first bad shadow cell (with a
   mismatching tag). Tag-based KASAN compares memory tags from the shadow
   memory to the pointer tag.

2. Tag-based KASAN reports all bugs with the "KASAN: invalid-access"
   header.

Also simplify generic KASAN find_first_bad_addr.

Link: http://lkml.kernel.org/r/aee6897b1bd077732a315fd84c6b4f234dbfdfcb.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:44 -08:00
Andrey Konovalov
11cd3cd69a kasan: split out generic_report.c from report.c
Move generic KASAN specific error reporting routines to generic_report.c
without any functional changes, leaving common error reporting code in
report.c to be later reused by tag-based KASAN.

Link: http://lkml.kernel.org/r/ba48c32f8e5aefedee78998ccff0413bee9e0f5b.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
772a2fa50f kasan, mm: perform untagged pointers comparison in krealloc
The krealloc function checks where the same buffer was reused or a new one
allocated by comparing kernel pointers.  Tag-based KASAN changes memory
tag on the krealloc'ed chunk of memory and therefore also changes the
pointer tag of the returned pointer.  Therefore we need to perform
comparison on untagged (with tags reset) pointers to check whether it's
the same memory region or not.

Link: http://lkml.kernel.org/r/14f6190d7846186a3506cd66d82446646fe65090.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
4d176711ea kasan: preassign tags to objects with ctors or SLAB_TYPESAFE_BY_RCU
An object constructor can initialize pointers within this objects based on
the address of the object.  Since the object address might be tagged, we
need to assign a tag before calling constructor.

The implemented approach is to assign tags to objects with constructors
when a slab is allocated and call constructors once as usual.  The
downside is that such object would always have the same tag when it is
reallocated, so we won't catch use-after-frees on it.

Also pressign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since
they can be validy accessed after having been freed.

Link: http://lkml.kernel.org/r/f158a8a74a031d66f0a9398a5b0ed453c37ba09a.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
3c9e3aa110 kasan: add tag related helper functions
This commit adds a few helper functions, that are meant to be used to work
with tags embedded in the top byte of kernel pointers: to set, to get or
to reset the top byte.

Link: http://lkml.kernel.org/r/f6c6437bb8e143bc44f42c3c259c62e734be7935.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
080eb83f54 kasan: initialize shadow to 0xff for tag-based mode
A tag-based KASAN shadow memory cell contains a memory tag, that
corresponds to the tag in the top byte of the pointer, that points to that
memory.  The native top byte value of kernel pointers is 0xff, so with
tag-based KASAN we need to initialize shadow memory to 0xff.

[cai@lca.pw: arm64: skip kmemleak for KASAN again\
  Link: http://lkml.kernel.org/r/20181226020550.63712-1-cai@lca.pw
Link: http://lkml.kernel.org/r/5cc1b789aad7c99cf4f3ec5b328b147ad53edb40.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
9577dd7486 kasan: rename kasan_zero_page to kasan_early_shadow_page
With tag based KASAN mode the early shadow value is 0xff and not 0x00, so
this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to
kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion.

Link: http://lkml.kernel.org/r/3fed313280ebf4f88645f5b89ccbc066d320e177.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
2bd926b439 kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS
This commit splits the current CONFIG_KASAN config option into two:
1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one
   that exists now);
2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode.

The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have
another hardware tag-based KASAN mode, that will rely on hardware memory
tagging support in arm64.

With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to
instrument kernel files with -fsantize=kernel-hwaddress (except the ones
for which KASAN_SANITIZE := n is set).

Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both
CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes.

This commit also adds empty placeholder (for now) implementation of
tag-based KASAN specific hooks inserted by the compiler and adjusts
common hooks implementation.

While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option
is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will
enable once all the infrastracture code has been added.

Link: http://lkml.kernel.org/r/b2550106eb8a68b10fefbabce820910b115aa853.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
b938fcf427 kasan: rename source files to reflect the new naming scheme
We now have two KASAN modes: generic KASAN and tag-based KASAN.  Rename
kasan.c to generic.c to reflect that.  Also rename kasan_init.c to init.c
as it contains initialization code for both KASAN modes.

Link: http://lkml.kernel.org/r/88c6fd2a883e459e6242030497230e5fb0d44d44.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
bffa986c6f kasan: move common generic and tag-based code to common.c
Tag-based KASAN reuses a significant part of the generic KASAN code, so
move the common parts to common.c without any functional changes.

Link: http://lkml.kernel.org/r/114064d002356e03bb8cc91f7835e20dc61b51d9.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
12b2238699 kasan, slub: handle pointer tags in early_kmem_cache_node_alloc
The previous patch updated KASAN hooks signatures and their usage in SLAB
and SLUB code, except for the early_kmem_cache_node_alloc function.  This
patch handles that function separately, as it requires to reorder some of
the initialization code to correctly propagate a tagged pointer in case a
tag is assigned by kasan_kmalloc.

Link: http://lkml.kernel.org/r/fc8d0fdcf733a7a52e8d0daaa650f4736a57de8c.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Andrey Konovalov
0116523cff kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.

This patchset adds a new software tag-based mode to KASAN [1].  (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).

The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.

The underlying ideas of the approach used by software tag-based KASAN are:

1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
   pointer tags in the top byte of each kernel pointer.

2. Using shadow memory, we can store memory tags for each chunk of kernel
   memory.

3. On each memory allocation, we can generate a random tag, embed it into
   the returned pointer and set the memory tags that correspond to this
   chunk of memory to the same value.

4. By using compiler instrumentation, before each memory access we can add
   a check that the pointer tag matches the tag of the memory that is being
   accessed.

5. On a tag mismatch we report an error.

With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.

The new mode this patchset adds is called software tag-based KASAN.  The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers.  The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software.  As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.

A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.

Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.

[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html

[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

====== Rationale

On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.

Comment from Vishwath Mohan <vishwath@google.com>:

I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well.  This includes

(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].

These are both places I'd love to have a low(er) memory footprint option at
my disposal.

Comment from Evgenii Stepanov <eugenis@google.com>:

Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB).  KASAN's
overhead of 2x - 3x on top of it is not insignificant.

Not having this overhead enables near-production use - ex.  running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration.  These are the ones that often cost
the most engineering time to track down.

CPU overhead is bad, but generally tolerable.  RAM is critical, in our
experience.  Once it gets low enough, OOM-killer makes your life
miserable.

[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/

====== Technical details

Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:

1. TCR_TBI1 is set to enable Top Byte Ignore.

2. Shadow memory is used (with a different scale, 1:16, so each shadow
   byte corresponds to 16 bytes of kernel memory) to store memory tags.

3. All slab objects are aligned to shadow scale, which is 16 bytes.

4. All pointers returned from the slab allocator are tagged with a random
   tag and the corresponding shadow memory is poisoned with the same value.

5. Compiler instrumentation is used to insert tag checks. Either by
   calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
   CONFIG_KASAN_INLINE flags are reused).

6. When a tag mismatch is detected in callback instrumentation mode
   KASAN simply prints a bug report. In case of inline instrumentation,
   clang inserts a brk instruction, and KASAN has it's own brk handler,
   which reports the bug.

7. The memory in between slab objects is marked with a reserved tag, and
   acts as a redzone.

8. When a slab object is freed it's marked with a reserved tag.

Bug detection is imprecise for two reasons:

1. We won't catch some small out-of-bounds accesses, that fall into the
   same shadow cell, as the last byte of a slab object.

2. We only have 1 byte to store tags, which means we have a 1/256
   probability of a tag match for an incorrect access (actually even
   slightly less due to reserved tag values).

Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.

====== Testing

Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.

It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used.  Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.

This yielded the following results.

The two places that look interesting are:

is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c

Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is.  To make sure debug checks to
those two functions that check that the result doesn't change whether we
operate on pointers with or without untagging has been added.

A few other cases that don't look that interesting:

Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):

tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c

ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c

Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.

Checks that a pointer belongs to some particular allocation:

is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h

Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.

Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.

Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.

====== Benchmarks

The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.

Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN

Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN

Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN

KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
   the more the chance to detect a use-after-free).

Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.

[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.

====== Some notes

A few notes:

1. The patchset can be found here:
   https://github.com/xairy/kasan-prototype/tree/khwasan

2. Building requires a recent Clang version (7.0.0 or later).

3. Stack instrumentation is not supported yet and will be added later.

This patch (of 25):

Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc).  This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.

Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:43 -08:00
Linus Torvalds
792bf4d871 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The biggest RCU changes in this cycle were:

   - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

   - Replace calls of RCU-bh and RCU-sched update-side functions to
     their vanilla RCU counterparts. This series is a step towards
     complete removal of the RCU-bh and RCU-sched update-side functions.

     ( Note that some of these conversions are going upstream via their
       respective maintainers. )

   - Documentation updates, including a number of flavor-consolidation
     updates from Joel Fernandes.

   - Miscellaneous fixes.

   - Automate generation of the initrd filesystem used for rcutorture
     testing.

   - Convert spin_is_locked() assertions to instead use lockdep.

     ( Note that some of these conversions are going upstream via their
       respective maintainers. )

   - SRCU updates, especially including a fix from Dennis Krein for a
     bag-on-head-class bug.

   - RCU torture-test updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
  rcutorture: Don't do busted forward-progress testing
  rcutorture: Use 100ms buckets for forward-progress callback histograms
  rcutorture: Recover from OOM during forward-progress tests
  rcutorture: Print forward-progress test age upon failure
  rcutorture: Print time since GP end upon forward-progress failure
  rcutorture: Print histogram of CB invocation at OOM time
  rcutorture: Print GP age upon forward-progress failure
  rcu: Print per-CPU callback counts for forward-progress failures
  rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
  rcutorture: Dump grace-period diagnostics upon forward-progress OOM
  rcutorture: Prepare for asynchronous access to rcu_fwd_startat
  torture: Remove unnecessary "ret" variables
  rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
  rcutorture: Break up too-long rcu_torture_fwd_prog() function
  rcutorture: Remove cbflood facility
  torture: Bring any extra CPUs online during kernel startup
  rcutorture: Add call_rcu() flooding forward-progress tests
  rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
  tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
  net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
  ...
2018-12-26 13:07:19 -08:00
Linus Torvalds
5694cecdb0 arm64 festive updates for 4.21
In the end, we ended up with quite a lot more than I expected:
 
 - Support for ARMv8.3 Pointer Authentication in userspace (CRIU and
   kernel-side support to come later)
 
 - Support for per-thread stack canaries, pending an update to GCC that
   is currently undergoing review
 
 - Support for kexec_file_load(), which permits secure boot of a kexec
   payload but also happens to improve the performance of kexec
   dramatically because we can avoid the sucky purgatory code from
   userspace. Kdump will come later (requires updates to libfdt).
 
 - Optimisation of our dynamic CPU feature framework, so that all
   detected features are enabled via a single stop_machine() invocation
 
 - KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that
   they can benefit from global TLB entries when KASLR is not in use
 
 - 52-bit virtual addressing for userspace (kernel remains 48-bit)
 
 - Patch in LSE atomics for per-cpu atomic operations
 
 - Custom preempt.h implementation to avoid unconditional calls to
   preempt_schedule() from preempt_enable()
 
 - Support for the new 'SB' Speculation Barrier instruction
 
 - Vectorised implementation of XOR checksumming and CRC32 optimisations
 
 - Workaround for Cortex-A76 erratum #1165522
 
 - Improved compatibility with Clang/LLD
 
 - Support for TX2 system PMUS for profiling the L3 cache and DMC
 
 - Reflect read-only permissions in the linear map by default
 
 - Ensure MMIO reads are ordered with subsequent calls to Xdelay()
 
 - Initial support for memory hotplug
 
 - Tweak the threshold when we invalidate the TLB by-ASID, so that
   mremap() performance is improved for ranges spanning multiple PMDs.
 
 - Minor refactoring and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJcE4TmAAoJELescNyEwWM0Nr0H/iaU7/wQSzHyNXtZoImyKTul
 Blu2ga4/EqUrTU7AVVfmkl/3NBILWlgQVpY6tH6EfXQuvnxqD7CizbHyLdyO+z0S
 B5PsFUH2GLMNAi48AUNqGqkgb2knFbg+T+9IimijDBkKg1G/KhQnRg6bXX32mLJv
 Une8oshUPBVJMsHN1AcQknzKariuoE3u0SgJ+eOZ9yA2ZwKxP4yy1SkDt3xQrtI0
 lojeRjxcyjTP1oGRNZC+BWUtGOT35p7y6cGTnBd/4TlqBGz5wVAJUcdoxnZ6JYVR
 O8+ob9zU+4I0+SKt80s7pTLqQiL9rxkKZ5joWK1pr1g9e0s5N5yoETXKFHgJYP8=
 =sYdt
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 festive updates from Will Deacon:
 "In the end, we ended up with quite a lot more than I expected:

   - Support for ARMv8.3 Pointer Authentication in userspace (CRIU and
     kernel-side support to come later)

   - Support for per-thread stack canaries, pending an update to GCC
     that is currently undergoing review

   - Support for kexec_file_load(), which permits secure boot of a kexec
     payload but also happens to improve the performance of kexec
     dramatically because we can avoid the sucky purgatory code from
     userspace. Kdump will come later (requires updates to libfdt).

   - Optimisation of our dynamic CPU feature framework, so that all
     detected features are enabled via a single stop_machine()
     invocation

   - KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that
     they can benefit from global TLB entries when KASLR is not in use

   - 52-bit virtual addressing for userspace (kernel remains 48-bit)

   - Patch in LSE atomics for per-cpu atomic operations

   - Custom preempt.h implementation to avoid unconditional calls to
     preempt_schedule() from preempt_enable()

   - Support for the new 'SB' Speculation Barrier instruction

   - Vectorised implementation of XOR checksumming and CRC32
     optimisations

   - Workaround for Cortex-A76 erratum #1165522

   - Improved compatibility with Clang/LLD

   - Support for TX2 system PMUS for profiling the L3 cache and DMC

   - Reflect read-only permissions in the linear map by default

   - Ensure MMIO reads are ordered with subsequent calls to Xdelay()

   - Initial support for memory hotplug

   - Tweak the threshold when we invalidate the TLB by-ASID, so that
     mremap() performance is improved for ranges spanning multiple PMDs.

   - Minor refactoring and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (125 commits)
  arm64: kaslr: print PHYS_OFFSET in dump_kernel_offset()
  arm64: sysreg: Use _BITUL() when defining register bits
  arm64: cpufeature: Rework ptr auth hwcaps using multi_entry_cap_matches
  arm64: cpufeature: Reduce number of pointer auth CPU caps from 6 to 4
  arm64: docs: document pointer authentication
  arm64: ptr auth: Move per-thread keys from thread_info to thread_struct
  arm64: enable pointer authentication
  arm64: add prctl control for resetting ptrauth keys
  arm64: perf: strip PAC when unwinding userspace
  arm64: expose user PAC bit positions via ptrace
  arm64: add basic pointer authentication support
  arm64/cpufeature: detect pointer authentication
  arm64: Don't trap host pointer auth use to EL2
  arm64/kvm: hide ptrauth from guests
  arm64/kvm: consistently handle host HCR_EL2 flags
  arm64: add pointer authentication register bits
  arm64: add comments about EC exception levels
  arm64: perf: Treat EXCLUDE_EL* bit definitions as unsigned
  arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field
  arm64: enable per-task stack canaries
  ...
2018-12-25 17:41:56 -08:00
Linus Torvalds
4971f090aa drm pull request for 4.21-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcExwOAAoJEAx081l5xIa+euIP/1NZZvSB+bsCtOwDG8I6uWsS
 OU5JUZ8q2dqyyFagRxzlkeSt3uWJqKp5NyNwuc9z/5u6AGF+3/97D0J1lG6Os/st
 4abF6NadivYJ4cXhJ1ddIHOFMVDcAsyMWNDb93NwPwncCsQ0jt5FFOsrCyj6BGY+
 ihHFlHrIyDrbBGDHz+u1E/EO5WkNnaLDoC+/k2fTRWCNI3bQL3O+orsYTI6S2uvU
 lQJnRfYAllgLD2p1k/rrBHcHXBv50roR0e8uhGmbdhGdp5bEW30UGBLHXxQjjSVy
 fQCwFwTO8X6zoxU53Zbbk+MVrp+jkTHcGKViHRuLkaHzE5mX26UXDwlXdN32ZUbK
 yHOJp+uDaWXX7MIz0LsB9Iqj2+eIUoFaIJMoZTMGVTNvqnTxKnoHnjAtbTH2u258
 teFgmy4BIgPgo2kwEnBEZjCapou0Eivyut2wq8bTAB2Fe8LwURJpr3cioTtMLlUO
 L5/PoD27eFvBCAeFrQIwF3b2XiQEnBpXocmilEwP1xDMPgoyeePAfIF2iEpDvi0U
 jce3rLd2yVvo92xYUgoHkVTD8si/pKKnZ1D0U3+RI6pxK6s0HJEHjcNEMdvdm+2S
 4qgvBQV3wlWFkXEK8PR5BHPoLntg18tKon/BTLBjgGkN9E1o9fWs1/s6KQGY4xdo
 l3Vvfx2LTdkgEoBssSwB
 =wh4W
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2018-12-14' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:
 "Core:
   - shared fencing staging removal
   - drop transactional atomic helpers and move helpers to new location
   - DP/MST atomic cleanup
   - Leasing cleanups and drop EXPORT_SYMBOL
   - Convert drivers to atomic helpers and generic fbdev.
   - removed deprecated obj_ref/unref in favour of get/put
   - Improve dumb callback documentation
   - MODESET_LOCK_BEGIN/END helpers

  panels:
   - CDTech panels, Banana Pi Panel, DLC1010GIG,
   - Olimex LCD-O-LinuXino, Samsung S6D16D0, Truly NT35597 WQXGA,
   - Himax HX8357D, simulated RTSM AEMv8.
   - GPD Win2 panel
   - AUO G101EVN010

  vgem:
   - render node support

  ttm:
   - move global init out of drivers
   - fix LRU handling for ghost objects
   - Support for simultaneous submissions to multiple engines

  scheduler:
   - timeout/fault handling changes to help GPU recovery
   - helpers for hw with preemption support

  i915:
   - Scaler/Watermark fixes
   - DP MST + powerwell fixes
   - PSR fixes
   - Break long get/put shmemfs pages
   - Icelake fixes
   - Icelake DSI video mode enablement
   - Engine workaround improvements

  amdgpu:
   - freesync support
   - GPU reset enabled on CI, VI, SOC15 dGPUs
   - ABM support in DC
   - KFD support for vega12/polaris12
   - SDMA paging queue on vega
   - More amdkfd code sharing
   - DCC scanout on GFX9
   - DC kerneldoc
   - Updated SMU firmware for GFX8 chips
   - XGMI PSP + hive reset support
   - GPU reset
   - DC trace support
   - Powerplay updates for newer Polaris
   - Cursor plane update fast path
   - kfd dma-buf support

  virtio-gpu:
   - add EDID support

  vmwgfx:
   - pageflip with damage support

  nouveau:
   - Initial Turing TU104/TU106 modesetting support

  msm:
   - a2xx gpu support for apq8060 and imx5
   - a2xx gpummu support
   - mdp4 display support for apq8060
   - DPU fixes and cleanups
   - enhanced profiling support
   - debug object naming interface
   - get_iova/page pinning decoupling

  tegra:
   - Tegra194 host1x, VIC and display support enabled
   - Audio over HDMI for Tegra186 and Tegra194

  exynos:
   - DMA/IOMMU refactoring
   - plane alpha + blend mode support
   - Color format fixes for mixer driver

  rcar-du:
   - R8A7744 and R8A77470 support
   - R8A77965 LVDS support

  imx:
   - fbdev emulation fix
   - multi-tiled scalling fixes
   - SPDX identifiers

  rockchip
   - dw_hdmi support
   - dw-mipi-dsi + dual dsi support
   - mailbox read size fix

  qxl:
   - fix cursor pinning

  vc4:
   - YUV support (scaling + cursor)

  v3d:
   - enable TFU (Texture Formatting Unit)

  mali-dp:
   - add support for linear tiled formats

  sun4i:
   - Display Engine 3 support
   - H6 DE3 mixer 0 support
   - H6 display engine support
   - dw-hdmi support
   - H6 HDMI phy support
   - implicit fence waiting
   - BGRX8888 support

  meson:
   - Overlay plane support
   - implicit fence waiting
   - HDMI 1.4 4k modes

  bridge:
   - i2c fixes for sii902x"

* tag 'drm-next-2018-12-14' of git://anongit.freedesktop.org/drm/drm: (1403 commits)
  drm/amd/display: Add fast path for cursor plane updates
  drm/amdgpu: Enable GPU recovery by default for CI
  drm/amd/display: Fix duplicating scaling/underscan connector state
  drm/amd/display: Fix unintialized max_bpc state values
  Revert "drm/amd/display: Set RMX_ASPECT as default"
  drm/amdgpu: Fix stub function name
  drm/msm/dpu: Fix clock issue after bind failure
  drm/msm/dpu: Clean up dpu_media_info.h static inline functions
  drm/msm/dpu: Further cleanups for static inline functions
  drm/msm/dpu: Cleanup the debugfs functions
  drm/msm/dpu: Remove dpu_irq and unused functions
  drm/msm: Make irq_postinstall optional
  drm/msm/dpu: Cleanup callers of dpu_hw_blk_init
  drm/msm/dpu: Remove unused functions
  drm/msm/dpu: Remove dpu_crtc_is_enabled()
  drm/msm/dpu: Remove dpu_crtc_get_mixer_height
  drm/msm/dpu: Remove dpu_dbg
  drm/msm: dpu: Remove crtc_lock
  drm/msm: dpu: Remove vblank_requested flag from dpu_crtc
  drm/msm: dpu: Separate crtc assignment from vblank enable
  ...
2018-12-25 11:48:26 -08:00
Oscar Salvador
17e2e7d7e1 mm, page_alloc: fix has_unmovable_pages for HugePages
While playing with gigantic hugepages and memory_hotplug, I triggered
the following #PF when "cat memoryX/removable":

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  #PF error: [normal kernel read fault]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 1 PID: 1481 Comm: cat Tainted: G            E     4.20.0-rc6-mm1-1-default+ #18
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  RIP: 0010:has_unmovable_pages+0x154/0x210
  Call Trace:
   is_mem_section_removable+0x7d/0x100
   removable_show+0x90/0xb0
   dev_attr_show+0x1c/0x50
   sysfs_kf_seq_show+0xca/0x1b0
   seq_read+0x133/0x380
   __vfs_read+0x26/0x180
   vfs_read+0x89/0x140
   ksys_read+0x42/0x90
   do_syscall_64+0x5b/0x180
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason is we do not pass the Head to page_hstate(), and so, the call
to compound_order() in page_hstate() returns 0, so we end up checking
all hstates's size to match PAGE_SIZE.

Obviously, we do not find any hstate matching that size, and we return
NULL.  Then, we dereference that NULL pointer in
hugepage_migration_supported() and we got the #PF from above.

Fix that by getting the head page before calling page_hstate().

Also, since gigantic pages span several pageblocks, re-adjust the logic
for skipping pages.  While are it, we can also get rid of the
round_up().

[osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
  Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-21 14:51:18 -08:00
Peter Xu
2e83ee1d86 mm: thp: fix flags for pmd migration when split
When splitting a huge migrating PMD, we'll transfer all the existing PMD
bits and apply them again onto the small PTEs.  However we are fetching
the bits unconditionally via pmd_soft_dirty(), pmd_write() or
pmd_yound() while actually they don't make sense at all when it's a
migration entry.  Fix them up.  Since at it, drop the ifdef together as
not needed.

Note that if my understanding is correct about the problem then if
without the patch there is chance to lose some of the dirty bits in the
migrating pmd pages (on x86_64 we're fetching bit 11 which is part of
swap offset instead of bit 2) and it could potentially corrupt the
memory of an userspace program which depends on the dirty bit.

Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-21 14:51:18 -08:00
Mikhail Zaslonko
2830bf6f05 mm, memory_hotplug: initialize struct pages for the full memory section
If memory end is not aligned with the sparse memory section boundary,
the mapping of such a section is only partly initialized.  This may lead
to VM_BUG_ON due to uninitialized struct page access from
is_mem_section_removable() or test_pages_in_a_zone() function triggered
by memory_hotplug sysfs handlers:

Here are the the panic examples:
 CONFIG_DEBUG_VM=y
 CONFIG_DEBUG_VM_PGFLAGS=y

 kernel parameter mem=2050M
 --------------------------
 page:000003d082008000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
 ( test_pages_in_a_zone+0xde/0x160)
   show_valid_zones+0x5c/0x190
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   test_pages_in_a_zone+0xde/0x160
 Kernel panic - not syncing: Fatal exception: panic_on_oops

 kernel parameter mem=3075M
 --------------------------
 page:000003d08300c000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
 ( is_mem_section_removable+0xb4/0x190)
   show_mem_removable+0x9a/0xd8
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   is_mem_section_removable+0xb4/0x190
 Kernel panic - not syncing: Fatal exception: panic_on_oops

Fix the problem by initializing the last memory section of each zone in
memmap_init_zone() till the very end, even if it goes beyond the zone end.

Michal said:

: This has alwways been problem AFAIU.  It just went unnoticed because we
: have zeroed memmaps during allocation before f7f99100d8 ("mm: stop
: zeroing memory during allocation in vmemmap") and so the above test
: would simply skip these ranges as belonging to zone 0 or provided a
: garbage.
:
: So I guess we do care for post f7f99100d8 kernels mostly and
: therefore Fixes: f7f99100d8 ("mm: stop zeroing memory during
: allocation in vmemmap")

Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
Fixes: f7f99100d8 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-21 14:51:18 -08:00
Mike Rapoport
f496990f1f slab: make kmem_cache_create{_usercopy} description proper kernel-doc
Add the description for kmem_cache_create, fixup the return value paragraph
and make both kmem_cache_create and add the second '*' to the comment
opening.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-12-20 08:33:16 -07:00
Dennis Zhou
6ab7d47bcb percpu: convert spin_lock_irq to spin_lock_irqsave.
From Michael Cree:
  "Bisection lead to commit b38d08f318 ("percpu: restructure
   locking") as being the cause of lockups at initial boot on
   the kernel built for generic Alpha.

   On a suggestion by Tejun Heo that:

   So, the only thing I can think of is that it's calling
   spin_unlock_irq() while irq handling isn't set up yet.
   Can you please try the followings?

   1. Convert all spin_[un]lock_irq() to
      spin_lock_irqsave/unlock_irqrestore()."

Fixes: b38d08f318 ("percpu: restructure locking")
Reported-and-tested-by: Michael Cree <mcree@orcon.net.nz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2018-12-18 09:04:08 -08:00
Yongkai Wu
8ace22bce8 hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page()
A stack trace was triggered by VM_BUG_ON_PAGE(page_mapcount(page), page)
in free_huge_page().  Unfortunately, the page->mapping field was set to
NULL before this test.  This made it more difficult to determine the
root cause of the problem.

Move the VM_BUG_ON_PAGE tests earlier in the function so that if they do
trigger more information is present in the page struct.

Link: http://lkml.kernel.org/r/1543491843-23438-1-git-send-email-nic_w@163.com
Signed-off-by: Yongkai Wu <nic_w@163.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-14 15:05:45 -08:00
Yueyi Li
f5a222dc2f memblock: annotate memblock_is_reserved() with __init_memblock
Found warning:

  WARNING: EXPORT symbol "gsi_write_channel_scratch" [vmlinux] version generation failed, symbol will not be versioned.
  WARNING: vmlinux.o(.text+0x1e0a0): Section mismatch in reference from the function valid_phys_addr_range() to the function .init.text:memblock_is_reserved()
  The function valid_phys_addr_range() references
  the function __init memblock_is_reserved().
  This is often because valid_phys_addr_range lacks a __init
  annotation or the annotation of memblock_is_reserved is wrong.

Use __init_memblock instead of __init.

Link: http://lkml.kernel.org/r/BLUPR13MB02893411BF12EACB61888E80DFAE0@BLUPR13MB0289.namprd13.prod.outlook.com
Signed-off-by: Yueyi Li <liyueyi@live.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-14 15:05:45 -08:00
Logan Gunthorpe
9def36e0fa mm/sparse: add common helper to mark all memblocks present
Presently the arches arm64, arm and sh have a function which loops
through each memblock and calls memory present.  riscv will require a
similar function.

Introduce a common memblocks_present() function that can be used by all
the arches.  Subsequent patches will cleanup the arches that make use of
this.

Link: http://lkml.kernel.org/r/20181107205433.3875-3-logang@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-14 15:05:45 -08:00
Linus Torvalds
880b9df1bf XArray updates for 4.20-rc7
Two bugfixes, each with test-suite updates, two improvements to the
 test-suite without associated bugs, and one patch adding a missing API.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCgAyFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAlwS8ZUUHHdpbGx5QGlu
 ZnJhZGVhZC5vcmcACgkQDpNsjXcpgj5h0wf9Fmc3z3WjmX05he+XKhGq1jQuHYVi
 zt8Eggsc7ns1hX8xPdwSw240CDOCBcbXxCyNL9KFCqlIkfxTAe8pYgoTDKuXhVAK
 U7VTCHCxJpsYzfhkEke5DaASGb/YP1kmvoTJs7qCfhBuI9ERXLVK6cESJNDZhlMA
 /d7VfRwRiqSLnK13AXPZAA9Pnw2GtAolMDU9CC9nOtMRlRDVwsQiwYiQ/mBRYK00
 u0LoruwBJ7XAoe7Bo1CFmkvJuIV794cmhqkEY2cY85e9aoj15+BDqOu1la8DTaOl
 e7+7PwK1I6Ed6DfPixGleUP7BYHHXCfb/RVEYn22qGC/YHUQRtpbwrY37Q==
 =b+pK
 -----END PGP SIGNATURE-----

Merge tag 'xarray-4.20-rc7' of git://git.infradead.org/users/willy/linux-dax

Pull XArray fixes from Matthew Wilcox:
 "Two bugfixes, each with test-suite updates, two improvements to the
  test-suite without associated bugs, and one patch adding a missing
  API"

* tag 'xarray-4.20-rc7' of git://git.infradead.org/users/willy/linux-dax:
  XArray: Fix xa_alloc when id exceeds max
  XArray tests: Check iterating over multiorder entries
  XArray tests: Handle larger indices more elegantly
  XArray: Add xa_cmpxchg_irq and xa_cmpxchg_bh
  radix tree: Don't return retry entries from lookup
2018-12-13 16:35:58 -08:00
Steve Capper
f6795053da mm: mmap: Allow for "high" userspace addresses
This patch adds support for "high" userspace addresses that are
optionally supported on the system and have to be requested via a hint
mechanism ("high" addr parameter to mmap).

Architectures such as powerpc and x86 achieve this by making changes to
their architectural versions of arch_get_unmapped_* functions. However,
on arm64 we use the generic versions of these functions.

Rather than duplicate the generic arch_get_unmapped_* implementations
for arm64, this patch instead introduces two architectural helper macros
and applies them to arch_get_unmapped_*:
 arch_get_mmap_end(addr) - get mmap upper limit depending on addr hint
 arch_get_mmap_base(addr, base) - get mmap_base depending on addr hint

If these macros are not defined in architectural code then they default
to (TASK_SIZE) and (base) so should not introduce any behavioural
changes to architectures that do not define them.

Signed-off-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-12-10 18:42:17 +00:00
Jens Axboe
96f774106e Linux 4.20-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwNpb0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwGwH/00UHnXfxww3ixxz
 zwTVDzptA6SPm6s84yJOWatM5fXhPiAltZaHSYF9lzRzNU71NCq7Frhq3fQUIXKM
 OxqDn9nfSTWcjWTk2q5n2keyRV/KIn67YX7UgqFc1bO/mqtVjEgNWaMyblhI+e9E
 giu1ZXayHr43jK1cDOmGExZubXUq7Vsc9TOlrd+d2SwIqeEP7TCMrPhnHDwCNvX2
 UU5dtANpVzGtHaBcr37wJj+L8kODCc0f+PQ3g2ar5jTHst5SLlHp2u0AMRnUmgdi
 VkGx+mu/uk8mtwUqMIMqhplklVoqK6LTeLqsY5Xt32SKruw9UqyJGdphLjW2QP/g
 MkmA1lI=
 =7kaD
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc6' into for-4.21/block

Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
two corruption fixes. We're going to be overhauling the direct dispatch
path, and we need to do that on top of the changes we made for that
in mainline.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-09 17:45:40 -07:00
Linus Torvalds
fa82dcbf2a dax fixes 4.20-rc6
* Fix the Xarray conversion of fsdax to properly handle
 dax_lock_mapping_entry() in the presense of pmd entries.
 
 * Fix inode destruction racing a new lock request.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcDLC5AAoJEB7SkWpmfYgC0yAP/2NULGPapIoR2aw7GKPOStHR
 tOMGE8+ZDtpd19oVoEiB3PuhMkiCj5Pkyt2ka9eUXSk3VR/N6EtGwBANFUO+0tEh
 yU8nca2T2G1f+MsZbraWA9zj9aPnfpLt46r7p/AjIB2j1/vyXkYMkXmkXtCvMEUi
 3Q9dVsT53fWncJBlTe/WW84wWQp77VsVafN4/UP75dv8cN9F+u91P1xvUXu2AKus
 RBqjlp6G6xMZ9HcWCQmStdPCi+qbO4k3oTZcohd/DYLb70ZL1kLoEMOjK2/3GnaV
 YMHZGWRAK5jroDZbdXXnJHE+4/opZpSWhW/J1WFEJUSSr7YmZsWhwegmrBlQ5cp3
 V0fZ7LjBWjgtUKsjlmfmu4ccLY87LzVlfIEtJH2+QNiLy/Pl/O0P/Jzg9kHVLEgo
 zph2yIQlx3yshej8EEBPylDJVtYVzOyjLNLx7r3bL6AzQv+6CKqjxtYOiIDwYQsA
 Db1InfeDzAAoH6NYSbqN+yG01Bz7zeCYT7Ch4rYSKcA1i8vUMR6iObI32fC3xJQm
 +TxYmqngrJaU8XOMikIem/qGn4fBpYFULrIdK7lj/WmIxQq8Hp0mqz1ikMIoErzN
 ZXV5RcbbvSSZi5EOrrVKbr2IfIcx+HYo/bIAkRCR69o8bcIZnLPHDky/7gl7GS/5
 sc9A/kysseeF45L2+6QG
 =kUil
 -----END PGP SIGNATURE-----

Merge tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax fixes from Dan Williams:
 "The last of the known regression fixes and fallout from the Xarray
  conversion of the filesystem-dax implementation.

  On the path to debugging why the dax memory-failure injection test
  started failing after the Xarray conversion a couple more fixes for
  the dax_lock_mapping_entry(), now called dax_lock_page(), surfaced.
  Those plus the bug that started the hunt are now addressed. These
  patches have appeared in a -next release with no issues reported.

  Note the touches to mm/memory-failure.c are just the conversion to the
  new function signature for dax_lock_page().

  Summary:

   - Fix the Xarray conversion of fsdax to properly handle
     dax_lock_mapping_entry() in the presense of pmd entries

   - Fix inode destruction racing a new lock request"

* tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: Fix unlock mismatch with updated API
  dax: Don't access a freed inode
  dax: Check page->mapping isn't NULL
2018-12-09 09:54:04 -08:00
David Rientjes
356ff8a9a7 Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
This reverts commit 89c83fb539.

This should have been done as part of 2f0799a0ff ("mm, thp: restore
node-local hugepage allocations").  The movement of the thp allocation
policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
intended to only set __GFP_THISNODE for mempolicies that are not
MPOL_BIND whereas the revert could set this regardless of mempolicy.

While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
and alloc_pages_vma() was racy, that has since been removed since the
revert.  What is left is the possibility to use __GFP_THISNODE in
policy_node() when it is unexpected because the special handling for
hugepages in alloc_pages_vma()  was removed as part of the consolidation.

Secondly, prior to 89c83fb539, alloc_pages_vma() implemented a somewhat
different policy for hugepage allocations, which were allocated through
alloc_hugepage_vma().  For hugepage allocations, if the allocating
process's node is in the set of allowed nodes, allocate with
__GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
__GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to
allow fallback to other nodes in 89c83fb539 as it did for new_page() in
mm/mempolicy.c which is functionally different behavior and removes the
requirement to only allocate hugepages locally.

So this commit does a full revert of 89c83fb539 instead of the partial
revert that was done in 2f0799a0ff.  The result is the same thp
allocation policy for 4.20 that was in 4.19.

Fixes: 89c83fb539 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Fixes: 2f0799a0ff ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-08 10:26:20 -08:00
Dennis Zhou
6a7f6d86a5 blkcg: associate a blkg for pages being evicted by swap
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify the way
swap tags bios to include the blkg. Writeback will be tackle in the next
patch.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:37 -07:00
Matthew Wilcox
55f3f7eab7 XArray: Add xa_cmpxchg_irq and xa_cmpxchg_bh
These convenience wrappers match the other _irq and _bh wrappers we
already have.  It turns out I'd already open-coded xa_cmpxchg_irq()
in the shmem code, so convert that.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-12-06 08:26:17 -05:00
David Rientjes
2f0799a0ff mm, thp: restore node-local hugepage allocations
This is a full revert of ac5b2c1891 ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539 ("mm, thp:
consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").

By not setting __GFP_THISNODE, applications can allocate remote hugepages
when the local node is fragmented or low on memory when either the thp
defrag setting is "always" or the vma has been madvised with
MADV_HUGEPAGE.

Remote access to hugepages often has much higher latency than local pages
of the native page size.  On Haswell, ac5b2c1891 was shown to have a
13.9% access regression after this commit for binaries that remap their
text segment to be backed by transparent hugepages.

The intent of ac5b2c1891 is to address an issue where a local node is
low on memory or fragmented such that a hugepage cannot be allocated.  In
every scenario where this was described as a fix, there is abundant and
unfragmented remote memory available to allocate from, even with a greater
access latency.

If remote memory is also low or fragmented, not setting __GFP_THISNODE was
also measured on Haswell to have a 40% regression in allocation latency.

Restore __GFP_THISNODE for thp allocations.

Fixes: ac5b2c1891 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
Fixes: 89c83fb539 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-05 15:45:54 -08:00
Matthew Wilcox
27359fd6e5 dax: Fix unlock mismatch with updated API
Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.

In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry it is possible that the pfn
performed to do the lookup is misaligned to the value retrieved in the
Xarray.

Change the api to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:

 WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
 RIP: 0010:dax_insert_entry+0x2b2/0x2d0
 [..]
 Call Trace:
  dax_iomap_pte_fault.isra.41+0x791/0xde0
  ext4_dax_huge_fault+0x16f/0x1f0
  ? up_read+0x1c/0xa0
  __do_fault+0x1f/0x160
  __handle_mm_fault+0x1033/0x1490
  handle_mm_fault+0x18b/0x3d0

Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-12-04 21:32:00 -08:00
Jens Axboe
89d04ec349 Linux 4.20-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwEZdIeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAlQH/19oax2Za3IPqF4X
 DM3lal5M6zlUVkoYstqzpbR3MqUwgEnMfvoeMDC6mI9N4/+r2LkV7cRR8HzqQCCS
 jDfD69IzRGb52VSeJmbOrkxBWsR1Nn0t4Z3rEeLPxwaOoNpRc8H973MbAQ2FKMpY
 S4Y3jIK1dNiRRxdh52NupVkQF+djAUwkBuVk/rrvRJmTDij4la03cuCDAO+Di9lt
 GHlVvygKw2SJhDR+z3ArwZNmE0ceCcE6+W7zPHzj2KeWuKrZg22kfUD454f2YEIw
 FG0hu9qecgtpYCkLSm2vr4jQzmpsDoyq3ZfwhjGrP4qtvPC3Db3vL3dbQnkzUcJu
 JtwhVCE=
 =O1q1
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc5' into for-4.21/block

Pull in v4.20-rc5, solving a conflict we'll otherwise get in aio.c and
also getting the merge fix that went into mainline that users are
hitting testing for-4.21/block and/or for-next.

* tag 'v4.20-rc5': (664 commits)
  Linux 4.20-rc5
  PCI: Fix incorrect value returned from pcie_get_speed_cap()
  MAINTAINERS: Update linux-mips mailing list address
  ocfs2: fix potential use after free
  mm/khugepaged: fix the xas_create_range() error path
  mm/khugepaged: collapse_shmem() do not crash on Compound
  mm/khugepaged: collapse_shmem() without freezing new_page
  mm/khugepaged: minor reorderings in collapse_shmem()
  mm/khugepaged: collapse_shmem() remember to clear holes
  mm/khugepaged: fix crashes due to misaccounted holes
  mm/khugepaged: collapse_shmem() stop if punched or truncated
  mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
  mm/huge_memory: splitting set mapping+index before unfreeze
  mm/huge_memory: rename freeze_page() to unmap_page()
  initramfs: clean old path before creating a hardlink
  kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
  psi: make disabling/enabling easier for vendor kernels
  proc: fixup map_files test on arm
  debugobjects: avoid recursive calls with kmemleak
  userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
  ...
2018-12-04 09:38:05 -07:00
Ingo Molnar
4bbfd7467c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

- Replace calls of RCU-bh and RCU-sched update-side functions
  to their vanilla RCU counterparts.  This series is a step
  towards complete removal of the RCU-bh and RCU-sched update-side
  functions.

  ( Note that some of these conversions are going upstream via their
    respective maintainers. )

- Documentation updates, including a number of flavor-consolidation
  updates from Joel Fernandes.

- Miscellaneous fixes.

- Automate generation of the initrd filesystem used for
  rcutorture testing.

- Convert spin_is_locked() assertions to instead use lockdep.

  ( Note that some of these conversions are going upstream via their
    respective maintainers. )

- SRCU updates, especially including a fix from Dennis Krein
  for a bag-on-head-class bug.

- RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-04 07:52:30 +01:00
Paul E. McKenney
eaaf055f27 Merge branches 'bug.2018.11.12a', 'consolidate.2018.12.01a', 'doc.2018.11.12a', 'fixes.2018.11.12a', 'initrd.2018.11.08b', 'sil.2018.11.12a' and 'srcu.2018.11.27a' into HEAD
bug.2018.11.12a:  Get rid of BUG_ON() and friends
consolidate.2018.12.01a:  Continued RCU flavor-consolidation cleanup
doc.2018.11.12a:  Documentation updates
fixes.2018.11.12a:  Miscellaneous fixes
initrd.2018.11.08b:  Automate creation of rcutorture initrd
sil.2018.11.12a:  Remove more spin_unlock_wait() calls
2018-12-01 12:43:16 -08:00
Hugh Dickins
95feeabb77 mm/khugepaged: fix the xas_create_range() error path
collapse_shmem()'s xas_nomem() is very unlikely to fail, but it is
rightly given a failure path, so move the whole xas_create_range() block
up before __SetPageLocked(new_page): so that it does not need to
remember to unlock_page(new_page).

Add the missing mem_cgroup_cancel_charge(), and set (currently unused)
result to SCAN_FAIL rather than SCAN_SUCCEED.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261531200.2275@eggly.anvils
Fixes: 77da9389b9 ("mm: Convert collapse_shmem to XArray")
Signed-off-by: Hugh Dickins <hughd@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
06a5e1268a mm/khugepaged: collapse_shmem() do not crash on Compound
collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
it holds page lock of the first page, racing truncation then extension
might conceivably have inserted a hugepage there already.  Fail with the
SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
otherwise mishandling the unexpected hugepage - though later we might
code up a more constructive way of handling it, with SCAN_SUCCESS.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
Fixes: f3f0e1d215 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
87c460a0bd mm/khugepaged: collapse_shmem() without freezing new_page
khugepaged's collapse_shmem() does almost all of its work, to assemble
the huge new_page from 512 scattered old pages, with the new_page's
refcount frozen to 0 (and refcounts of all old pages so far also frozen
to 0).  Including shmem_getpage() to read in any which were out on swap,
memory reclaim if necessary to allocate their intermediate pages, and
copying over all the data from old to new.

Imagine the frozen refcount as a spinlock held, but without any lock
debugging to highlight the abuse: it's not good, and under serious load
heads into lockups - speculative getters of the page are not expecting
to spin while khugepaged is rescheduled.

One can get a little further under load by hacking around elsewhere; but
fortunately, freezing the new_page turns out to have been entirely
unnecessary, with no hacks needed elsewhere.

The huge new_page lock is already held throughout, and guards all its
subpages as they are brought one by one into the page cache tree; and
anything reading the data in that page, without the lock, before it has
been marked PageUptodate, would already be in the wrong.  So simply
eliminate the freezing of the new_page.

Each of the old pages remains frozen with refcount 0 after it has been
replaced by a new_page subpage in the page cache tree, until they are
all unfrozen on success or failure: just as before.  They could be
unfrozen sooner, but cause no problem once no longer visible to
find_get_entry(), filemap_map_pages() and other speculative lookups.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
Fixes: f3f0e1d215 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
042a308248 mm/khugepaged: minor reorderings in collapse_shmem()
Several cleanups in collapse_shmem(): most of which probably do not
really matter, beyond doing things in a more familiar and reassuring
order.  Simplify the failure gotos in the main loop, and on success
update stats while interrupts still disabled from the last iteration.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
Fixes: f3f0e1d215 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
2af8ff2918 mm/khugepaged: collapse_shmem() remember to clear holes
Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
flags khugepaged uses to allocate a huge page - in all common cases it
would just be a waste of effort - so collapse_shmem() must remember to
clear out any holes that it instantiates.

The obvious place to do so, where they are put into the page cache tree,
is not a good choice: because interrupts are disabled there.  Leave it
until further down, once success is assured, where the other pages are
copied (before setting PageUptodate).

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
Fixes: f3f0e1d215 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
aaa52e3400 mm/khugepaged: fix crashes due to misaccounted holes
Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
closed and unlinked.

khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
on the rollback path, after it had added but then needs to undo some
holes.

There is indeed an irritating asymmetry between shmem_charge(), whose
callers want it to increment nrpages after successfully accounting
blocks, and shmem_uncharge(), when __delete_from_page_cache() already
decremented nrpages itself: oh well, just add a comment on that to them
both.

And shmem_recalc_inode() is supposed to be called when the accounting is
expected to be in balance (so it can deduce from imbalance that reclaim
discarded some pages): so change shmem_charge() to update nrpages
earlier (though it's rare for the difference to matter at all).

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Fixes: f3f0e1d215 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
701270fa19 mm/khugepaged: collapse_shmem() stop if punched or truncated
Huge tmpfs testing showed that although collapse_shmem() recognizes a
concurrently truncated or hole-punched page correctly, its handling of
holes was liable to refill an emptied extent.  Add check to stop that.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
Fixes: f3f0e1d215 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
006d3ff27e mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
__split_huge_page() was using i_size_read() while holding the irq-safe
lru_lock and page tree lock, but the 32-bit i_size_read() uses an
irq-unsafe seqlock which should not be nested inside them.

Instead, read the i_size earlier in split_huge_page_to_list(), and pass
the end offset down to __split_huge_page(): all while holding head page
lock, which is enough to prevent truncation of that extent before the
page tree lock has been taken.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
Fixes: baa355fd33 ("thp: file pages support for split_huge_page()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
173d9d9fd3 mm/huge_memory: splitting set mapping+index before unfreeze
Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s
VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page).

Move the setting of mapping and index up before the page_ref_unfreeze()
in __split_huge_page_tail() to fix this: so that a page cache lookup
cannot get a reference while the tail's mapping and index are unstable.

In fact, might as well move them up before the smp_wmb(): I don't see an
actual need for that, but if I'm missing something, this way round is
safer than the other, and no less efficient.

You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is
misplaced, and should be left until after the trylock_page(); but left as
is has not crashed since, and gives more stringent assurance.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils
Fixes: e9b61f1985 ("thp: reintroduce split_huge_page()")
Requires: 605ca5ede7 ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:15 -08:00
Hugh Dickins
906f9cdfc2 mm/huge_memory: rename freeze_page() to unmap_page()
The term "freeze" is used in several ways in the kernel, and in mm it
has the particular meaning of forcing page refcount temporarily to 0.
freeze_page() is just too confusing a name for a function that unmaps a
page: rename it unmap_page(), and rename unfreeze_page() remap_page().

Went to change the mention of freeze_page() added later in mm/rmap.c,
but found it to be incorrect: ordinary page reclaim reaches there too;
but the substance of the comment still seems correct, so edit it down.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
Fixes: e9b61f1985 ("thp: reintroduce split_huge_page()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Andrea Arcangeli
dcf7fe9d89 userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
Set the page dirty if VM_WRITE is not set because in such case the pte
won't be marked dirty and the page would be reclaimed without writepage
(i.e.  swapout in the shmem case).

This was found by source review.  Most apps (certainly including QEMU)
only use UFFDIO_COPY on PROT_READ|PROT_WRITE mappings or the app can't
modify the memory in the first place.  This is for correctness and it
could help the non cooperative use case to avoid unexpected data loss.

Link: http://lkml.kernel.org/r/20181126173452.26955-6-aarcange@redhat.com
Reviewed-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Fixes: 4c27fe4c4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Andrea Arcangeli
e2a50c1f64 userfaultfd: shmem: add i_size checks
With MAP_SHARED: recheck the i_size after taking the PT lock, to
serialize against truncate with the PT lock.  Delete the page from the
pagecache if the i_size_read check fails.

With MAP_PRIVATE: check the i_size after the PT lock before mapping
anonymous memory or zeropages into the MAP_PRIVATE shmem mapping.

A mostly irrelevant cleanup: like we do the delete_from_page_cache()
pagecache removal after dropping the PT lock, the PT lock is a spinlock
so drop it before the sleepable page lock.

Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com
Fixes: 4c27fe4c4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Andrea Arcangeli
29ec90660d userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
After the VMA to register the uffd onto is found, check that it has
VM_MAYWRITE set before allowing registration.  This way we inherit all
common code checks before allowing to fill file holes in shmem and
hugetlbfs with UFFDIO_COPY.

The userfaultfd memory model is not applicable for readonly files unless
it's a MAP_PRIVATE.

Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
Fixes: ff62a34210 ("hugetlb: implement memfd sealing")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
Reported-by: Jann Horn <jannh@google.com>
Fixes: 4c27fe4c4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Cc: <stable@vger.kernel.org>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Andrea Arcangeli
5b51072e97 userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
Userfaultfd did not create private memory when UFFDIO_COPY was invoked
on a MAP_PRIVATE shmem mapping.  Instead it wrote to the shmem file,
even when that had not been opened for writing.  Though, fortunately,
that could only happen where there was a hole in the file.

Fix the shmem-backed implementation of UFFDIO_COPY to create private
memory for MAP_PRIVATE mappings.  The hugetlbfs-backed implementation
was already correct.

This change is visible to userland, if userfaultfd has been used in
unintended ways: so it introduces a small risk of incompatibility, but
is necessary in order to respect file permissions.

An app that uses UFFDIO_COPY for anything like postcopy live migration
won't notice the difference, and in fact it'll run faster because there
will be no copy-on-write and memory waste in the tmpfs pagecache
anymore.

Userfaults on MAP_PRIVATE shmem keep triggering only on file holes like
before.

The real zeropage can also be built on a MAP_PRIVATE shmem mapping
through UFFDIO_ZEROPAGE and that's safe because the zeropage pte is
never dirty, in turn even an mprotect upgrading the vma permission from
PROT_READ to PROT_READ|PROT_WRITE won't make the zeropage pte writable.

Link: http://lkml.kernel.org/r/20181126173452.26955-3-aarcange@redhat.com
Fixes: 4c27fe4c4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Andrea Arcangeli
9e368259ad userfaultfd: use ENOENT instead of EFAULT if the atomic copy user fails
Patch series "userfaultfd shmem updates".

Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
lack of the VM_MAYWRITE check and the lack of i_size checks.

Then looking into the above we also fixed the MAP_PRIVATE case.

Hugh by source review also found a data loss source if UFFDIO_COPY is
used on shmem MAP_SHARED PROT_READ mappings (the production usages
incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
happen in those production usages like with QEMU).

The whole patchset is marked for stable.

We verified QEMU postcopy live migration with guest running on shmem
MAP_PRIVATE run as well as before after the fix of shmem MAP_PRIVATE.
Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU
unconditionally invokes a punch hole if the guest mapping is filebacked
and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and
for the anon backend).

This patch (of 5):

We internally used EFAULT to communicate with the caller, switch to
ENOENT, so EFAULT can be used as a non internal retval.

Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
Fixes: 4c27fe4c4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Wei Yang
8f416836c0 mm/page_alloc.c: fix calculation of pgdat->nr_zones
init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
'zone_idx(zone) + 1' unconditionally.  This is correct in the normal
case, while not exact in hot-plug situation.

This function is used in two places:

  * free_area_init_core()
  * move_pfn_range_to_zone()

In the first case, we are sure zone index increase monotonically.  While
in the second one, this is under users control.

One way to reproduce this is:
----------------------------

1. create a virtual machine with empty node1

   -m 4G,slots=32,maxmem=32G \
   -smp 4,maxcpus=8          \
   -numa node,nodeid=0,mem=4G,cpus=0-3 \
   -numa node,nodeid=1,mem=0G,cpus=4-7

2. hot-add cpu 3-7

   cpu-add [3-7]

2. hot-add memory to nod1

   object_add memory-backend-ram,id=ram0,size=1G
   device_add pc-dimm,id=dimm0,memdev=ram0,node=1

3. online memory with following order

   echo online_movable > memory47/state
   echo online > memory40/state

After this, node1 will have its nr_zones equals to (ZONE_NORMAL + 1)
instead of (ZONE_MOVABLE + 1).

Michal said:
 "Having an incorrect nr_zones might result in all sorts of problems
  which would be quite hard to debug (e.g. reclaim not considering the
  movable zone). I do not expect many users would suffer from this it
  but still this is trivial and obviously right thing to do so
  backporting to the stable tree shouldn't be harmful (last famous
  words)"

Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Yu Zhao
c1cb20d437 mm: use swp_offset as key in shmem_replace_page()
We changed the key of swap cache tree from swp_entry_t.val to
swp_offset.  We need to do so in shmem_replace_page() as well.

Hugh said:
 "shmem_replace_page() has been wrong since the day I wrote it: good
  enough to work on swap "type" 0, which is all most people ever use
  (especially those few who need shmem_replace_page() at all), but
  broken once there are any non-0 swp_type bits set in the higher order
  bits"

Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com
Fixes: f6ab1f7f6b ("mm, swap: use offset of swap entry as key of swap cache")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>	[4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
Pavel Tikhomirov
6ff38bd402 mm: cleancache: fix corruption on missed inode invalidation
If all pages are deleted from the mapping by memory reclaim and also
moved to the cleancache:

__delete_from_page_cache
  (no shadow case)
  unaccount_page_cache_page
    cleancache_put_page
  page_cache_delete
    mapping->nrpages -= nr
    (nrpages becomes 0)

We don't clean the cleancache for an inode after final file truncation
(removal).

truncate_inode_pages_final
  check (nrpages || nrexceptional) is false
    no truncate_inode_pages
      no cleancache_invalidate_inode(mapping)

These way when reading the new file created with same inode we may get
these trash leftover pages from cleancache and see wrong data instead of
the contents of the new file.

Fix it by always doing truncate_inode_pages which is already ready for
nrpages == 0 && nrexceptional == 0 case and just invalidates inode.

[akpm@linux-foundation.org: add comment, per Jan]
Link: http://lkml.kernel.org/r/20181112095734.17979-1-ptikhomirov@virtuozzo.com
Fixes: commit 91b0abe36a ("mm + fs: store shadow entries in page cache")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:14 -08:00
John Hubbard
08be37b798 mm/gup: finish consolidating error handling
Commit df06b37ffe ("mm/gup: cache dev_pagemap while pinning pages")
attempted to operate on each page that get_user_pages had retrieved.  In
order to do that, it created a common exit point from the routine.
However, one case was missed, which this patch fixes up.

Also, there was still an unnecessary shadow declaration (with a
different type) of the "ret" variable, which this patch removes.

Keith's description of the situation is:

  This also fixes a potentially leaked dev_pagemap reference count if a
  failure occurs when an iteration crosses a vma boundary.  I don't think
  it's normal to have different vma's on a users mapped zone device
  memory, but good to fix anyway.

I actually thought that this code:

    /* first iteration or cross vma bound */
    if (!vma || start >= vma->vm_end) {
	        vma = find_extend_vma(mm, start);
	        if (!vma && in_gate_area(mm, start)) {
		            ret = get_gate_page(mm, start & PAGE_MASK,
		                    gup_flags, &vma,
		                    pages ? &pages[i] : NULL);
		            if (ret)
		                goto out;

dealt with the "you're trying to pin the gate page, as part of this
call", rather than the generic case of crossing a vma boundary.  (I
think there's a fine point that I must be overlooking.) But it's still a
valid case, either way.

Link: http://lkml.kernel.org/r/20181121081402.29641-2-jhubbard@nvidia.com
Fixes: df06b37ffe ("mm/gup: cache dev_pagemap while pinning pages")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 14:56:13 -08:00
Paul E. McKenney
b401ec1848 mm: Replace call_rcu_sched() with call_rcu()
Now that call_rcu()'s callback is not invoked until after all
preempt-disable regions of code have completed (in addition to explicitly
marked RCU read-side critical sections), call_rcu() can be used in place
of call_rcu_sched().  This commit therefore makes that change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-27 09:21:46 -08:00
Paul E. McKenney
6564a25e6c slab: Replace synchronize_sched() with synchronize_rcu()
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu().  This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
2018-11-27 09:21:45 -08:00
Jens Axboe
0a1b8b87d0 block: make blk_poll() take a parameter on whether to spin or not
blk_poll() has always kept spinning until it found an IO. This is
fine for SYNC polling, since we need to find one request we have
pending, but in preparation for ASYNC polling it can be beneficial
to just check if we have any entries available or not.

Existing callers are converted to pass in 'spin == true', to retain
the old behavior.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-26 08:25:53 -07:00
Jani Nikula
2ac5e38ea4 Merge drm/drm-next into drm-intel-next-queued
Pull in v4.20-rc3 via drm-next.

Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2018-11-20 13:14:08 +02:00
Jens Axboe
849a370016 block: avoid ordered task state change for polled IO
For the core poll helper, the task state setting don't need to imply any
atomics, as it's the current task itself that is being modified and
we're not going to sleep.

For IRQ driven, the wakeup path have the necessary barriers to not need
us using the heavy handed version of the task state setting.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-19 08:34:49 -07:00
Jens Axboe
a78b03bc73 Linux 4.20-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlvx2sAeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGycgIAIuxobwt0RRKa0zO
 ROS+34JGoC2yU2P9VdEGWdtxS6ANMVQgKPBhWL6s+xR89Kd+V4xSdJLD1pNTxxqP
 0DCva0np1/Q4juH+JbU50v/lykoLgteZ0P0LBRGf1y8p3WiLPv45IbnNsMDNYhB2
 7a8rOmZYakRY9CPznRDw3X8cJt3sddKgFJHIOGz1OQJVWtCD0KPGcJmQNsbDSagY
 Zx6Z5BKSIdjRqaAdN5gDa1Pft3WQo7TpaQGl80lSsgr5LcjmscXA3sClOCy+25Mo
 FZLx0PcwP+Efq8RTGzNK51WSOMa6d37hvjDqUAdQBOR0KbyjRyXQwyQVw/MGbPJs
 7J3Pzm0=
 =56Mt
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc3' into for-4.21/block

Merge in -rc3 to resolve a few conflicts, but also to get a few
important fixes that have gone into mainline since the block
4.21 branch was forked off (most notably the SCSI queue issue,
which is both a conflict AND needed fix).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-18 15:46:03 -07:00
Chen Chang
45e79815b8 mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
Link: http://lkml.kernel.org/r/20181107100247.13359-1-rainccrun@gmail.com
Signed-off-by: Chen Chang <rainccrun@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:10 -08:00
Michal Hocko
c63ae43ba5 mm, page_alloc: check for max order in hot path
Konstantin has noticed that kvmalloc might trigger the following
warning:

  WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
  [...]
  Call Trace:
   fragmentation_index+0x76/0x90
   compaction_suitable+0x4f/0xf0
   shrink_node+0x295/0x310
   node_reclaim+0x205/0x250
   get_page_from_freelist+0x649/0xad0
   __alloc_pages_nodemask+0x12a/0x2a0
   kmalloc_large_node+0x47/0x90
   __kmalloc_node+0x22b/0x2e0
   kvmalloc_node+0x3e/0x70
   xt_alloc_table_info+0x3a/0x80 [x_tables]
   do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
   nf_setsockopt+0x44/0x60
   SyS_setsockopt+0x6f/0xc0
   do_syscall_64+0x67/0x120
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already.  This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile.  A recent UBSAN report just underlines that
by the following report

  UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
  shift exponent 51 is too large for 32-bit type 'int'
  CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0xd2/0x148 lib/dump_stack.c:113
   ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
   __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
   __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
   zone_watermark_fast mm/page_alloc.c:3216 [inline]
   get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
   __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
   alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
   alloc_pages include/linux/gfp.h:509 [inline]
   __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
   dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
   raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
   raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
   fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
   fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
   __blkdev_driver_ioctl block/ioctl.c:303 [inline]
   blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
   block_ioctl+0x105/0x150 fs/block_dev.c:1883
   vfs_ioctl fs/ioctl.c:46 [inline]
   do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
   ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
   __do_sys_ioctl fs/ioctl.c:709 [inline]
   __se_sys_ioctl fs/ioctl.c:707 [inline]
   __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
   do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note that this is not a kvmalloc path.  It is just that the fast path
really depends on having sanitzed order as well.  Therefore move the
order check to the fast path.

Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:10 -08:00
Yufen Yu
1a41364693 tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.

man 2 lseek says

:      EINVAL whence  is  not  valid.   Or: the resulting file offset would be
:             negative, or beyond the end of a seekable device.
:
:      ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is  beyond
:             the end of the file.

Make tmpfs return ENXIO under these circumstances as well.  After this,
tmpfs also passes xfstests's generic/448.

[akpm@linux-foundation.org: rewrite changelog]
Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:10 -08:00
Janne Huttunen
13c9aaf7fa mm/vmstat.c: fix NUMA statistics updates
Scan through the whole array to see if an update is needed.  While we're
at it, use sizeof() to be safe against any possible type changes in the
future.

The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus.  Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway.  So I wouldn't bother to mark
this for stable, yet something nice to fix.

[mhocko@suse.com: changelog enhancement]
Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
Fixes: 1d90ca897c ("mm: update NUMA counter threshold size")
Signed-off-by: Janne Huttunen <janne.huttunen@nokia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:10 -08:00
Mike Rapoport
78179556e7 mm/gup.c: fix follow_page_mask() kerneldoc comment
Commit df06b37ffe ("mm/gup: cache dev_pagemap while pinning pages")
modified the signature of follow_page_mask() but left the parameter
description behind.

Update the description to make the code and comments agree again.

While at it, update formatting of the return value description to match
Documentation/doc-guide/kernel-doc.rst guidelines.

Link: http://lkml.kernel.org/r/1541603316-27832-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:10 -08:00
Michal Hocko
9d7899999c mm, memory_hotplug: check zone_movable in has_unmovable_pages
Page state checks are racy.  Under a heavy memory workload (e.g.  stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet.  A debugging patch to
dump the struct page state shows

  has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
  page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
  flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)

Note that the state has been checked for both PageLRU and PageSwapBacked
already.  Closing this race completely would require some sort of retry
logic.  This can be tricky and error prone (think of potential endless
or long taking loops).

Workaround this problem for movable zones at least.  Such a zone should
only contain movable pages.  Commit 15c30bc090 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though.  Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check.  Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.

Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
Fixes: 15c30bc090 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:09 -08:00
Vasily Averin
873d7bcfd0 mm/swapfile.c: use kvzalloc for swap_info_struct allocation
Commit a2468cc9bf ("swap: choose swap device according to numa node")
changed 'avail_lists' field of 'struct swap_info_struct' to an array.
In popular linux distros it increased size of swap_info_struct up to 40
Kbytes and now swap_info_struct allocation requires order-4 page.
Switch to kvzmalloc allows to avoid unexpected allocation failures.

Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
Fixes: a2468cc9bf ("swap: choose swap device according to numa node")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:09 -08:00
Mike Kravetz
5e41540c8a hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
This bug has been experienced several times by the Oracle DB team.  The
BUG is in remove_inode_hugepages() as follows:

	/*
	 * If page is mapped, it was faulted in after being
	 * unmapped in caller.  Unmap (again) now after taking
	 * the fault mutex.  The mutex will prevent faults
	 * until we finish removing the page.
	 *
	 * This race can only happen in the hole punch case.
	 * Getting here in a truncate operation is a bug.
	 */
	if (unlikely(page_mapped(page))) {
		BUG_ON(truncate_op);

In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code.  Consider the following:

 - Process A maps a hugetlbfs file of sufficient size and alignment
   (PUD_SIZE) that a pmd page could be shared.

 - Process B maps the same hugetlbfs file with the same size and
   alignment such that a pmd page is shared.

 - Process B then calls mprotect() to change protections for the mapping
   with the shared pmd. As a result, the pmd is 'unshared'.

 - Process B then calls mprotect() again to chage protections for the
   mapping back to their original value. pmd remains unshared.

 - Process B then forks and process C is created. During the fork
   process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
   tables. Copying page tables for hugetlb mappings is done in the
   routine copy_hugetlb_page_range.

In copy_hugetlb_page_range(), the destination pte is obtained by:

	dst_pte = huge_pte_alloc(dst, addr, sz);

If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table.  In the situation above, process C could share with
either process A or process B.  Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.

However, the check for pmd sharing in copy_hugetlb_page_range is:

	/* If the pagetables are shared don't copy or take references */
	if (dst_pte == src_pte)
		continue;

Since process C is sharing with process A instead of process B, the
above test fails.  The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page.  This is how we end up with an elevated map count.

To solve, check the dst_pte entry for huge_pte_none.  If !none, this
implies PMD sharing so do not copy.

Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes: c5c99429fa ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:09 -08:00
Vitaly Wool
ca0246bb97 z3fold: fix possible reclaim races
Reclaim and free can race on an object which is basically fine but in
order for reclaim to be able to map "freed" object we need to encode
object length in the handle.  handle_to_chunks() is then introduced to
extract object length from a handle and use it during mapping.

Moreover, to avoid racing on a z3fold "headless" page release, we should
not try to free that page in z3fold_free() if the reclaim bit is set.
Also, in the unlikely case of trying to reclaim a page being freed, we
should not proceed with that page.

While at it, fix the page accounting in reclaim function.

This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".

Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Signed-off-by: Jongseok Kim <ks77sj@gmail.com>
Reported-by-by: Jongseok Kim <ks77sj@gmail.com>
Reviewed-by: Snild Dolkow <snild@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:09 -08:00
Jens Axboe
0619317ff8 block: add polled wakeup task helper
If we're polling for IO on a device that doesn't use interrupts, then
IO completion loop (and wake of task) is done by submitting task itself.
If that is the case, then we don't need to enter the wake_up_process()
function, we can simply mark ourselves as TASK_RUNNING.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-16 08:34:30 -07:00
Lance Roy
35f3aa39f2 mm: Replace spin_is_locked() with lockdep
lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy <ldr709@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12 09:06:22 -08:00
Kuo-Hsin Yang
64e3d12f76 mm, drm/i915: mark pinned shmemfs pages as unevictable
The i915 driver uses shmemfs to allocate backing storage for gem
objects. These shmemfs pages can be pinned (increased ref count) by
shmem_read_mapping_page_gfp(). When a lot of pages are pinned, vmscan
wastes a lot of time scanning these pinned pages. In some extreme case,
all pages in the inactive anon lru are pinned, and only the inactive
anon lru is scanned due to inactive_ratio, the system cannot swap and
invokes the oom-killer. Mark these pinned pages as unevictable to speed
up vmscan.

Export pagevec API check_move_unevictable_pages().

This patch was inspired by Chris Wilson's change [1].

[1]: https://patchwork.kernel.org/patch/9768741/

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com> # mm part
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20181106132324.17390-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
2018-11-07 15:28:32 +00:00
Michal Hocko
dd33ad7b25 memory_hotplug: cond_resched in __remove_pages
We have received a bug report that unbinding a large pmem (>1TB) can
result in a soft lockup:

  NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
  [...]
  Supported: Yes
  CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
  Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
  task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
  RIP: 0010:__put_page+0x62/0x80
  Call Trace:
   devm_memremap_pages_release+0x152/0x260
   release_nodes+0x18d/0x1d0
   device_release_driver_internal+0x160/0x210
   unbind_store+0xb3/0xe0
   kernfs_fop_write+0x102/0x180
   __vfs_write+0x26/0x150
   vfs_write+0xad/0x1a0
   SyS_write+0x42/0x90
   do_syscall_64+0x74/0x150
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
  RIP: 0033:0x7fd13166b3d0

It has been reported on an older (4.12) kernel but the current upstream
code doesn't cond_resched in the hot remove code at all and the given
range to remove might be really large.  Fix the issue by calling
cond_resched once per memory section.

Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Dan Williams <dan.j.williams@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:38 -07:00
Michal Hocko
89c83fb539 mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
THP allocation mode is quite complex and it depends on the defrag mode.
This complexity is hidden in alloc_hugepage_direct_gfpmask from a large
part currently. The NUMA special casing (namely __GFP_THISNODE) is
however independent and placed in alloc_pages_vma currently. This both
adds an unnecessary branch to all vma based page allocation requests and
it makes the code more complex unnecessarily as well. Not to mention
that e.g. shmem THP used to do the node reclaiming unconditionally
regardless of the defrag mode until recently. This was not only
unexpected behavior but it was also hardly a good default behavior and I
strongly suspect it was just a side effect of the code sharing more than
a deliberate decision which suggests that such a layering is wrong.

Get rid of the thp special casing from alloc_pages_vma and move the
logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
resulting gfp mask only when the direct reclaim is not requested and
when there is no explicit numa binding to preserve the current logic.

Please note that there's also a slight difference wrt MPOL_BIND now. The
previous code would avoid using __GFP_THISNODE if the local node was
outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
for all MPOL_BIND policies. So there's a difference that if local node
is actually allowed by the bind policy's nodemask, previously
__GFP_THISNODE would be added, but now it won't be. From the behavior
POV this is still correct because the policy nodemask is used.

Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Andrea Arcangeli
ac5b2c1891 mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
THP allocation might be really disruptive when allocated on NUMA system
with the local node full or hard to reclaim.  Stefan has posted an
allocation stall report on 4.12 based SLES kernel which suggests the
same issue:

  kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
  kvm cpuset=/ mems_allowed=0-1
  CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
  Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
  Call Trace:
   dump_stack+0x5c/0x84
   warn_alloc+0xe0/0x180
   __alloc_pages_slowpath+0x820/0xc90
   __alloc_pages_nodemask+0x1cc/0x210
   alloc_pages_vma+0x1e5/0x280
   do_huge_pmd_wp_page+0x83f/0xf00
   __handle_mm_fault+0x93d/0x1060
   handle_mm_fault+0xc6/0x1b0
   __do_page_fault+0x230/0x430
   do_page_fault+0x2a/0x70
   page_fault+0x7b/0x80
   [...]
  Mem-Info:
  active_anon:126315487 inactive_anon:1612476 isolated_anon:5
   active_file:60183 inactive_file:245285 isolated_file:0
   unevictable:15657 dirty:286 writeback:1 unstable:0
   slab_reclaimable:75543 slab_unreclaimable:2509111
   mapped:81814 shmem:31764 pagetables:370616 bounce:0
   free:32294031 free_pcp:6233 free_cma:0
  Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
  Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE for THP requests which are
requesting the direct reclaim.  This effectivelly reverts 5265047ac3
on the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free.  While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases.  The existing behaviour
is similar, if not as widespare as it applies to a corner case but
crucially, it cannot be tuned around like zone_reclaim_mode can.  The
default behaviour should always be to cause the least harm for the
common case.

If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top.  Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Mel said:

: Both patches look correct to me but I'm responding to this one because
: it's the fix.  The change makes sense and moves further away from the
: severe stalling behaviour we used to see with both THP and zone reclaim
: mode.
:
: I put together a basic experiment with usemem configured to reference a
: buffer multiple times that is 80% the size of main memory on a 2-socket
: box with symmetric node sizes and defrag set to "always".  The defrag
: setting is not the default but it would be functionally similar to
: accessing a buffer with madvise(MADV_HUGEPAGE).  Usemem is configured to
: reference the buffer multiple times and while it's not an interesting
: workload, it would be expected to complete reasonably quickly as it fits
: within memory.  The results were;
:
: usemem
:                                   vanilla           noreclaim-v1
: Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
: Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
: Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)
:
: This shows the elapsed time in seconds for 1 thread, 3 threads and 4
: threads referencing buffers 80% the size of memory.  With the patches
: applied, it's 37.18% faster for the single thread and 73% faster with two
: threads.  Note that 4 threads showing little difference does not indicate
: the problem is related to thread counts.  It's simply the case that 4
: threads gets spread so their workload mostly fits in one node.
:
: The overall view from /proc/vmstats is more startling
:
:                          4.19.0-rc1  4.19.0-rc1
:                             vanillanoreclaim-v1r1
: Minor Faults               35593425      708164
: Major Faults                 484088          36
: Swap Ins                    3772837           0
: Swap Outs                   3932295           0
:
: Massive amounts of swap in/out without the patch
:
: Direct pages scanned        6013214           0
: Kswapd pages scanned              0           0
: Kswapd pages reclaimed            0           0
: Direct pages reclaimed      4033009           0
:
: Lots of reclaim activity without the patch
:
: Kswapd efficiency              100%        100%
: Kswapd velocity               0.000       0.000
: Direct efficiency               67%        100%
: Direct velocity           11191.956       0.000
:
: Mostly from direct reclaim context as you'd expect without the patch.
:
: Page writes by reclaim  3932314.000       0.000
: Page writes file                 19           0
: Page writes anon            3932295           0
: Page reclaim immediate        42336           0
:
: Writes from reclaim context is never good but the patch eliminates it.
:
: We should never have default behaviour to thrash the system for such a
: basic workload.  If zone reclaim mode behaviour is ever desired but on a
: single task instead of a global basis then the sensible option is to build
: a mempolicy that enforces that behaviour.

This was a severe regression compared to previous kernels that made
important workloads unusable and it starts when __GFP_THISNODE was
added to THP allocations under MADV_HUGEPAGE.  It is not a significant
risk to go to the previous behavior before __GFP_THISNODE was added, it
worked like that for years.

This was simply an optimization to some lucky workloads that can fit in
a single node, but it ended up breaking the VM for others that can't
possibly fit in a single node, so going back is safe.

[mhocko@suse.com: rewrote the changelog based on the one from Andrea]
Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
Fixes: 5265047ac3 ("mm, thp: really limit transparent hugepage allocation to local node")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <stable@vger.kernel.org>	[4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Roman Gushchin
e68599a3c3 mm: handle no memcg case in memcg_kmem_charge() properly
Mike Galbraith reported a regression caused by the commit 9b6f7e163c
("mm: rework memcg kernel stack accounting") on a system with
"cgroup_disable=memory" boot option: the system panics with the following
stack trace:

  BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
  PGD 0 P4D 0
  Oops: 0002 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 1 Comm: systemd Not tainted 4.19.0-preempt+ #410
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fed4
  RIP: 0010:page_counter_try_charge+0x22/0xc0
  Code: 41 5d c3 c3 0f 1f 40 00 0f 1f 44 00 00 48 85 ff 0f 84 a7 00 00 00 41 56 48 89 f8 49 89 fe 49
  Call Trace:
   try_charge+0xcb/0x780
   memcg_kmem_charge_memcg+0x28/0x80
   memcg_kmem_charge+0x8b/0x1d0
   copy_process.part.41+0x1ca/0x2070
   _do_fork+0xd7/0x3d0
   do_syscall_64+0x5a/0x180
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

The problem occurs because get_mem_cgroup_from_current() returns the NULL
pointer if memory controller is disabled.  Let's check if this is a case
at the beginning of memcg_kmem_charge() and just return 0 if
mem_cgroup_disabled() returns true.  This is how we handle this case in
many other places in the memory controller code.

Link: http://lkml.kernel.org/r/20181029215123.17830-1-guro@fb.com
Fixes: 9b6f7e163c ("mm: rework memcg kernel stack accounting")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Linus Torvalds
5f21585384 for-linus-20181102
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvchGgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj/1D/4kEQx4ncnFoZk8QshHV1L++rH3BbcLjQDd
 Wbh9ZSIQdI/gHTzS6bE7x3YfcbpMWPMO3+jFawdfRiFTEjlF8vQ+mnJ+Btb3z4D6
 mGEeFGVhHExlp2a0x/Ma8YWVNlMB7BE8Tq73bZEVMY+9lbpmDW/vp7Sfa87LBDKQ
 ZmY+My+VdHN7qLtQ7t3W/HtpbU+kcXMMd3ICjK4i+ofXy6mynk4+oQ2jwyXc5L86
 UCJCsTsSRr3CgbnkW/uprHo0XHk8i7O/4C3oR+x4pAIxCCa9g+vmw0EO9fvi/2iQ
 qe8jKdm7Y09xu/TiPBa7iz45tdh0cNMJKo3OezmSF9Np+r69KL5C/U4GRPKN3Iwm
 keoqn14ScABkYMSe4ys1AdEgKD6bNUaW3r/lJxTH2oUR23mjnCLp7c4WD/G+MlbB
 CzoakQyCHTZmDFLr2Kc8bkjmpil2T2UFfmLIDAu30LWIYeSGpiIO/V+g1foJMF2f
 06ERltNvgX1BJjoh4NSWySLEf1ZtkUU60NeATRol6gwhnIyLrHsgfm6OEhqlW/7x
 Xc1BWyzX7K6c3Dskk/u5aSRyXOyRC9KkMt3/2XexeDNHkte9yMH0IgSvopPBuER8
 +iPvPjNp7ychTKZB3zpSnlqGgePTjbufIEBtO3OyUmDZKjUqxahtxkQfmPhoclu+
 XdR4ArcqNg==
 =0zM4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20181102' of git://git.kernel.dk/linux-block

Pull block layer fixes from Jens Axboe:
 "The biggest part of this pull request is the revert of the blkcg
  cleanup series. It had one fix earlier for a stacked device issue, but
  another one was reported. Rather than play whack-a-mole with this,
  revert the entire series and try again for the next kernel release.

  Apart from that, only small fixes/changes.

  Summary:

   - Indentation fixup for mtip32xx (Colin Ian King)

   - The blkcg cleanup series revert (Dennis Zhou)

   - Two NVMe fixes. One fixing a regression in the nvme request
     initialization in this merge window, causing nvme-fc to not work.
     The other is a suspend/resume p2p resource issue (James, Keith)

   - Fix sg discard merge, allowing us to merge in cases where we didn't
     before (Jianchao Wang)

   - Call rq_qos_exit() after the queue is frozen, preventing a hang
     (Ming)

   - Fix brd queue setup, fixing an oops if we fail setting up all
     devices (Ming)"

* tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
  nvme-pci: fix conflicting p2p resource adds
  nvme-fc: fix request private initialization
  blkcg: revert blkcg cleanups series
  block: brd: associate with queue until adding disk
  block: call rq_qos_exit() after queue is frozen
  mtip32xx: clean an indentation issue, remove extraneous tabs
  block: fix the DISCARD request merge
2018-11-02 11:25:48 -07:00
Linus Torvalds
c2aa1a444c vfs: rework data cloning infrastructure
Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use
 a common .remap_file_range method and supply generic bounds and sanity checking
 functions that are shared with the data write path. The current VFS
 infrastructure has problems with rlimit, LFS file sizes, file time stamps,
 maximum filesystem file sizes, stripping setuid bits, etc and so they are
 addressed in these commits.
 
 We also introduce the ability for the ->remap_file_range methods to return short
 clones so that clones for vfs_copy_file_range() don't get rejected if the entire
 range can't be cloned. It also allows filesystems to sliently skip deduplication
 of partial EOF blocks if they are not capable of doing so without requiring
 errors to be thrown to userspace.
 
 All existing filesystems are converted to user the new .remap_file_range method,
 and both XFS and ocfs2 are modified to make use of the new generic checking
 infrastructure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJb29gEAAoJEK3oKUf0dfodpOAQAL2VbHjvKXEwNMDTKscSRMmZ
 Z0xXo3gamFKQ+VGOqy2g2lmAYQs9SAnTuCGTJ7zIAp7u+q8gzUy5FzKAwLS4Id6L
 8siaY6nzlicfO04d0MdXnWz0f3xykChgzfdQfVUlUi7WrDioBUECLPmx4a+USsp1
 DQGjLOZfoOAmn2rijdnH9RTEaHqg+8mcTaLN9TRav4gGqrWxldFKXw2y6ouFC7uo
 /hxTRNXR9VI+EdbDelwBNXl9nU9gQA0WLOvRKwgUrtv6bSJohTPsmXt7EbBtNcVR
 cl3zDNc1sLD1bLaRLEUAszI/33wXaaQgom1iB51obIcHHef+JxRNG/j6rUMfzxZI
 VaauGv5EIvtaKN0LTAqVVLQ8t2MQFYfOr8TykmO+1UFog204aKRANdVMHDSjxD/0
 dTGKJGcq+HnKQ+JHDbTdvuXEL8sUUl1FiLjOQbZPw63XmuddLKFUA2TOjXn6htbU
 1h1MG5d9KjGLpabp2BQheczD08NuSmcrOBNt7IoeI3+nxr3HpMwprfB9TyaERy9X
 iEgyVXmjjc9bLLRW7A2wm77aW64NvPs51wKMnvuNgNwnCewrGS6cB8WVj2zbQjH1
 h3f3nku44s9ctNPSBzb/sJLnpqmZQ5t0oSmrMSN+5+En6rNTacoJCzxHRJBA7z/h
 Z+C6y1GTZw0euY6Zjiwu
 =CE/A
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull vfs dedup fixes from Dave Chinner:
 "This reworks the vfs data cloning infrastructure.

  We discovered many issues with these interfaces late in the 4.19 cycle
  - the worst of them (data corruption, setuid stripping) were fixed for
  XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all
  the problems was needed. That rework is the contents of this pull
  request.

  Rework the vfs_clone_file_range and vfs_dedupe_file_range
  infrastructure to use a common .remap_file_range method and supply
  generic bounds and sanity checking functions that are shared with the
  data write path. The current VFS infrastructure has problems with
  rlimit, LFS file sizes, file time stamps, maximum filesystem file
  sizes, stripping setuid bits, etc and so they are addressed in these
  commits.

  We also introduce the ability for the ->remap_file_range methods to
  return short clones so that clones for vfs_copy_file_range() don't get
  rejected if the entire range can't be cloned. It also allows
  filesystems to sliently skip deduplication of partial EOF blocks if
  they are not capable of doing so without requiring errors to be thrown
  to userspace.

  Existing filesystems are converted to user the new remap_file_range
  method, and both XFS and ocfs2 are modified to make use of the new
  generic checking infrastructure"

* tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
  xfs: remove [cm]time update from reflink calls
  xfs: remove xfs_reflink_remap_range
  xfs: remove redundant remap partial EOF block checks
  xfs: support returning partial reflink results
  xfs: clean up xfs_reflink_remap_blocks call site
  xfs: fix pagecache truncation prior to reflink
  ocfs2: remove ocfs2_reflink_remap_range
  ocfs2: support partial clone range and dedupe range
  ocfs2: fix pagecache truncation prior to reflink
  ocfs2: truncate page cache for clone destination file before remapping
  vfs: clean up generic_remap_file_range_prep return value
  vfs: hide file range comparison function
  vfs: enable remap callers that can handle short operations
  vfs: plumb remap flags through the vfs dedupe functions
  vfs: plumb remap flags through the vfs clone functions
  vfs: make remap_file_range functions take and return bytes completed
  vfs: remap helper should update destination inode metadata
  vfs: pass remap flags to generic_remap_checks
  vfs: pass remap flags to generic_remap_file_range_prep
  vfs: combine the clone and dedupe into a single remap_file_range
  ...
2018-11-02 09:33:08 -07:00
Linus Torvalds
9931a07d51 Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull AFS updates from Al Viro:
 "AFS series, with some iov_iter bits included"

* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  missing bits of "iov_iter: Separate type from direction and use accessor functions"
  afs: Probe multiple fileservers simultaneously
  afs: Fix callback handling
  afs: Eliminate the address pointer from the address list cursor
  afs: Allow dumping of server cursor on operation failure
  afs: Implement YFS support in the fs client
  afs: Expand data structure fields to support YFS
  afs: Get the target vnode in afs_rmdir() and get a callback on it
  afs: Calc callback expiry in op reply delivery
  afs: Fix FS.FetchStatus delivery from updating wrong vnode
  afs: Implement the YFS cache manager service
  afs: Remove callback details from afs_callback_break struct
  afs: Commit the status on a new file/dir/symlink
  afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
  afs: Don't invoke the server to read data beyond EOF
  afs: Add a couple of tracepoints to log I/O errors
  afs: Handle EIO from delivery function
  afs: Fix TTL on VL server and address lists
  afs: Implement VL server rotation
  afs: Improve FS server rotation error handling
  ...
2018-11-01 19:58:52 -07:00
Dennis Zhou
b5f2954d30 blkcg: revert blkcg cleanups series
This reverts a series committed earlier due to null pointer exception
bug report in [1]. It seems there are edge case interactions that I did
not consider and will need some time to understand what causes the
adverse interactions.

The original series can be found in [2] with a follow up series in [3].

[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

This reverts the following commits:
d459d853c2, b2c3fa5467, 101246ec02, b3b9f24f5f, e2b0989954,
f0fcb3ec89, c839e7a03f, bdc2491708, 74b7c02a9b, 5bf9a1f3b4,
a7b39b4e96, 07b05bcc32, 49f4c2dc2b, 27e6fa996c

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-01 19:59:53 -06:00
Linus Torvalds
b5b1de3537 virtio, vhost: fixes, tweaks
virtio balloon page hinting support
 vhost scsi control queue
 
 misc fixes.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJb222AAAoJECgfDbjSjVRpfHMH/1uh87pIj6Qbh9LRZm0dRVHs
 iEuIa0TECb+9AmqHRBliEcqWjihWnlQnSwE6d/ZTk9cH9zEEXsIu0B7nDzjGwJ5b
 m7wq679JqrtGpCdQhRH85sk8P3fs5ldatdbc4/0fsPwwkoXHcqTOpSZWyhtQHGc4
 EkzzxHXVhVXgRBjzLXVCMoOAQ+8QfMZFrKIgKuOB0I4OVughFrAGf0Hemm18f4CL
 5+YwsJQjleDkm+Udf+FTQS2oZ57DJsOLm2bwoKgqCkBaDfPlR92uWjgTa50WXggo
 RaokpFQkJKpz11yenezslzrVWUJApnWnUhfRd71t1ttvujrrcD9zUvWEVMURbuU=
 =2i/5
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio/vhost updates from Michael Tsirkin:
 "Fixes and tweaks:

   - virtio balloon page hinting support

   - vhost scsi control queue

   - misc fixes"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  MAINTAINERS: remove reference to bogus vsock file
  vhost/scsi: Use common handling code in request queue handler
  vhost/scsi: Extract common handling code from control queue handler
  vhost/scsi: Respond to control queue operations
  vhost/scsi: truncate T10 PI iov_iter to prot_bytes
  virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
  mm/page_poison: expose page_poisoning_enabled to kernel modules
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  kvm_config: add CONFIG_VIRTIO_MENU
2018-11-01 14:42:49 -07:00
Linus Torvalds
6444ccfd69 Merge branch 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu fixes from Dennis Zhou:
 "Two small things for v4.20.

  The first fixes a clang uninitialized variable warning for arm64 in
  the default path calls BUILD_BUG(). The second removes an unnecessary
  unlikely() in a WARN_ON() use"

* 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  arm64: percpu: Initialize ret in the default case
  mm: percpu: remove unnecessary unlikely()
2018-11-01 09:27:57 -07:00
Fengguang Wu
2ebe82288b mm/gup.c: fix __get_user_pages_fast() comment
mmu_gather_tlb() no longer exists.  Replace with mmu_table_batch().

Link: http://lkml.kernel.org/r/20180928053441.rpzwafzlsnp74mkl@wfg-t540p.sh.intel.com
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jiri Kosina <trivial@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:17 -07:00
Jan Kara
f2c57d91b0 mm: Fix warning in insert_pfn()
In DAX mode a write pagefault can race with write(2) in the following
way:

CPU0                            CPU1
                                write fault for mapped zero page (hole)
dax_iomap_rw()
  iomap_apply()
    xfs_file_iomap_begin()
      - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - invalidates radix tree entries in given range
                                dax_iomap_pte_fault()
                                  grab_mapping_entry()
                                    - no entry found, creates empty
                                  ...
                                  xfs_file_iomap_begin()
                                    - finds already allocated block
                                  ...
                                  vmf_insert_mixed_mkwrite()
                                    - WARNs and does nothing because there
                                      is still zero page mapped in PTE
        unmap_mapping_pages()

This race results in WARN_ON from insert_pfn() and is occasionally
triggered by fstest generic/344. Note that the race is otherwise
harmless as before write(2) on CPU0 is finished, we will invalidate page
tables properly and thus user of mmap will see modified data from
write(2) from that point on. So just restrict the warning only to the
case when the PFN in PTE is not zero page.

Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:17 -07:00
David Hildenbrand
381eab4a6e mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of 30467e0b3b ("mm, hotplug:
fix concurrent memory hot-add deadlock"), which tried to fix a possible
lock inversion reported and discussed in [1] due to the two locks
	a) device_lock()
	b) mem_hotplug_lock

While add_memory() first takes b), followed by a) during
bus_probe_device(), onlining of memory from user space first took a),
followed by b), exposing a possible deadlock.

In [1], and it was decided to not make use of device_hotplug_lock, but
rather to enforce a locking order.

The problems I spotted related to this:

1. Memory block device attributes: While .state first calls
   mem_hotplug_begin() and the calls device_online() - which takes
   device_lock() - .online does no longer call mem_hotplug_begin(), so
   effectively calls online_pages() without mem_hotplug_lock.

2. device_online() should be called under device_hotplug_lock, however
   onlining memory during add_memory() does not take care of that.

In addition, I think there is also something wrong about the locking in

3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
   without locks. This was introduced after 30467e0b3b. And skimming over
   the code, I assume it could need some more care in regards to locking
   (e.g. device_online() called without device_hotplug_lock. This will
   be addressed in the following patches.

Now that we hold the device_hotplug_lock when
- adding memory (e.g. via add_memory()/add_memory_resource())
- removing memory (e.g. via remove_memory())
- device_online()/device_offline()

We can move mem_hotplug_lock usage back into
online_pages()/offline_pages().

Why is mem_hotplug_lock still needed? Essentially to make
get_online_mems()/put_online_mems() be very fast (relying on
device_hotplug_lock would be very slow), and to serialize against
addition of memory that does not create memory block devices (hmm).

[1] http://driverdev.linuxdriverproject.org/pipermail/ driverdev-devel/
    2015-February/065324.html

This patch is partly based on a patch by Vitaly Kuznetsov.

Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:17 -07:00
David Hildenbrand
8df1d0e4a2 mm/memory_hotplug: make add_memory() take the device_hotplug_lock
add_memory() currently does not take the device_hotplug_lock, however
is aleady called under the lock from
	arch/powerpc/platforms/pseries/hotplug-memory.c
	drivers/acpi/acpi_memhotplug.c
to synchronize against CPU hot-remove and similar.

In general, we should hold the device_hotplug_lock when adding memory to
synchronize against online/offline request (e.g.  from user space) - which
already resulted in lock inversions due to device_lock() and
mem_hotplug_lock - see 30467e0b3b ("mm, hotplug: fix concurrent memory
hot-add deadlock").  add_memory()/add_memory_resource() will create memory
block devices, so this really feels like the right thing to do.

Holding the device_hotplug_lock makes sure that a memory block device
can really only be accessed (e.g. via .online/.state) from user space,
once the memory has been fully added to the system.

The lock is not held yet in
	drivers/xen/balloon.c
	arch/powerpc/platforms/powernv/memtrace.c
	drivers/s390/char/sclp_cmd.c
	drivers/hv/hv_balloon.c
So, let's either use the locked variants or take the lock.

Don't export add_memory_resource(), as it once was exported to be used by
XEN, which is never built as a module.  If somebody requires it, we also
have to export a locked variant (as device_hotplug_lock is never
exported).

Link: http://lkml.kernel.org/r/20180925091457.28651-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:17 -07:00
David Hildenbrand
d15e59260f mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.

Reading through the code and studying how mem_hotplug_lock is to be used,
I noticed that there are two places where we can end up calling
device_online()/device_offline() - online_pages()/offline_pages() without
the mem_hotplug_lock.  And there are other places where we call
device_online()/device_offline() without the device_hotplug_lock.

While e.g.
	echo "online" > /sys/devices/system/memory/memory9/state
is fine, e.g.
	echo 1 > /sys/devices/system/memory/memory9/online
Will not take the mem_hotplug_lock. However the device_lock() and
device_hotplug_lock.

E.g.  via memory_probe_store(), we can end up calling
add_memory()->online_pages() without the device_hotplug_lock.  So we can
have concurrent callers in online_pages().  We e.g.  touch in
online_pages() basically unprotected zone->present_pages then.

Looks like there is a longer history to that (see Patch #2 for details),
and fixing it to work the way it was intended is not really possible.  We
would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
sounds wrong.

Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
More details can be found in patch 3 and patch 6.

I propose the general rules (documentation added in patch 6):

1. add_memory/add_memory_resource() must only be called with
   device_hotplug_lock.
2. remove_memory() must only be called with device_hotplug_lock. This is
   already documented and holds for all callers.
3. device_online()/device_offline() must only be called with
   device_hotplug_lock. This is already documented and true for now in core
   code. Other callers (related to memory hotplug) have to be fixed up.
4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
   online_pages/offline_pages.

To me, this looks way cleaner than what we have right now (and easier to
verify).  And looking at the documentation of remove_memory, using
lock_device_hotplug also for add_memory() feels natural.

This patch (of 6):

remove_memory() is exported right now but requires the
device_hotplug_lock, which is not exported.  So let's provide a variant
that takes the lock and only export that one.

The lock is already held in
	arch/powerpc/platforms/pseries/hotplug-memory.c
	drivers/acpi/acpi_memhotplug.c
	arch/powerpc/platforms/powernv/memtrace.c

Apart from that, there are not other users in the tree.

Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:17 -07:00
Mike Rapoport
2f770806fd mm/memblock.c: warn if zero alignment was requested
After updating all memblock users to explicitly specify SMP_CACHE_BYTES
alignment rather than use 0, it is still possible that uncovered users may
sneak in.  Add a WARN_ON_ONCE for such cases.

[sfr@canb.auug.org.au: use dump_stack() instead of WARN_ON_ONCE for the alignment checks]
  Link: http://lkml.kernel.org/r/20181016131927.6ceba6ab@canb.auug.org.au
[akpm@linux-foundation.org: add apologetic comment]
Link: http://lkml.kernel.org/r/20181011060850.GA19822@rapoport-lnx
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:17 -07:00
Mike Rapoport
7e1c4e2792 memblock: stop using implicit alignment to SMP_CACHE_BYTES
When a memblock allocation APIs are called with align = 0, the alignment
is implicitly set to SMP_CACHE_BYTES.

Implicit alignment is done deep in the memblock allocator and it can
come as a surprise.  Not that such an alignment would be wrong even
when used incorrectly but it is better to be explicit for the sake of
clarity and the prinicple of the least surprise.

Replace all such uses of memblock APIs with the 'align' parameter
explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
in the memblock internal allocation functions.

For the case when memblock APIs are used via helper functions, e.g.  like
iommu_arena_new_node() in Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where
appropriate.

The direct memblock APIs users were updated using the semantic patch below:

@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)

[mhocko@suse.com: changelog update]
[akpm@linux-foundation.org: coding-style fixes]
[rppt@linux.ibm.com: fix missed uses of implicit alignment]
  Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paul Burton <paul.burton@mips.com>	[MIPS]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
57c8a661d9 mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
  Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
97ad1087ef memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants
Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of
identical MEMBLOCK definitions.

Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
bda49a8116 mm: remove nobootmem
Move a few remaining functions from nobootmem.c to memblock.c and remove
nobootmem

Link: http://lkml.kernel.org/r/1536927045-23536-28-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
7c2ee349cf memblock: rename __free_pages_bootmem to memblock_free_pages
The conversion is done using

sed -i 's@__free_pages_bootmem@memblock_free_pages@' \
    $(git grep -l __free_pages_bootmem)

Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
c6ffc5ca8f memblock: rename free_all_bootmem to memblock_free_all
The conversion is done using

sed -i 's@free_all_bootmem@memblock_free_all@' \
    $(git grep -l free_all_bootmem)

Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
53ab85ebfd memblock: replace free_bootmem_late with memblock_free_late
The free_bootmem_late and memblock_free_late do exactly the same thing:
they iterate over a range and give pages to the page allocator.

Replace calls to free_bootmem_late with calls to memblock_free_late and
remove the bootmem variant.

Link: http://lkml.kernel.org/r/1536927045-23536-25-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
2013288f72 memblock: replace free_bootmem{_node} with memblock_free
The free_bootmem and free_bootmem_node are merely wrappers for
memblock_free. Replace their usage with a call to memblock_free using the
following semantic patch:

@@
expression e1, e2, e3;
@@
(
- free_bootmem(e1, e2)
+ memblock_free(e1, e2)
|
- free_bootmem_node(e1, e2, e3)
+ memblock_free(e2, e3)
)

Link: http://lkml.kernel.org/r/1536927045-23536-24-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
6c7835f8d0 mm: nobootmem: remove bootmem allocation APIs
The bootmem compatibility APIs are not used and can be removed.

Link: http://lkml.kernel.org/r/1536927045-23536-23-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
3913c8f9f9 memblock: add align parameter to memblock_alloc_node()
With the align parameter memblock_alloc_node() can be used as drop in
replacement for alloc_bootmem_pages_node() and __alloc_bootmem_node(),
which is done in the following patches.

Link: http://lkml.kernel.org/r/1536927045-23536-15-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Mike Rapoport
eb31d559f1 memblock: remove _virt from APIs returning virtual address
The conversion is done using

sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
	$(git grep -l memblock_virt_alloc)

Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Mike Rapoport
9a8dd708d5 memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*
Make it explicit that the caller gets a physical address rather than a
virtual one.

This will also allow using meblock_alloc prefix for memblock allocations
returning virtual address, which is done in the following patches.

The conversion is done using the following semantic patch:

@@
expression e1, e2, e3;
@@
(
- memblock_alloc(e1, e2)
+ memblock_phys_alloc(e1, e2)
|
- memblock_alloc_nid(e1, e2, e3)
+ memblock_phys_alloc_nid(e1, e2, e3)
|
- memblock_alloc_try_nid(e1, e2, e3)
+ memblock_phys_alloc_try_nid(e1, e2, e3)
)

Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Mike Rapoport
b146ada221 mm: nobootmem: remove dead code
Several bootmem functions and macros are not used. Remove them.

Link: http://lkml.kernel.org/r/1536927045-23536-6-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Mike Rapoport
355c45affc mm: remove bootmem allocator implementation.
All architectures have been converted to use MEMBLOCK + NO_BOOTMEM. The
bootmem allocator implementation can be removed.

Link: http://lkml.kernel.org/r/1536927045-23536-5-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Mike Rapoport
aca52c3983 mm: remove CONFIG_HAVE_MEMBLOCK
All architecures use memblock for early memory management. There is no need
for the CONFIG_HAVE_MEMBLOCK configuration option.

[rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs]
  Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx
[rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal]
  Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx
[rppt@linux.vnet.ibm.com: remove stale #else and the code it protects]
  Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Mike Rapoport
b4a991ec58 mm: remove CONFIG_NO_BOOTMEM
All achitectures select NO_BOOTMEM which essentially becomes 'Y' for any
kernel configuration and therefore it can be removed.

[alexander.h.duyck@linux.intel.com: remove now defunct NO_BOOTMEM from depends list for deferred init]
  Link: http://lkml.kernel.org/r/20180925201814.3576.15105.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/1536927045-23536-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:14 -07:00
Dan Carpenter
4b408c74ee mm/gup_benchmark.c: prevent integer overflow in ioctl
The concern here is that "gup->size" is a u64 and "nr_pages" is unsigned
long.  On 32 bit systems we could trick the kernel into allocating fewer
pages than expected.

Link: http://lkml.kernel.org/r/20181025061546.hnhkv33diogf2uis@kili.mountain
Fixes: 64c349f4ae ("mm: add infrastructure for get_user_pages_fast() benchmarking")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Keith Busch <keith.busch@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:12 -07:00
Jérôme Glisse
ec131b2d7f mm/hmm: invalidate device page table at start of invalidation
Invalidate device page table at start of invalidation and invalidate in
progress CPU page table snapshooting at both start and end of any
invalidation.

This is helpful when device need to dirty page because the device page
table report the page as dirty.  Dirtying page must happen in the start
mmu notifier callback and not in the end one.

Link: http://lkml.kernel.org/r/20181019160442.18723-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:12 -07:00
Jérôme Glisse
44532d4c59 mm/hmm: use a structure for update callback parameters
Use a structure to gather all the parameters for the update callback.
This make it easier when adding new parameters by avoiding having to
update all callback function signature.

The hmm_update structure is always associated with a mmu_notifier
callbacks so we are not planing on grouping multiple updates together.
Nor do we care about page size for the range as range will over fully
cover the page being invalidated (this is a mmu_notifier property).

Link: http://lkml.kernel.org/r/20181019160442.18723-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:12 -07:00
Jérôme Glisse
d08faca018 mm/hmm: properly handle migration pmd
Before this patch migration pmd entry (!pmd_present()) would have been
treated as a bad entry (pmd_bad() returns true on migration pmd entry).
The outcome was that device driver would believe that the range covered by
the pmd was bad and would either SIGBUS or simply kill all the device's
threads (each device driver decide how to react when the device tries to
access poisonnous or invalid range of memory).

This patch explicitly handle the case of migration pmd entry which are non
present pmd entry and either wait for the migration to finish or report
empty range (when device is just trying to pre- fill a range of virtual
address and thus do not want to wait or trigger page fault).

Link: http://lkml.kernel.org/r/20181019160442.18723-5-jglisse@redhat.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:11 -07:00
Ralph Campbell
86a2d59841 mm/hmm: fix race between hmm_mirror_unregister() and mmu_notifier callback
In hmm_mirror_unregister(), mm->hmm is set to NULL and then
mmu_notifier_unregister_no_release() is called.  That creates a small
window where mmu_notifier can call mmu_notifier_ops with mm->hmm equal to
NULL.  Fix this by first unregistering mmu notifier callbacks and then
setting mm->hmm to NULL.

Similarly in hmm_register(), set mm->hmm before registering mmu_notifier
callbacks so callback functions always see mm->hmm set.

Link: http://lkml.kernel.org/r/20181019160442.18723-4-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:11 -07:00
Ralph Campbell
aab8d0520e mm/rmap: map_pte() was not handling private ZONE_DEVICE page properly
Private ZONE_DEVICE pages use a special pte entry and thus are not
present.  Properly handle this case in map_pte(), it is already handled in
check_pte(), the map_pte() part was lost in some rebase most probably.

Without this patch the slow migration path can not migrate back to any
private ZONE_DEVICE memory to regular memory.  This was found after stress
testing migration back to system memory.  This ultimatly can lead to the
CPU constantly page fault looping on the special swap entry.

Link: http://lkml.kernel.org/r/20181019160442.18723-3-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:11 -07:00
Jérôme Glisse
f813f21971 mm/hmm: fix utf8 ...
Patch series "HMM updates, improvements and fixes", v2

Few fixes that only affect HMM users.  Improve the synchronization call
back so that we match was other mmu_notifier listener do and add proper
support to the new blockable flags in the process.

For curious folks here are branches to leverage HMM in various existing
device drivers:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00

More to come (amd gpu, Mellanox, ...)

I expect more of the preparatory work for nouveau will be merge in 4.20
(like we have been doing since 4.16) and i will wait until this patchset
is upstream before pushing the patches that actualy make use of HMM (to
avoid complex tree inter-dependency).

This patch (of 6):

Somehow utf=8 must have been broken.

Link: http://lkml.kernel.org/r/20181019160442.18723-2-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:11 -07:00
Darrick J. Wong
eca3654e3c vfs: enable remap callers that can handle short operations
Plumb in a remap flag that enables the filesystem remap handler to
shorten remapping requests for callers that can handle it.  Now
copy_file_range can report partial success (in case we run up against
alignment problems, resource limits, etc.).

We also enable CAN_SHORTEN for fideduperange to maintain existing
userspace-visible behavior where xfs/btrfs shorten the dedupe range to
avoid stale post-eof data exposure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:42:10 +11:00
Darrick J. Wong
42ec3d4c02 vfs: make remap_file_range functions take and return bytes completed
Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on.  This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value.  For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:49 +11:00
Darrick J. Wong
3d28193e1d vfs: pass remap flags to generic_remap_checks
Pass the same remap flags to generic_remap_checks for consistency.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:34 +11:00
Darrick J. Wong
9fd91a90cb vfs: strengthen checking of file range inputs to generic_remap_checks
File range remapping, if allowed to run past the destination file's EOF,
is an optimization on a regular file write.  Regular file writes that
extend the file length are subject to various constraints which are not
checked by range cloning.

This is a correctness problem because we're never allowed to touch
ranges that the page cache can't support (s_maxbytes); we're not
supposed to deal with large offsets (MAX_NON_LFS) if O_LARGEFILE isn't
set; and we must obey resource limits (RLIMIT_FSIZE).

Therefore, add these checks to the new generic_remap_checks function so
that we curtail unexpected behavior.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:40:46 +11:00
Darrick J. Wong
1383a7ed67 vfs: check file ranges before cloning files
Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location.  This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:40:31 +11:00
Linus Torvalds
dad4f140ed Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax
Pull XArray conversion from Matthew Wilcox:
 "The XArray provides an improved interface to the radix tree data
  structure, providing locking as part of the API, specifying GFP flags
  at allocation time, eliminating preloading, less re-walking the tree,
  more efficient iterations and not exposing RCU-protected pointers to
  its users.

  This patch set

   1. Introduces the XArray implementation

   2. Converts the pagecache to use it

   3. Converts memremap to use it

  The page cache is the most complex and important user of the radix
  tree, so converting it was most important. Converting the memremap
  code removes the only other user of the multiorder code, which allows
  us to remove the radix tree code that supported it.

  I have 40+ followup patches to convert many other users of the radix
  tree over to the XArray, but I'd like to get this part in first. The
  other conversions haven't been in linux-next and aren't suitable for
  applying yet, but you can see them in the xarray-conv branch if you're
  interested"

* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
  radix tree: Remove multiorder support
  radix tree test: Convert multiorder tests to XArray
  radix tree tests: Convert item_delete_rcu to XArray
  radix tree tests: Convert item_kill_tree to XArray
  radix tree tests: Move item_insert_order
  radix tree test suite: Remove multiorder benchmarking
  radix tree test suite: Remove __item_insert
  memremap: Convert to XArray
  xarray: Add range store functionality
  xarray: Move multiorder_check to in-kernel tests
  xarray: Move multiorder_shrink to kernel tests
  xarray: Move multiorder account test in-kernel
  radix tree test suite: Convert iteration test to XArray
  radix tree test suite: Convert tag_tagged_items to XArray
  radix tree: Remove radix_tree_clear_tags
  radix tree: Remove radix_tree_maybe_preload_order
  radix tree: Remove split/join code
  radix tree: Remove radix_tree_update_node_t
  page cache: Finish XArray conversion
  dax: Convert page fault handlers to XArray
  ...
2018-10-28 11:35:40 -07:00
Mike Kravetz
22146c3ce9 hugetlbfs: dirty pages as they are added to pagecache
Some test systems were experiencing negative huge page reserve counts and
incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
removing clean pages from hugetlbfs file pagecaches.  When non-hugetlbfs
explicit code removes the pages, the appropriate accounting is not
performed.

This can be recreated as follows:
 fallocate -l 2M /dev/hugepages/foo
 echo 1 > /proc/sys/vm/drop_caches
 fallocate -l 2M /dev/hugepages/foo
 grep -i huge /proc/meminfo
   AnonHugePages:         0 kB
   ShmemHugePages:        0 kB
   HugePages_Total:    2048
   HugePages_Free:     2047
   HugePages_Rsvd:    18446744073709551615
   HugePages_Surp:        0
   Hugepagesize:       2048 kB
   Hugetlb:         4194304 kB
 ls -lsh /dev/hugepages/foo
   4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo

To address this issue, dirty pages as they are added to pagecache.  This
can easily be reproduced with fallocate as shown above.  Read faulted
pages will eventually end up being marked dirty.  But there is a window
where they are clean and could be impacted by code such as drop_caches.
So, just dirty them all as they are added to the pagecache.

Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
Fixes: 6bda666a03 ("hugepages: fold find_or_alloc_pages into huge_no_page()")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Mihcla Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:16 -07:00
Omar Sandoval
aa8aa8a331 mm: export add_swap_extent()
Btrfs currently does not support swap files because swap's use of bmap
does not work with copy-on-write and multiple devices.  See 35054394c4
("Btrfs: stop providing a bmap operation to avoid swapfile corruptions").

However, the swap code has a mechanism for the filesystem to manually add
swap extents using add_swap_extent() from the ->swap_activate() aop.
iomap has done this since 67482129cd ("iomap: add a swapfile activation
function").  Btrfs will do the same in a later patch, so export
add_swap_extent().

Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Sterba <dsterba@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Omar Sandoval
bc4ae27d81 mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
through the filesystem, and to make swapoff() call ->swap_deactivate().
For Btrfs, we want the latter but not the former, so split this flag into
two.  This makes us always call ->swap_deactivate() if ->swap_activate()
succeeded, not just if it didn't add any swap extents itself.

This also resolves the issue of the very misleading name of SWP_FILE,
which is only used for swap files over NFS.

Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Andrea Arcangeli
7eef5f97c1 mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
There should be no cache left by the time we overwrite the old transhuge
pmd with the new one.  It's already too late to flush through the virtual
address because we already copied the page data to the new physical
address.

So flush the cache before the data copy.

Also delete the "end" variable to shutoff a "unused variable" warning on
x86 where flush_cache_range() is a noop.

Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Andrea Arcangeli
7066f0f933 mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
right away.  do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page().  By the time do_huge_pmd_numa_page()
runs some CPU could still access the page through the TLB.

change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start().  So there's no need of
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() too, because by the time
migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
invalidated in the secondary MMUs.  It has to or if a secondary MMU can
still write to the page, the migrate_page_copy() would lose data.

However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page or the below can happen for MMU notifier users sharing the
primary MMU pagetables and only implementing ->invalidate_range:

CPU0		CPU1		GPU sharing linux pagetables using
                                only ->invalidate_range
-----------	------------	---------
				GPU secondary MMU writes to the page
				mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
		do_huge_pmd_numa_page()
		CPU TLB flush globally (1)
		CPU cannot write to page
		migrate_misplaced_transhuge_page()
				GPU writes to the page...
		migrate_page_copy()
				...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_stop() noop
->invalidate_range()
				GPU secondary MMU is invalidated
				and cannot write to the page anymore
				(too late)

Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
too late, we also need a mmu_notifier_invalidate_range() before calling
migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
(3) also arrives too late.

This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing the
TLB and without first calling mmu_notifier_invalidate_range().

Even converting the removed mmu_notifier_invalidate_range_only_end() into
a mmu_notifier_invalidate_range_end() would not have been enough to fix
this, because it run after migrate_page_copy().

After the hugepage data copy is done migrate_misplaced_transhuge_page()
can proceed and call set_pmd_at without having to flush the TLB nor any
secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
flush, has to happen before the migrate_page_copy() is called or it would
be a bug in the first place (and it was for drivers using
->invalidate_range()).

KVM is unaffected because it doesn't implement ->invalidate_range().

The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
uses the generic migrate_pages which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
and all mmu notifiers there before copying the page.

Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Andrea Arcangeli
d7c3393413 mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
Patch series "migrate_misplaced_transhuge_page race conditions".

Aaron found a new instance of the THP MADV_DONTNEED race against
pmdp_clear_flush* variants, that was apparently left unfixed.

While looking into the race found by Aaron, I may have found two more
issues in migrate_misplaced_transhuge_page.

These race conditions would not cause kernel instability, but they'd
corrupt userland data or leave data non zero after MADV_DONTNEED.

I did only minor testing, and I don't expect to be able to reproduce this
(especially the lack of ->invalidate_range before migrate_page_copy,
requires the latest iommu hardware or infiniband to reproduce).  The last
patch is noop for x86 and it needs further review from maintainers of
archs that implement flush_cache_range() (not in CC yet).

To avoid confusion, it's not the first patch that introduces the bug fixed
in the second patch, even before removing the
pmdp_huge_clear_flush_notify, that _notify suffix was called after
migrate_page_copy already run.

This patch (of 3):

This is a corollary of ced108037c ("thp: fix MADV_DONTNEED vs.  numa
balancing race"), 58ceeb6bec ("thp: fix MADV_DONTNEED vs.  MADV_FREE
race") and 5b7abeae3a ("thp: fix MADV_DONTNEED vs clear soft dirty
race).

When the above three fixes where posted Dave asked
https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
but apparently this was missed.

The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
in a54a407fbf ("mm: Close races between THP migration and PMD numa
clearing").

The important part of such commit is only the part where the page lock is
not released until the first do_huge_pmd_numa_page() finished disarming
the pagenuma/protnone.

The addition of pmdp_clear_flush() wasn't beneficial to such commit and
there's no commentary about such an addition either.

I guess the pmdp_clear_flush() in such commit was added just in case for
safety, but it ended up introducing the MADV_DONTNEED race condition found
by Aaron.

At that point in time nobody thought of such kind of MADV_DONTNEED race
conditions yet (they were fixed later) so the code may have looked more
robust by adding the pmdp_clear_flush().

This specific race condition won't destabilize the kernel, but it can
confuse userland because after MADV_DONTNEED the memory won't be zeroed
out.

This also optimizes the code and removes a superfluous TLB flush.

[akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Clark Williams
026d1eaf5e mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
The static lock quarantine_lock is used in quarantine.c to protect the
quarantine queue datastructures.  It is taken inside quarantine queue
manipulation routines (quarantine_put(), quarantine_reduce() and
quarantine_remove_cache()), with IRQs disabled.  This is not a problem on
a stock kernel but is problematic on an RT kernel where spin locks are
sleeping spinlocks, which can sleep and can not be acquired with disabled
interrupts.

Convert the quarantine_lock to a raw spinlock_t.  The usage of
quarantine_lock is confined to quarantine.c and the work performed while
the lock is held is used for debug purpose.

[bigeasy@linutronix.de: slightly altered the commit message]
Link: http://lkml.kernel.org/r/20181010214945.5owshc3mlrh74z4b@linutronix.de
Signed-off-by: Clark Williams <williams@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Keith Busch
df06b37ffe mm/gup: cache dev_pagemap while pinning pages
Getting pages from ZONE_DEVICE memory needs to check the backing device's
live-ness, which is tracked in the device's dev_pagemap metadata.  This
metadata is stored in a radix tree and looking it up adds measurable
software overhead.

This patch avoids repeating this relatively costly operation when
dev_pagemap is used by caching the last dev_pagemap while getting user
pages.  The gup_benchmark kernel self test reports this reduces time to
get user pages to as low as 1/3 of the previous time.

Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Pavel Tatashin
ec393a0f01 mm: return zero_resv_unavail optimization
When checking for valid pfns in zero_resv_unavail(), it is not necessary
to verify that pfns within pageblock_nr_pages ranges are valid, only the
first one needs to be checked.  This is because memory for pages are
allocated in contiguous chunks that contain pageblock_nr_pages struct
pages.

Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Naoya Horiguchi
907ec5fca3 mm: zero remaining unavailable struct pages
Patch series "mm: Fix for movable_node boot option", v3.

This patch series contains a fix for the movable_node boot option issue
which was introduced by commit 124049decb ("x86/e820: put !E820_TYPE_RAM
regions into memblock.reserved").

The commit breaks the option because it changed the memory gap range to
reserved memblock.  So, the node is marked as Normal zone even if the SRAT
has Hot pluggable affinity.

First and second patch fix the original issue which the commit tried to
fix, then revert the commit.

This patch (of 3):

There is a kernel panic that is triggered when reading /proc/kpageflags on
the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':

  BUG: unable to handle kernel paging request at fffffffffffffffe
  PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
  Oops: 0000 [#1] SMP PTI
  CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
  RIP: 0010:stable_page_flags+0x27/0x3c0
  Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
  RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
  RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
  RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
  R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
  R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
  FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
  Call Trace:
   kpageflags_read+0xc7/0x120
   proc_reg_read+0x3c/0x60
   __vfs_read+0x36/0x170
   vfs_read+0x89/0x130
   ksys_pread64+0x71/0x90
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7efc42e75e23
  Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

According to kernel bisection, this problem became visible due to commit
f7f99100d8 which changes how struct pages are initialized.

Memblock layout affects the pfn ranges covered by node/zone.  Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
default (no memmap= given) memblock layout is like below:

  MEMBLOCK configuration:
   memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x4
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
   memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:

  MEMBLOCK configuration:
   memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
   memory.cnt  = 0x3
   memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
   memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
   memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
   ...

This causes shrinking node 0's pfn range because it is calculated by the
address range of memblock.memory.  So some of struct pages in the gap
range are left uninitialized.

We have a function zero_resv_unavail() which does zeroing the struct pages
outside memblock.memory, but currently it covers only the reserved
unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
patch extends it to cover all unavailable range, which fixes the reported
issue.

Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
Fixes: f7f99100d8 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Tested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Keith Busch
714a3a1eba mm/gup_benchmark.c: add additional pinning methods
Provide new gup benchmark ioctl commands to run different user page
pinning methods, get_user_pages_longterm() and get_user_pages(), in
addition to the existing get_user_pages_fast().

Link: http://lkml.kernel.org/r/20181010195605.10689-2-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Keith Busch
26db3d09d9 mm/gup_benchmark.c: time put_page()
We'd like to measure time to unpin user pages, so this adds a second
benchmark timer on put_page, separate from get_page.

Adding the field breaks this ioctl ABI, but should be okay since this an
in-tree kernel selftest.

[akpm@linux-foundation.org: add expansion to struct gup_benchmark for future use]
Link: http://lkml.kernel.org/r/20181010195605.10689-1-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:15 -07:00
Roman Gushchin
7a1adfddaf mm: don't raise MEMCG_OOM event due to failed high-order allocation
It was reported that on some of our machines containers were restarted
with OOM symptoms without an obvious reason.  Despite there were almost no
memory pressure and plenty of page cache, MEMCG_OOM event was raised
occasionally, causing the container management software to think, that OOM
has happened.  However, no tasks have been killed.

The following investigation showed that the problem is caused by a failing
attempt to charge a high-order page.  In such case, the OOM killer is
never invoked.  As shown below, it can happen under conditions, which are
very far from a real OOM: e.g.  there is plenty of clean page cache and no
memory pressure.

There is no sense in raising an OOM event in this case, as it might
confuse a user and lead to wrong and excessive actions (e.g.  restart the
workload, as in my case).

Let's look at the charging path in try_charge().  If the memory usage is
about memory.max, which is absolutely natural for most memory cgroups, we
try to reclaim some pages.  Even if we were able to reclaim enough memory
for the allocation, the following check can fail due to a race with
another concurrent allocation:

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
        goto retry;

For regular pages the following condition will save us from triggering
the OOM:

   if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
       goto retry;

But for high-order allocation this condition will intentionally fail.  The
reason behind is that we'll likely fall to regular pages anyway, so it's
ok and even preferred to return ENOMEM.

In this case the idea of raising MEMCG_OOM looks dubious.

Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
order check, so that the event won't be raised for high order allocations.
This change doesn't affect regular pages allocation and charging.

Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:14 -07:00
Dave Chinner
64081362e8 mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process application
doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback		Process 1		Process 2

xfs_vm_writepages
  write_cache_pages
    writeback_index = 2
    cycled = 0
    ....
    find page 2 dirty
    lock Page 2
    ->writepage
      page 2 writeback
      page 2 clean
      page 2 added to bio
    no more pages
			write()
			locks page 1
			dirties page 1
			locks page 2
			dirties page 1
			fsync()
			....
			xfs_vm_writepages
			write_cache_pages
			  start index 0
			  find page 1 towrite
			  lock Page 1
			  ->writepage
			    page 1 writeback
			    page 1 clean
			    page 1 added to bio
			  find page 2 towrite
			  lock Page 2
			  page 2 is writeback
			  <blocks>
						write()
						locks page 1
						dirties page 1
						fsync()
						....
						xfs_vm_writepages
						write_cache_pages
						  start index 0

    !done && !cycled
      sets index to 0, restarts lookup
    find page 1 dirty
						  find page 1 towrite
						  lock Page 1
						  page 1 is writeback
						  <blocks>

    lock Page 1
    <blocks>

DEADLOCK because:

	- process 1 needs page 2 writeback to complete to make
	  enough progress to issue IO pending for page 1
	- writeback needs page 1 writeback to complete so process 2
	  can progress and unlock the page it is blocked on, then it
	  can issue the IO pending for page 2
	- process 2 can't make progress until process 1 issues IO
	  for page 1

The underlying cause of the problem here is that range_cyclic writeback is
processing pages in descending index order as we hold higher index pages
in a structure controlled from above write_cache_pages().  The
write_cache_pages() caller needs to be able to submit these pages for IO
before write_cache_pages restarts writeback at mapping index 0 to avoid
wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private
context held across write_cache_pages() - filesystems using this
infrastructure always submit pages in ->writepage immediately and so there
is no problem with range_cyclic going back to mapping index 0.

However:
	mpage_writepages() has a private bio context,
	exofs_writepages() has page_collect
	fuse_writepages() has fuse_fill_wb_data
	nfs_writepages() has nfs_pageio_descriptor
	xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback
in their private structures until write_cache_pages() returns, and hence
they are all susceptible to this deadlock.

Also worth noting is that ext4 has it's own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
at the code long enough to understand that it has a similar retry loop for
range_cyclic writeback reaching the end of the file and then promptly ran
away before my eyes bled too much.  I'll leave it for the ext4 developers
to determine if their code is actually has this deadlock and how to fix it
if it has.

There's a few ways I can see avoid this deadlock.  There's probably more,
but these are the first I've though of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from
writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to
indicate range cyclic has hit EOF. writepages implementations can
then flush the current context and call wpc again to continue. i.e.
lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it
skips pages it can't lock without blocking. It will already do this
for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
blocking as per pages under writeback.

I don't think #1 is an option - range_cyclic prevents frequently
dirtied lower file offset from starving background writeback of
rarely touched higher file offsets.

#2 is simple, and I don't think it will have any impact on
performance as going back to the start of the file implies an
immediate seek. We'll have exactly the same number of seeks if we
switch writeback to another inode, and then come back to this one
later and restart from index 0.

#2a is pretty much "status quo without the deadlock". Moving the
retry loop up into the wcp caller means we can issue IO on the
pending pages before calling wcp again, and so avoid locking or
waiting on pages in the wrong order. I'm not convinced we need to do
this given that we get the same thing from #2 on the next writeback
call from the writeback infrastructure.

#3 is really just a band-aid - it doesn't fix the access/wait
inversion problem, just prevents it from becoming a deadlock
situation. I'd prefer we fix the inversion, not sweep it under the
carpet like this.

#3a is really an optimisation that just so happens to include the
band-aid fix of #3.

So it seems that the simplest way to fix this issue is to implement
solution #2

Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:14 -07:00
Pavel Tatashin
a9a9e77fbf mm: move mirrored memory specific code outside of memmap_init_zone
memmap_init_zone, is getting complex, because it is called from different
contexts: hotplug, and during boot, and also because it must handle some
architecture quirks.  One of them is mirrored memory.

Move the code that decides whether to skip mirrored memory outside of
memmap_init_zone, into a separate function.

[pasha.tatashin@oracle.com: uninline overlap_memmap_init()]
  Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:10 -07:00
Pavel Tatashin
d3035be4ce mm: calculate deferred pages after skipping mirrored memory
update_defer_init() should be called only when struct page is about to be
initialized. Because it counts number of initialized struct pages, but
there we may skip struct pages if there is some mirrored memory.

So move, update_defer_init() after checking for mirrored memory.

Also, rename update_defer_init() to defer_init() and reverse the return
boolean to emphasize that this is a boolean function, that tells that the
reset of memmap initialization should be deferred.

Make this function self-contained: do not pass number of already
initialized pages in this zone by using static counters.

I found this bug by reading the code.  The effect is that fewer than
expected struct pages are initialized early in boot, and it is possible
that in some corner cases we may fail to boot when mirrored pages are
used.  The deferred on demand code should somewhat mitigate this.  But
this still brings some inconsistencies compared to when booting without
mirrored pages, so it is better to fix.

[pasha.tatashin@oracle.com: add comment about defer_init's lack of locking]
  Link: http://lkml.kernel.org/r/20180726193509.3326-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: make defer_init non-inline, __meminit]
Link: http://lkml.kernel.org/r/20180724235520.10200-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Pavel Tatashin
dfb3ccd00a mm: make memmap_init a proper function
memmap_init is sometimes a macro sometimes a function based on
__HAVE_ARCH_MEMMAP_INIT.  It is only a function on ia64.  Make memmap_init
a weak function instead, and let ia64 redefine it.

Link: http://lkml.kernel.org/r/20180724235520.10200-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Kirill Tkhai
1c2d479a11 mm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t type
This will allow to use generic refcount_t interfaces to check counters
overflow instead of currently existing VM_BUG_ON().  The only difference
after the patch is VM_BUG_ON() may cause BUG(), while refcount_t fires
with WARN().  But this seems not to be significant here, since such the
problems are usually caught by syzbot with panic-on-warn enabled.

Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
David Rientjes
4a222127f3 mm/page_alloc.c: initialize num_movable in move_freepages()
If move_freepages_block() returns 0 because !zone_spans_pfn(),
*num_movable can hold the value from the stack because it does not get
initialized in move_freepages().

Move the initialization to move_freepages_block() to guarantee the value
actually makes sense.

This currently doesn't affect its only caller where num_movable != NULL,
so no bug fix, but just more robust.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1810051355490.212229@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Gustavo A. R. Silva
61855f021c mm/zsmalloc.c: fix fall-through annotation
Replace "fallthru" with a proper "fall through" annotation.

This fix is part of the ongoing efforts to enabling
-Wimplicit-fallthrough

Link: http://lkml.kernel.org/r/20181003105114.GA24423@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Jann Horn
f0ecf25a09 mm/vmstat.c: assert that vmstat_text is in sync with stat_items_size
Having two gigantic arrays that must manually be kept in sync, including
ifdefs, isn't exactly robust.  To make it easier to catch such issues in
the future, add a BUILD_BUG_ON().

Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Aneesh Kumar K.V
ff09d7ec97 mm/memory.c: recheck page table entry with page table lock held
We clear the pte temporarily during read/modify/write update of the pte.
If we take a page fault while the pte is cleared, the application can get
SIGBUS.  One such case is with remap_pfn_range without a backing
vm_ops->fault callback.  do_fault will return SIGBUS in that case.

cpu 0		 				cpu1
mprotect()
ptep_modify_prot_start()/pte cleared.
.
.						page fault.
.
.
prep_modify_prot_commit()

Fix this by taking page table lock and rechecking for pte_none.

[aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run]
  Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com
Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Yang Shi
9bc8039e71 mm: brk: downgrade mmap_sem to read when shrinking
brk might be used to shrink memory mapping too other than munmap().  So,
it may hold write mmap_sem for long time when shrinking large mapping, as
what commit ("mm: mmap: zap pages with read mmap_sem in munmap")
described.

The brk() will not manipulate vmas anymore after __do_munmap() call for
the mapping shrink use case.  But, it may set mm->brk after __do_munmap(),
which needs hold write mmap_sem.

However, a simple trick can workaround this by setting mm->brk before
__do_munmap().  Then restore the original value if __do_munmap() fails.
With this trick, it is safe to downgrade to read mmap_sem.

So, the same optimization, which downgrades mmap_sem to read for zapping
pages, is also feasible and reasonable to this case.

The period of holding exclusive mmap_sem for shrinking large mapping would
be reduced significantly with this optimization.

[akpm@linux-foundation.org: tweak comment]
[yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
  Link: http://lkml.kernel.org/r/1538687672-17795-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1538067582-60038-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Yang Shi
85a06835f6 mm: mremap: downgrade mmap_sem to read when shrinking
Other than munmap, mremap might be used to shrink memory mapping too.
So, it may hold write mmap_sem for long time when shrinking large
mapping, as what commit ("mm: mmap: zap pages with read mmap_sem in
munmap") described.

The mremap() will not manipulate vmas anymore after __do_munmap() call for
the mapping shrink use case, so it is safe to downgrade to read mmap_sem.

So, the same optimization, which downgrades mmap_sem to read for zapping
pages, is also feasible and reasonable to this case.

The period of holding exclusive mmap_sem for shrinking large mapping
would be reduced significantly with this optimization.

MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adopt this
optimization since they need manipulate vmas after do_munmap(),
downgrading mmap_sem may create race window.

Simple mapping shrink is the low hanging fruit, and it may cover the
most cases of unmap with munmap together.

[akpm@linux-foundation.org: tweak comment]
[yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
  Link: http://lkml.kernel.org/r/1538687672-17795-2-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1538067582-60038-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Souptick Joarder
3c0513243a mm/filemap.c: use vmf_error()
These codes can be replaced with new inline vmf_error().

Link: http://lkml.kernel.org/r/20180927171411.GA23331@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Wei Yang
d4faa40259 mm: remove unnecessary local variable addr in __get_user_pages_fast()
The local variable `addr' in __get_user_pages_fast() is just a shadow of
`start'.  Since `start' never changes after assignment to `addr', it is
fine to replace `start' with it.

Also the meaning of [start, end] is more obvious than [addr, end] when
passed to gup_pgd_range().

Link: http://lkml.kernel.org/r/20180925021448.20265-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:34 -07:00
Alexander Duyck
966cf44f63 mm: defer ZONE_DEVICE page initialization to the point where we init pgmap
The ZONE_DEVICE pages were being initialized in two locations.  One was
with the memory_hotplug lock held and another was outside of that lock.
The problem with this is that it was nearly doubling the memory
initialization time.  Instead of doing this twice, once while holding a
global lock and once without, I am opting to defer the initialization to
the one outside of the lock.  This allows us to avoid serializing the
overhead for memory init and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmmem_pages_create were initializing only the pgmap field the same
way.  One wasn't initializing hmm_data, and the other was initializing it
to a poison value.  Since this is something that is exposed to the driver
in the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

[alexander.h.duyck@linux.intel.com: fix reference count for pgmap in devm_memremap_pages]
  Link: http://lkml.kernel.org/r/20181008233404.1909.37302.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/20180925202053.3576.66039.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:34 -07:00
Alexander Duyck
d483da5bc7 mm: create non-atomic version of SetPageReserved for init use
It doesn't make much sense to use the atomic SetPageReserved at init time
when we are using memset to clear the memory and manipulating the page
flags via simple "&=" and "|=" operations in __init_single_page.

This patch adds a non-atomic version __SetPageReserved that can be used
during page init and shows about a 10% improvement in initialization times
on the systems I have available for testing.  On those systems I saw
initialization times drop from around 35 seconds to around 32 seconds to
initialize a 3TB block of persistent memory.  I believe the main advantage
of this is that it allows for more compiler optimization as the __set_bit
operation can be reordered whereas the atomic version cannot.

I tried adding a bit of documentation based on f1dd2cd13c ("mm,
memory_hotplug: do not associate hotadded memory to zones until online").

Ideally the reserved flag should be set earlier since there is a brief
window where the page is initialization via __init_single_page and we have
not set the PG_Reserved flag.  I'm leaving that for a future patch set as
that will require a more significant refactor.

Link: http://lkml.kernel.org/r/20180925202018.3576.11607.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:34 -07:00
Alexander Duyck
f682a97a00 mm: provide kernel parameter to allow disabling page init poisoning
Patch series "Address issues slowing persistent memory initialization", v5.

The main thing this patch set achieves is that it allows us to initialize
each node worth of persistent memory independently.  As a result we reduce
page init time by about 2 minutes because instead of taking 30 to 40
seconds per node and going through each node one at a time, we process all
4 nodes in parallel in the case of a 12TB persistent memory setup spread
evenly over 4 nodes.

This patch (of 3):

On systems with a large amount of memory it can take a significant amount
of time to initialize all of the page structs with the PAGE_POISON_PATTERN
value.  I have seen it take over 2 minutes to initialize a system with
over 12TB of RAM.

In order to work around the issue I had to disable CONFIG_DEBUG_VM and
then the boot time returned to something much more reasonable as the
arch_add_memory call completed in milliseconds versus seconds.  However in
doing that I had to disable all of the other VM debugging on the system.

In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
on a system that has a large amount of memory I have added a new kernel
parameter named "vm_debug" that can be set to "-" in order to disable it.

Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:34 -07:00
Shakeel Butt
85cfb24506 memcg: remove memcg_kmem_skip_account
The flag memcg_kmem_skip_account was added during the era of opt-out kmem
accounting.  There is no need for such flag in the opt-in world as there
aren't any __GFP_ACCOUNT allocations within memcg_create_cache_enqueue().

Link: http://lkml.kernel.org/r/20180919004501.178023-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Oscar Salvador
86b27beae5 mm/memory_hotplug.c: clean up node_states_check_changes_offline()
This patch, as the previous one, gets rid of the wrong if statements.
While at it, I realized that the comments are sometimes very confusing,
to say the least, and wrong.
For example:

___
zone_last = ZONE_MOVABLE;

/*
 * check whether node_states[N_HIGH_MEMORY] will be changed
 * If we try to offline the last present @nr_pages from the node,
 * we can determind we will need to clear the node from
 * node_states[N_HIGH_MEMORY].
 */

for (; zt <= zone_last; zt++)
        present_pages += pgdat->node_zones[zt].present_pages;
if (nr_pages >= present_pages)
        arg->status_change_nid = zone_to_nid(zone);
else
        arg->status_change_nid = -1;
___

In case the node gets empry, it must be removed from N_MEMORY.  We already
check N_HIGH_MEMORY a bit above within the CONFIG_HIGHMEM ifdef code.  Not
to say that status_change_nid is for N_MEMORY, and not for N_HIGH_MEMORY.

So I re-wrote some of the comments to what I think is better.

[osalvador@suse.de: address feedback from Pavel]
  Link: http://lkml.kernel.org/r/20180921132634.10103-5-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180919100819.25518-6-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Oscar Salvador
8efe33f40f mm/memory_hotplug.c: simplify node_states_check_changes_online
While looking at node_states_check_changes_online, I stumbled upon some
confusing things.

Right after entering the function, we find this:

if (N_MEMORY == N_NORMAL_MEMORY)
        zone_last = ZONE_MOVABLE;

This is wrong.
N_MEMORY cannot really be equal to N_NORMAL_MEMORY.
My guess is that this wanted to be something like:

if (N_NORMAL_MEMORY == N_HIGH_MEMORY)

to check if we have CONFIG_HIGHMEM.

Later on, in the CONFIG_HIGHMEM block, we have:

if (N_MEMORY == N_HIGH_MEMORY)
        zone_last = ZONE_MOVABLE;

Again, this is wrong, and will never be evaluated to true.

Besides removing these wrong if statements, I simplified the function a
bit.

[osalvador@suse.de: address feedback from Pavel]
  Link: http://lkml.kernel.org/r/20180921132634.10103-4-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180919100819.25518-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Oscar Salvador
cf01f6f5e3 mm/memory_hotplug.c: tidy up node_states_clear_node()
node_states_clear has the following if statements:

if ((N_MEMORY != N_NORMAL_MEMORY) &&
    (arg->status_change_nid_high >= 0))
        ...

if ((N_MEMORY != N_HIGH_MEMORY) &&
    (arg->status_change_nid >= 0))
        ...

N_MEMORY can never be equal to neither N_NORMAL_MEMORY nor
N_HIGH_MEMORY.

Similar problem was found in [1].
Since this is wrong, let us get rid of it.

[1] https://patchwork.kernel.org/patch/10579155/

Link: http://lkml.kernel.org/r/20180919100819.25518-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <yasu.isimatu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Oscar Salvador
83d83612d7 mm/memory_hotplug.c: spare unnecessary calls to node_set_state
In node_states_check_changes_online, we check if the node will have to be
set for any of the N_*_MEMORY states after the pages have been onlined.

Later on, we perform the activation in node_states_set_node.  Currently,
in node_states_set_node we set the node to N_MEMORY unconditionally.

This means that we call node_set_state for N_MEMORY every time pages go
online, but we only need to do it if the node has not yet been set for
N_MEMORY.

Fix this by checking status_change_nid.

Link: http://lkml.kernel.org/r/20180919100819.25518-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
haiqing.shq
3cb7b121ff mm/filemap.c: Use existing variable
Use the variable write_len instead of ov_iter_count(from).

Link: http://lkml.kernel.org/r/1537375855-2088-1-git-send-email-leviathan0992@gmail.com
Signed-off-by: haiqing.shq <leviathan0992@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Yang Shi
cb4922496a mm: unmap VM_PFNMAP mappings with optimized path
When unmapping VM_PFNMAP mappings, vm flags need to be updated.  Since the
vmas have been detached, so it sounds safe to update vm flags with read
mmap_sem.

Link: http://lkml.kernel.org/r/1537376621-51150-4-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Yang Shi
b4cefb3605 mm: unmap VM_HUGETLB mappings with optimized path
When unmapping VM_HUGETLB mappings, vm flags need to be updated.  Since
the vmas have been detached, so it sounds safe to update vm flags with
read mmap_sem.

Link: http://lkml.kernel.org/r/1537376621-51150-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Yang Shi
dd2283f260 mm: mmap: zap pages with read mmap_sem in munmap
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.

Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.

History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).

V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786).  Then I received a lot great
feedback and suggestions.

Then this topic was discussed on LSFMM summit 2018.  In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach.  Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)

Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.

But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
  * The unexpected state from PF if it wins the race in the middle of munmap.
    It may return zero page, instead of the content or SIGSEGV.
  * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
    is a showstopper from akpm

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e.  page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.

If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings.  They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.

Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.  So, uprobe unmap will not be handled by the regular
path.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.

For the time being, just do this in munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle.  But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization.  So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them.  mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.

This patch (of 3):

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

It is because munmap holds mmap_sem exclusively from very beginning to all
the way down to the end, and doesn't release it in the middle.  When
unmapping large mapping, it may take long time (take ~18 seconds to unmap
320GB mapping with every single page mapped on an idle machine).

Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does.  Then re-acquire write mmap_sem to
cleanup vmas.

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e.  page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.

If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings.  They will be handled by without downgrading mmap_sem in this
patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.

Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.

For the time being, just do this in munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle.  But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization.  So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them.  mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.

With the patches, exclusive mmap_sem hold time when munmap a 80GB address
space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
from second.

munmap_test-15002 [008]   594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008]   594.380146: funcgraph_entry:      !2485684 us
|    unmap_region();
munmap_test-15002 [008]   596.865836: funcgraph_exit:       !2485692 us
|  }

Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.

[1] https://lwn.net/Articles/753269/

Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Andrey Ryabinin
a8dda165ec vfree: add debug might_sleep()
Add might_sleep() call to vfree() to catch potential sleep-in-atomic bugs
earlier.

[aryabinin@virtuozzo.com: drop might_sleep_if() from kvfree()]
  Link: http://lkml.kernel.org/r/7e19e4df-b1a6-29bd-9ae7-0266d50bef1d@virtuozzo.com
Link: http://lkml.kernel.org/r/20180914130512.10394-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Andrey Ryabinin
3ca4ea3a7a mm/vmalloc.c: improve vfree() kerneldoc
vfree() might sleep if called not in interrupt context.  Explain that in
the comment.

Link: http://lkml.kernel.org/r/20180914130512.10394-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Andrey Ryabinin
52414d3302 kvfree(): fix misleading comment
vfree() might sleep if called not in interrupt context.  So does kvfree()
too.  Fix misleading kvfree()'s comment about allowed context.

Link: http://lkml.kernel.org/r/20180914130512.10394-1-aryabinin@virtuozzo.com
Fixes: 04b8e94607 ("mm/util.c: improve kvfree() kerneldoc")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
zhong jiang
dedf2c73b8 mm/mempolicy.c: use match_string() helper to simplify the code
match_string() returns the index of an array for a matching string, which
can be used intead of open coded implementation.

Link: http://lkml.kernel.org/r/1536988365-50310-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
YueHaibing
c3df29d130 mm/swap.c: remove duplicated include
Remove duplicated include linux/memremap.h

Link: http://lkml.kernel.org/r/20180917131308.16420-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Michal Hocko
2c029a1ea3 mm, page_alloc: drop should_suppress_show_mem
should_suppress_show_mem() was introduced to reduce the overhead of
show_mem on large NUMA systems.  Things have changed since then though.
Namely c78e93630d ("mm: do not walk all of system memory during
show_mem") has reduced the overhead considerably.

Moreover warn_alloc_show_mem clears SHOW_MEM_FILTER_NODES when called from
the IRQ context already so we are not printing per node stats.

Remove should_suppress_show_mem because we are losing potentially
interesting information about allocation failures.  We have seen a bug
report where system gets unresponsive under memory pressure and there is
only

kernel: [2032243.696888] qlge 0000:8b:00.1 ql1: Could not get a page chunk, i=8, clean_idx =200 .
kernel: [2032243.710725] swapper/7: page allocation failure: order:1, mode:0x1084120(GFP_ATOMIC|__GFP_COLD|__GFP_COMP)

without an additional information for debugging.  It would be great to see
the state of the page allocator at the moment.

Link: http://lkml.kernel.org/r/20180907114334.7088-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Johannes Weiner
e9b257ed15 mm/memcontrol.c: fix memory.stat item ordering
The refault stats go better with the page fault stats, and are of
higher interest than the stats on LRU operations. In fact they used to
be grouped together; when the LRU operation stats were added later on,
they were wedged in between.

Move them back together. Documentation/admin-guide/cgroup-v2.rst
already lists them in the right order.

Link: http://lkml.kernel.org/r/20181010140239.GA2527@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Johannes Weiner
4b85afbdac mm: zero-seek shrinkers
The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.

Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.

This is a bogus setting.  A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.

In most cases, the behavior in practice is still fine.  Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems.  But there are scenarios where
that's not true.

Our database workloads suffer from two of those.  For one, their file
workingset is several times bigger than available memory, which has the
kernel aggressively create shadow page cache entries for the non-resident
parts of it.  The workingset code does tell the VM that most of these are
expendable, but the VM ends up balancing them 2:1 to cache pages as per
the seeks setting.  This is a huge waste of memory.

These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.

This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches.  This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.

It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.

Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Domas Mituzas <dmituzas@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Johannes Weiner
68d48e6a2d mm: workingset: add vmstat counter for shadow nodes
Make it easier to catch bugs in the shadow node shrinker by adding a
counter for the shadow nodes in circulation.

[akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()]
[akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes]
Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Johannes Weiner
505802a535 mm: workingset: use cheaper __inc_lruvec_state in irqsafe node reclaim
No need to use the preemption-safe lruvec state function inside the
reclaim region that has irqs disabled.

Link: http://lkml.kernel.org/r/20181009184732.762-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
eb414681d5 psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
b1d29ba82c delayacct: track delays from thrashing cache pages
Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks spend can spend
a significant amount of their time waiting on thrashing page cache.  This
isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.

Also update tools/accounting/getdelays.c:

     [hannes@computer accounting]$ sudo ./getdelays -d -p 1
     print delayacct stats ON
     PID     1

     CPU             count     real total  virtual total    delay total  delay average
                     50318      745000000      847346785      400533713          0.008ms
     IO              count    delay total  delay average
                       435      122601218              0ms
     SWAP            count    delay total  delay average
                         0              0              0ms
     RECLAIM         count    delay total  delay average
                         0              0              0ms
     THRASHING       count    delay total  delay average
                        19       12621439              0ms

Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
1899ad18c6 mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place
thrashing.  Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache.  When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime.  This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags

	21 if you have an MMU

	23 with the zone bits for DMA, Normal, HighMem, Movable

	29 with the sparsemem section bits

	30 if PAE is enabled

	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner
95f9ab2d59 mm: workingset: don't drop refault information prematurely
Patch series "psi: pressure stall information for CPU, memory, and IO", v4.

		Overview

PSI reports the overall wallclock time in which the tasks in a system (or
cgroup) wait for (contended) hardware resources.

This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid; as well as anticipate major disruptions like OOM.

		Real-world applications

We're using the data collected by PSI (and its previous incarnation,
memdelay) quite extensively at Facebook, and with several success stories.

One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables.
There is no situation where this ever makes sense in practice.  We wrote a
<100 line POC python script to monitor memory pressure and kill stuff way
before such pathological thrashing leads to full system losses that would
require forcible hard resets.

We've since extended and deployed this code into other places to guarantee
latency and throughput SLAs, since they're usually violated way before the
kernel OOM killer would ever kick in.

It is available here: https://github.com/facebookincubator/oomd

Eventually we probably want to trigger the in-kernel OOM killer based on
extreme sustained pressure as well, so that Linux can avoid memory
livelocks - which technically aren't deadlocks, but to the user
indistinguishable from them - out of the box.  We'd continue using OOMD as
the first line of defense to ensure workload health and implement complex
kill policies that are beyond the scope of the kernel.

We also use PSI memory pressure for loadshedding.  Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success.  We switched it to PSI
and managed to anticipate and avoid OOM kills and lockups fairly reliably.
The reduction of OOM outages in the worker pool raised the pool's
aggregate productivity, and we were able to switch that service to smaller
machines.

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as well as
to prevent multiple workloads on a machine from stepping on each others'
toes.  We were not able to configure this properly without the pressure
metrics; we would see latency or bandwidth drops, but it would often be
hard to impossible to rootcause it post-mortem.

We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.

PSI has also received testing, feedback, and feature requests from Android
and EndlessOS for the purpose of low-latency OOM killing, to intervene in
pressure situations before the UI starts hanging.

		How do you use this feature?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
files: cpu, memory, and io.  If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.

The cpu file contains one line:

	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which one or more tasks
are delayed on the runqueue while another task has the CPU.  They're
recent averages over 10s, 1m, 5m windows, so you can tell short term
trends from long term ones, similarly to the load average.

The total= value gives the absolute stall time in microseconds.  This
allows detecting latency spikes that might be too short to sway the
running averages.  It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
future hardware).

What to make of this "some" metric?  If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting.  At two or more runnable tasks
per CPU, the system is 100% overcommitted and the pressure average will
indicate as much.  From a utilization perspective this is a great state of
course: no CPU cycles are being wasted, even when 50% of the threads were
to go idle (as most workloads do vary).  From the perspective of the
individual job it's not great, however, and they would do better with more
resources.  Depending on what your priority and options are, raised "some"
numbers may or may not require action.

The memory file contains two lines:

some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu, the time in which at least one task
is stalled on the resource.  In the case of memory, this includes waiting
on swap-in, page cache refaults and page reclaim.

The full line, however, indicates time in which *nobody* is using the CPU
productively due to pressure: all non-idle tasks are waiting for memory in
one form or another.  Significant time spent in there is a good trigger
for killing things, moving jobs to other machines, or dropping incoming
requests, since neither the jobs nor the machine overall are making too
much headway.

The io file is similar to memory.  Because the block layer doesn't have a
concept of hardware contention right now (how much longer is my IO request
taking due to other tasks?), it reports CPU potential lost on all IO
delays, not just the potential lost due to competition.

		FAQ

Q: How is PSI's CPU component different from the load average?

A: There are several quirks in the load average that make it hard to
   impossible to tell how overcommitted the CPU really is.

   1. The load average is reported as a raw number of active tasks.
      You need to know how many CPUs there are in the system, how many
      CPUs the workload is allowed to use, then think about what the
      proportion between load and the number of CPUs mean for the
      tasks trying to run.

      PSI reports the percentage of wallclock time in which tasks are
      waiting for a CPU to run on. It doesn't matter how many CPUs are
      present or usable. The number always tells the quality of life
      of tasks in the system or in a particular cgroup.

   2. The shortest averaging window is 1m, which is extremely coarse,
      and it's sampled in 5s intervals. A *lot* can happen on a CPU in
      5 seconds. This *may* be able to identify persistent long-term
      trends and very clear and obvious overloads, but it's unusable
      for latency spikes and more subtle overutilization.

      PSI's shortest window is 10s. It also exports the cumulative
      stall times (in microseconds) of synchronously recorded events.

   3. On Linux, the load average for historical reasons includes all
      TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
      busy the system is, but on the flipside it doesn't distinguish
      whether tasks are likely to contend over the CPU or IO - which
      obviously requires very different interventions from a sys admin
      or a job scheduler.

      PSI reports independent metrics for CPU and IO. You can tell
      which resource is making the tasks wait, but in conjunction
      still see how overloaded the system is overall.

Q: What's the cost / performance impact of this feature?

A: PSI's primary cost is in the scheduler, in particular task wakeups
   and sleeps.

   I benchmarked this code using Facebook's two most scheduling
   sensitive workloads: memcache and webserver. They handle a ton of
   small requests - lots of wakeups and sleeps with little actual work
   in between - so they tend to be canaries for scheduler regressions.

   In the tests, the boxes were handling live traffic over the course
   of several hours. Half the machines, the control, ran with
   CONFIG_PSI=n.

   For memcache I used eight machines total. They're 2-socket, 14
   core, 56 thread boxes. The test runs for half the test period,
   flips the test and control kernels on the hardware to rule out HW
   factors, DC location etc., then runs the other half of the test.

   For the webservers, I used 32 machines total. They're single
   socket, 16 core, 32 thread machines.

   During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
   the first half and nopsi=77.52% psi=78.25%, so PSI added between
   0.7 and 0.9 percentage points to the CPU load, a difference of
   about 1%.

   UPDATE: I re-ran this test with the v3 version of this patch set
   and the CPU utilization was equivalent between test and control.

   UPDATE: v4 is on par with v3.

   As far as end-to-end request latency from the client perspective
   goes, we don't sample those finely enough to capture the requests
   going to those particular machines during the test, but we know the
   p50 turnaround time in this workload is 54us, and perf bench sched
   pipe on those machines show nopsi=5.232666 us/op and psi=5.587347
   us/op, so this doesn't add much here either.

   The profile for the pipe benchmark shows:

        0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
        0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
        0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
        0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change

   The webserver load is running inside 4 nested cgroup levels. The
   CPU load with both nopsi and psi kernels was indistinguishable at
   81%.

   For comparison, we had to disable the cgroup cpu controller on the
   webservers because it added 4 percentage points to the CPU% during
   this same exact test.

   Versions of this accounting code now run on 80% of our fleet. None
   of our workloads have reported regressions during the rollout.

Daniel Drake said:

: I just retested the latest version at
: http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
: are great.
:
: Test setup:
: Endless OS
: GeminiLake N4200 low end laptop
: 2GB RAM
: swap (and zram swap) disabled
:
: Baseline test: open a handful of large-ish apps and several website
: tabs in Google Chrome.
:
: Results: after a couple of minutes, system is excessively thrashing, mouse
: cursor can barely be moved, UI is not responding to mouse clicks, so it's
: impractical to recover from this situation as an ordinary user
:
: Add my simple killer:
: https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
:
: Results: when the thrashing causes the UI to become sluggish, the killer
: steps in and kills something (usually a chrome tab), and the system
: remains usable.  I repeatedly opened more apps and more websites over a 15
: minute period but I wasn't able to get the system to a point of UI
: unresponsiveness.

Suren said:

: Backported to 4.9 and retested on ARMv8 8 code system running Android.
: Signals behave as expected reacting to memory pressure, no jumps in
: "total" counters that would indicate an overflow/underflow issues.  Nicely
: done!

This patch (of 9):

If we keep just enough refault information to match the *current* page
cache during reclaim time, we could lose a lot of events when there is
only a temporary spike in non-cache memory consumption that pushes out all
the cache.  Once cache comes back, we won't see those refaults.  They
might not be actionable for LRU aging, but we want to know about them for
measuring memory pressure.

[hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
  Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <jweiner@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Vlastimil Babka
f0d7787414 mm, slab: shorten kmalloc cache names for large sizes
Kmalloc cache names can get quite long for large object sizes, when the
sizes are expressed in bytes.  Use 'k' and 'M' prefixes to make the names
as short as possible e.g.  in /proc/slabinfo.  This works, as we mostly
use power-of-two sizes, with exceptions only below 1k.

Example: 'kmalloc-4194304' becomes 'kmalloc-4M'

Link: http://lkml.kernel.org/r/20180731090649.16028-7-vbabka@suse.cz
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Vlastimil Babka
b29940c1ab mm: rename and change semantics of nr_indirectly_reclaimable_bytes
The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
commit eb59254608 ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
the goal of accounting objects that can be reclaimed, but cannot be
allocated via a SLAB_RECLAIM_ACCOUNT cache.  This is now possible via
kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
is converted.

The counter is however still useful for accounting direct page allocations
(i.e.  not slab) with a shrinker, such as the ION page pool.  So keep it,
and:

- change granularity to pages to be more like other counters; sub-page
  allocations should be able to use kmalloc
- rename the counter to NR_KERNEL_MISC_RECLAIMABLE
- expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
  again remove the check for not printing "hidden" counters

Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Vlastimil Babka
1291523f2c mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker).  This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is reflected also the
MemAvailable meminfo counter and in overcommit decisions.  The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.

The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag.  A
prominent example are dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc5 ("dcache: account external names as indirectly
reclaimable memory").

To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags.  They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.

This change only applies to SLAB and SLUB, not SLOB.  This is fine, since
SLOB's target are tiny system and this patch does add some overhead of
kmem management objects.

Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:31 -07:00
Vlastimil Babka
cc252eae85 mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.

As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names.  That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.

With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting.  More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations.  The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs.  Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some).  A
pathological case would be e.g.  one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.

If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.

Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output.  The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.

This is how /proc/slabinfo looks like after booting in virtme:

...
kmalloc-rcl-4M         0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
...
kmalloc-rcl-96         7     32    128   32    1 : tunables  120   60    8 : slabdata      1      1      0
kmalloc-rcl-64        25    128     64   64    1 : tunables  120   60    8 : slabdata      2      2      0
kmalloc-rcl-32         0      0     32  124    1 : tunables  120   60    8 : slabdata      0      0      0
kmalloc-4M             0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
kmalloc-2M             0      0 2097152    1  512 : tunables    1    1    0 : slabdata      0      0      0
kmalloc-1M             0      0 1048576    1  256 : tunables    1    1    0 : slabdata      0      0      0
...

/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:

...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...

/proc/meminfo with new KReclaimable counter:

...
Shmem:               564 kB
KReclaimable:      11260 kB
Slab:              18368 kB
SReclaimable:      11260 kB
SUnreclaim:         7108 kB
KernelStack:        1248 kB
...

This patch (of 6):

The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations.  There are tests for
__GFP_DMA in the allocation hotpaths.  We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type".  This
will also allow to add kmalloc-reclaimable caches as a third type.

Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:31 -07:00
Andrea Arcangeli
3b9aadf727 userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults
get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would
not be waiting for userfaults before failing and it would hit on a SIGBUS
instead.  Using get_user_pages_locked/unlocked instead will allow
get_mempolicy to allow userfaults to resolve the fault and fill the hole,
before grabbing the node id of the page.

If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
address inside an area managed by uffd and there is no page at that
address, the page allocation from within get_mempolicy() will fail
because get_user_pages() does not allow for page fault retry required
for uffd; the user will get SIGBUS.

With this patch, the page fault will be resolved by the uffd and the
get_mempolicy() will continue normally.

Background:

Via code review, previously the syscall would have returned -EFAULT
(vm_fault_to_errno), now it will block and wait for an userfault (if
it's waken before the fault is resolved it'll still -EFAULT).

This way get_mempolicy will give a chance to an "unaware" app to be
compliant with userfaults.

The reason this visible change is that becoming "userfault compliant"
cannot regress anything: all other syscalls including read(2)/write(2)
had to become "userfault compliant" long time ago (that's one of the
things userfaultfd can do that PROT_NONE and trapping segfaults can't).

So this is just one more syscall that become "userfault compliant" like
all other major ones already were.

This has been happening on virtio-bridge dpdk process which just called
get_mempolicy on the guest space post live migration, but before the
memory had a chance to be migrated to destination.

I didn't run an strace to be able to show the -EFAULT going away, but
I've the confirmation of the below debug aid information (only visible
with CONFIG_DEBUG_VM=y) going away with the patch:

    [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
    [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
    [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
    [20116.371466] Call Trace:
    [20116.371473]  dump_stack+0x5c/0x80
    [20116.371476]  handle_userfault.cold.37+0x1b/0x22
    [20116.371479]  ? remove_wait_queue+0x20/0x60
    [20116.371481]  ? poll_freewait+0x45/0xa0
    [20116.371483]  ? do_sys_poll+0x31c/0x520
    [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
    [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
    [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
    [20116.371493]  shmem_fault+0x78/0x1e0
    [20116.371495]  ? filemap_map_pages+0x3a1/0x450
    [20116.371498]  __do_fault+0x1f/0xc0
    [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
    [20116.371502]  handle_mm_fault+0xda/0x200
    [20116.371504]  __get_user_pages+0x238/0x790
    [20116.371506]  get_user_pages+0x3e/0x50
    [20116.371510]  kernel_get_mempolicy+0x40b/0x700
    [20116.371512]  ? vfs_write+0x170/0x1a0
    [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
    [20116.371517]  do_syscall_64+0x5b/0x160
    [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The above harmless debug message (not a kernel crash, just a
dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
and improve kernel spots that may have to become "userfaultfd
compliant" like this one (without having to run an strace and search
for syscall misbehavior).  Spots like the above are more closer to a
kernel bug for the non-cooperative usages that Mike focuses on, than
for for dpdk qemu-cooperative usages that reproduced it, but it's still
nicer to get this fixed for dpdk too.

The part of the patch that caused me to think is only the
implementation issue of mpol_get, but it looks like it should work safe
no matter the kind of mempolicy structure that is (the default static
policy also starts at 1 so it'll go to 2 and back to 1 without crashing
everything at 0).

[rppt@linux.vnet.ibm.com: changelog addition]
  http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
9b5a8e00d4 mm: convert insert_pfn() to vm_fault_t
All callers convert its errno into a vm_fault_t, so convert it to return a
vm_fault_t directly.

Link: http://lkml.kernel.org/r/20180828145728.11873-11-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
79f3aa5ba9 mm: convert __vm_insert_mixed() to vm_fault_t
Both of its callers currently convert its errno return into a vm_fault_t,
so move the conversion into __vm_insert_mixed().

Link: http://lkml.kernel.org/r/20180828145728.11873-10-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
6d958546ff mm: inline vm_insert_pfn_prot() into caller
vm_insert_pfn_prot() is only called from vmf_insert_pfn_prot(), so inline
it and convert some of the errnos into vm_fault codes earlier.

Link: http://lkml.kernel.org/r/20180828145728.11873-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
ae2b01f370 mm: remove vm_insert_pfn()
All callers are now converted to vmf_insert_pfn() so convert
vmf_insert_pfn() from being a compatibility wrapper around vm_insert_pfn()
to being a compatibility wrapper around vmf_insert_pfn_prot().

Link: http://lkml.kernel.org/r/20180828145728.11873-8-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
bc12e6ad96 mm: make vm_insert_pfn_prot() static
Now this is no longer used outside mm/memory.c, make it static.

Link: http://lkml.kernel.org/r/20180828145728.11873-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Matthew Wilcox
f5e6d1d5f8 mm: introduce vmf_insert_pfn_prot()
Like vm_insert_pfn_prot(), but returns a vm_fault_t instead of an errno.
Also unexport vm_insert_pfn_prot as it has no modular users.

Link: http://lkml.kernel.org/r/20180828145728.11873-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Matthew Wilcox
5d74763745 mm: remove vm_insert_mixed()
All callers are now converted to vmf_insert_mixed() so convert
vmf_insert_mixed() from being a compatibility wrapper into the real
function.

Link: http://lkml.kernel.org/r/20180828145728.11873-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Souptick Joarder
4b96a37d1c mm: convert to use vm_fault_t
As part of vm_fault_t conversion filemap_page_mkwrite() for the NOMMU case
was missed.  Now converted.

Link: http://lkml.kernel.org/r/20180828174952.GA29229@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Oscar Salvador
7b0e0c0e35 mm/page_alloc.c: clean up check_for_memory()
check_for_memory() looks a bit confusing.  First of all, we have this:

if (N_MEMORY == N_NORMAL_MEMORY)
	return;

Checking the ENUM declaration, looks like N_MEMORY canot be equal to
N_NORMAL_MEMORY.

I could not find where N_MEMORY is set to N_NORMAL_MEMORY, or the other
way around either, so unless I am missing something, this condition will
never evaluate to true.  It makes sense to get rid of it.

Moving forward, the operations within the loop look a bit confusing as
well.

We set N_HIGH_MEMORY unconditionally, and then we set N_NORMAL_MEMORY in
case we have CONFIG_HIGHMEM (N_NORMAL_MEMORY != N_HIGH_MEMORY) and zone <=
ZONE_NORMAL.  (N_HIGH_MEMORY falls back to N_NORMAL_MEMORY on
!CONFIG_HIGHMEM systems, and that is why we can just go ahead and set
N_HIGH_MEMORY unconditionally)

Although this works, it is a bit subtle.

I think that this could be easier to follow:

First, we should only set N_HIGH_MEMORY in case we have CONFIG_HIGHMEM.
And then we should set N_NORMAL_MEMORY in case zone <= ZONE_NORMAL,
without further checking whether we have CONFIG_HIGHMEM or not.

Link: http://lkml.kernel.org/r/20180828210158.4617-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Huang Ying
979aafa591 mm/swapfile.c: clear si->swap_map[] in swap_free_cluster()
si->swap_map[] of the swap entries in cluster needs to be cleared during
freeing.  Previously, this is done in the caller of swap_free_cluster().
This may cause code duplication (one user now, will add more users later)
and lock/unlock cluster unnecessarily.  In this patch, the clearing code
is moved to swap_free_cluster() to avoid the downside.

Link: http://lkml.kernel.org/r/20180827075535.17406-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Huang Ying
10e364da10 mm/swapfile.c: call free_swap_slot() in __swap_entry_free()
This is a code cleanup patch without functionality change.

Originally, when __swap_entry_free() is called, and its return value is 0,
free_swap_slot() will always be called to free the swap entry to the
per-CPU pool.  So move the call to free_swap_slot() to __swap_entry_free()
to simplify the code.

Link: http://lkml.kernel.org/r/20180827075535.17406-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Huang Ying
bcd49e8671 mm/swapfile.c: use __try_to_reclaim_swap() in free_swap_and_cache()
The code path to reclaim the swap entry in free_swap_and_cache() is
almost same as that of __try_to_reclaim_swap().  The largest
difference is just coding style.  So the support to the additional
requirement of free_swap_and_cache() is added into
__try_to_reclaim_swap().  free_swap_and_cache() is changed to call
__try_to_reclaim_swap(), and delete the duplicated code.  This will
improve code readability and reduce the potential bugs.

There are 2 functionality differences between __try_to_reclaim_swap()
and swap entry reclaim code of free_swap_and_cache().

- free_swap_and_cache() only reclaims the swap entry if the page is
  unmapped or swap is getting full.  The support has been added into
  __try_to_reclaim_swap().

- try_to_free_swap() (called by __try_to_reclaim_swap()) checks
  pm_suspended_storage(), while free_swap_and_cache() not.  I think
  this is OK.  Because the page and the swap entry can be reclaimed
  later eventually.

Link: http://lkml.kernel.org/r/20180827075535.17406-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Vincent Whitchurch
154221c3e5 kmemleak: add module param to print warnings to dmesg
Currently, kmemleak only prints the number of suspected leaks to dmesg but
requires the user to read a debugfs file to get the actual stack traces of
the objects' allocation points.  Add a module option to print the full
object information to dmesg too.  It can be enabled with
kmemleak.verbose=1 on the kernel command line, or "echo 1 >
/sys/module/kmemleak/parameters/verbose":

This allows easier integration of kmemleak into test systems: We have
automated test infrastructure to test our Linux systems.  With this
option, running our tests with kmemleak is as simple as enabling kmemleak
and passing this command line option; the test infrastructure knows how to
save kernel logs, which will now include kmemleak reports.  Without this
option, the test infrastructure needs to be specifically taught to read
out the kmemleak debugfs file.  Removing this need for special handling
makes kmemleak more similar to other kernel debug options (slab debugging,
debug objects, etc).

Link: http://lkml.kernel.org/r/20180903144046.21023-1-vincent.whitchurch@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Michal Hocko
4e15a073a1 Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks"
Revert 5ff7091f5a ("mm, mmu_notifier: annotate mmu notifiers with
blockable invalidate callbacks").

MMU_INVALIDATE_DOES_NOT_BLOCK flags was the only one used and it is no
longer needed since 93065ac753 ("mm, oom: distinguish blockable mode for
mmu notifiers").  We now have a full support for per range !blocking
behavior so we can drop the stop gap workaround which the per notifier
flag was used for.

Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Michal Hocko
15f570bf3d mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry()
Tetsuo Handa has reported that it is possible to bypass the short sleep
for PF_WQ_WORKER threads which was introduced by commit 373ccbe592
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
any progress") and lock up the system if OOM.

The primary reason is that WQ_MEM_RECLAIM WQs are not guaranteed to run
even when they have a rescuer available.  Those workers might be essential
for reclaim to make a forward progress, however.  If we are too unlucky
all the allocations requests can get stuck waiting for a WQ_MEM_RECLAIM
work item and the system is essentially stuck in an OOM condition without
much hope to move on.  Tetsuo has seen the reclaim stuck on
drain_local_pages_wq or xlog_cil_push_work (xfs).  There might be others.

Since should_reclaim_retry() should be a natural reschedule point,
let's do the short sleep for PF_WQ_WORKER threads unconditionally in
order to guarantee that other pending work items are started.  This
will workaround this problem and it is less fragile than hunting down
when the sleep is missed.  Having a single sleeping point is more
robust.

[akpm@linux-foundation.org: reflow comment to 80 cols to save a couple of lines]
Link: http://lkml.kernel.org/r/20180827135101.15700-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Roman Gushchin
68600f623d mm: don't miss the last page because of round-off error
I've noticed, that dying memory cgroups are often pinned in memory by a
single pagecache page.  Even under moderate memory pressure they sometimes
stayed in such state for a long time.  That looked strange.

My investigation showed that the problem is caused by applying the LRU
pressure balancing math:

  scan = div64_u64(scan * fraction[lru], denominator),

where

  denominator = fraction[anon] + fraction[file] + 1.

Because fraction[lru] is always less than denominator, if the initial scan
size is 1, the result is always 0.

This means the last page is not scanned and has
no chances to be reclaimed.

Fix this by rounding up the result of the division.

In practice this change significantly improves the speed of dying cgroups
reclaim.

[guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments]
  Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle
Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Roman Gushchin
591edfb10a mm: drain memcg stocks on css offlining
Memcg charge is batched using per-cpu stocks, so an offline memcg can be
pinned by a cached charge up to a moment, when a process belonging to some
other cgroup will charge some memory on the same cpu.  In other words,
cached charges can prevent a memory cgroup from being reclaimed for some
time, without any clear need.

Let's optimize it by explicit draining of all stocks on css offlining.  As
draining is performed asynchronously, and is skipped if any parallel
draining is happening, it's cheap.

Link: http://lkml.kernel.org/r/20180827162621.30187-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Aaron Tomlin
c5fd3ca06b slub: extend slub debug to handle multiple slabs
Extend the slub_debug syntax to "slub_debug=<flags>[,<slub>]*", where
<slub> may contain an asterisk at the end.  For example, the following
would poison all kmalloc slabs:

	slub_debug=P,kmalloc*

and the following would apply the default flags to all kmalloc and all
block IO slabs:

	slub_debug=,bio*,kmalloc*

Please note that a similar patch was posted by Iliyan Malchev some time
ago but was never merged:

	https://marc.info/?l=linux-mm&m=131283905330474&w=2

Link: http://lkml.kernel.org/r/20180928111139.27962-1-atomlin@redhat.com
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Iliyan Malchev <malchev@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Dmitry Vyukov
61448479a9 mm: don't warn about large allocations for slab
Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE,
instead it falls back to kmalloc_large().

For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls
kmalloc_slab() for all allocations relying on NULL return value for
over-sized allocations.

This inconsistency leads to unwanted warnings from kmalloc_slab() for
over-sized allocations for slab.  Returning NULL for failed allocations is
the expected behavior.

Make slub and slab code consistent by checking size >
KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().

While we are here also fix the check in kmalloc_slab().  We should check
against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE.  It all kinda
worked because for slab the constants are the same, and slub always checks
the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab().  But if we
get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will
happen.  For example, in case of a newly introduced bug in slub code.

Also move the check in kmalloc_slab() from function entry to the size >
192 case.  This partially compensates for the additional check in slab
code and makes slub code a bit faster (at least theoretically).

Also drop __GFP_NOWARN in the warning check.  This warning means a bug in
slab code itself, user-passed flags have nothing to do with it.

Nothing of this affects slob.

Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Andy Shevchenko
0684e6526e mm/slub.c: switch to bitmap_zalloc()
Switch to bitmap_zalloc() to show clearly what we are allocating.  Besides
that it returns pointer of bitmap type instead of opaque void *.

Link: http://lkml.kernel.org/r/20180830104301.61649-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
Linus Torvalds
26873acacb Driver core patches for 4.20-rc1
Driver core patches for 4.20-rc1
 
 Here is a small number of driver core patches for 4.20-rc1.
 
 Not much happened here this merge window, only a very tiny number of
 patches that do:
 	- add BUS_ATTR_WO() for use by drivers
 	- component error path fixes
 	- kernfs range check fix
 	- other tiny error path fixes and const changes
 
 All of these have been in linux-next with no reported issues for a
 while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW9Lhtw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykHTgCguaJ3SgRefuC/WijjqboTC/SikCoAnRVTUxfU
 v8BisSN22kR3jmxwsXud
 =/IvY
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is a small number of driver core patches for 4.20-rc1.

  Not much happened here this merge window, only a very tiny number of
  patches that do:

   - add BUS_ATTR_WO() for use by drivers

   - component error path fixes

   - kernfs range check fix

   - other tiny error path fixes and const changes

  All of these have been in linux-next with no reported issues for a
  while"

* tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  devres: provide devm_kstrdup_const()
  mm: move is_kernel_rodata() to asm-generic/sections.h
  devres: constify p in devm_kfree()
  driver core: add BUS_ATTR_WO() macro
  kernfs: Fix range checks in kernfs_get_target_path
  component: fix loop condition to call unbind() if bind() fails
  drivers/base/devtmpfs.c: don't pretend path is const in delete_path
  kernfs: update comment about kernfs_path() return value
2018-10-26 08:42:25 -07:00
Wei Wang
d95f58f4a6 mm/page_poison: expose page_poisoning_enabled to kernel modules
In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-10-24 20:57:55 -04:00
Linus Torvalds
ba9f6f8954 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo updates from Eric Biederman:
 "I have been slowly sorting out siginfo and this is the culmination of
  that work.

  The primary result is in several ways the signal infrastructure has
  been made less error prone. The code has been updated so that manually
  specifying SEND_SIG_FORCED is never necessary. The conversion to the
  new siginfo sending functions is now complete, which makes it
  difficult to send a signal without filling in the proper siginfo
  fields.

  At the tail end of the patchset comes the optimization of decreasing
  the size of struct siginfo in the kernel from 128 bytes to about 48
  bytes on 64bit. The fundamental observation that enables this is by
  definition none of the known ways to use struct siginfo uses the extra
  bytes.

  This comes at the cost of a small user space observable difference.
  For the rare case of siginfo being injected into the kernel only what
  can be copied into kernel_siginfo is delivered to the destination, the
  rest of the bytes are set to 0. For cases where the signal and the
  si_code are known this is safe, because we know those bytes are not
  used. For cases where the signal and si_code combination is unknown
  the bits that won't fit into struct kernel_siginfo are tested to
  verify they are zero, and the send fails if they are not.

  I made an extensive search through userspace code and I could not find
  anything that would break because of the above change. If it turns out
  I did break something it will take just the revert of a single change
  to restore kernel_siginfo to the same size as userspace siginfo.

  Testing did reveal dependencies on preferring the signo passed to
  sigqueueinfo over si->signo, so bit the bullet and added the
  complexity necessary to handle that case.

  Testing also revealed bad things can happen if a negative signal
  number is passed into the system calls. Something no sane application
  will do but something a malicious program or a fuzzer might do. So I
  have fixed the code that performs the bounds checks to ensure negative
  signal numbers are handled"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
  signal: Guard against negative signal numbers in copy_siginfo_from_user32
  signal: Guard against negative signal numbers in copy_siginfo_from_user
  signal: In sigqueueinfo prefer sig not si_signo
  signal: Use a smaller struct siginfo in the kernel
  signal: Distinguish between kernel_siginfo and siginfo
  signal: Introduce copy_siginfo_from_user and use it's return value
  signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
  signal: Fail sigqueueinfo if si_signo != sig
  signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
  signal/unicore32: Use force_sig_fault where appropriate
  signal/unicore32: Generate siginfo in ucs32_notify_die
  signal/unicore32: Use send_sig_fault where appropriate
  signal/arc: Use force_sig_fault where appropriate
  signal/arc: Push siginfo generation into unhandled_exception
  signal/ia64: Use force_sig_fault where appropriate
  signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
  signal/ia64: Use the generic force_sigsegv in setup_frame
  signal/arm/kvm: Use send_sig_mceerr
  signal/arm: Use send_sig_fault where appropriate
  signal/arm: Use force_sig_fault where appropriate
  ...
2018-10-24 11:22:39 +01:00
David Howells
aa563d7bca iov_iter: Separate type from direction and use accessor functions
In the iov_iter struct, separate the iterator type from the iterator
direction and use accessor functions to access them in most places.

Convert a bunch of places to use switch-statements to access them rather
then chains of bitwise-AND statements.  This makes it easier to add further
iterator types.  Also, this can be more efficient as to implement a switch
of small contiguous integers, the compiler can use ~50% fewer compare
instructions than it has to use bitwise-and instructions.

Further, cease passing the iterator type into the iterator setup function.
The iterator function can set that itself.  Only the direction is required.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
00e2370744 iov_iter: Use accessor function
Use accessor functions to access an iterator's type and direction.  This
allows for the possibility of using some other method of determining the
type of iterator than if-chains with bitwise-AND conditions.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:40:44 +01:00
Linus Torvalds
99792e0cea Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Lots of changes in this cycle:

   - Lots of CPA (change page attribute) optimizations and related
     cleanups (Thomas Gleixner, Peter Zijstra)

   - Make lazy TLB mode even lazier (Rik van Riel)

   - Fault handler cleanups and improvements (Dave Hansen)

   - kdump, vmcore: Enable kdumping encrypted memory with AMD SME
     enabled (Lianbo Jiang)

   - Clean up VM layout documentation (Baoquan He, Ingo Molnar)

   - ... plus misc other fixes and enhancements"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
  x86/mm: Kill stray kernel fault handling comment
  x86/mm: Do not warn about PCI BIOS W+X mappings
  resource: Clean it up a bit
  resource: Fix find_next_iomem_res() iteration issue
  resource: Include resource end in walk_*() interfaces
  x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
  x86/mm: Remove spurious fault pkey check
  x86/mm/vsyscall: Consider vsyscall page part of user address space
  x86/mm: Add vsyscall address helper
  x86/mm: Fix exception table comments
  x86/mm: Add clarifying comments for user addr space
  x86/mm: Break out user address space handling
  x86/mm: Break out kernel address space handling
  x86/mm: Clarify hardware vs. software "error_code"
  x86/mm/tlb: Make lazy TLB mode lazier
  x86/mm/tlb: Add freed_tables element to flush_tlb_info
  x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
  smp,cpumask: introduce on_each_cpu_cond_mask
  smp: use __cpumask_set_cpu in on_each_cpu_cond
  ...
2018-10-23 17:05:28 +01:00
Linus Torvalds
0200fbdd43 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking and misc x86 updates from Ingo Molnar:
 "Lots of changes in this cycle - in part because locking/core attracted
  a number of related x86 low level work which was easier to handle in a
  single tree:

   - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E.
     McKenney, Andrea Parri)

   - lockdep scalability improvements and micro-optimizations (Waiman
     Long)

   - rwsem improvements (Waiman Long)

   - spinlock micro-optimization (Matthew Wilcox)

   - qspinlocks: Provide a liveness guarantee (more fairness) on x86.
     (Peter Zijlstra)

   - Add support for relative references in jump tables on arm64, x86
     and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens)

   - Be a lot less permissive on weird (kernel address) uaccess faults
     on x86: BUG() when uaccess helpers fault on kernel addresses (Jann
     Horn)

   - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav
     Amit)

   - ... and a handful of other smaller changes as well"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  locking/lockdep: Make global debug_locks* variables read-mostly
  locking/lockdep: Fix debug_locks off performance problem
  locking/pvqspinlock: Extend node size when pvqspinlock is configured
  locking/qspinlock_stat: Count instances of nested lock slowpaths
  locking/qspinlock, x86: Provide liveness guarantee
  x86/asm: 'Simplify' GEN_*_RMWcc() macros
  locking/qspinlock: Rework some comments
  locking/qspinlock: Re-order code
  locking/lockdep: Remove duplicated 'lock_class_ops' percpu array
  x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y
  futex: Replace spin_is_locked() with lockdep
  locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y
  x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
  x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
  x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
  x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
  x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
  x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
  x86/refcount: Work around GCC inlining bug
  x86/objtool: Use asm macros to work around GCC inlining bugs
  ...
2018-10-23 13:08:53 +01:00
Linus Torvalds
6ab9e09238 for-4.20/block-20181021
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
 3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
 VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
 WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
 lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
 5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
 TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
 C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
 WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
 wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
 jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
 J0oh0FHBDg==
 =LRTT
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "This is the main pull request for block changes for 4.20. This
  contains:

   - Series enabling runtime PM for blk-mq (Bart).

   - Two pull requests from Christoph for NVMe, with items such as;
      - Better AEN tracking
      - Multipath improvements
      - RDMA fixes
      - Rework of FC for target removal
      - Fixes for issues identified by static checkers
      - Fabric cleanups, as prep for TCP transport
      - Various cleanups and bug fixes

   - Block merging cleanups (Christoph)

   - Conversion of drivers to generic DMA mapping API (Christoph)

   - Series fixing ref count issues with blkcg (Dennis)

   - Series improving BFQ heuristics (Paolo, et al)

   - Series improving heuristics for the Kyber IO scheduler (Omar)

   - Removal of dangerous bio_rewind_iter() API (Ming)

   - Apply single queue IPI redirection logic to blk-mq (Ming)

   - Set of fixes and improvements for bcache (Coly et al)

   - Series closing a hotplug race with sysfs group attributes (Hannes)

   - Set of patches for lightnvm:
      - pblk trace support (Hans)
      - SPDX license header update (Javier)
      - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
        specs behind a common core interface. (Javier, Matias)
      - Enable pblk to use a common interface to retrieve chunk metadata
        (Matias)
      - Bug fixes (Various)

   - Set of fixes and updates to the blk IO latency target (Josef)

   - blk-mq queue number updates fixes (Jianchao)

   - Convert a bunch of drivers from the old legacy IO interface to
     blk-mq. This will conclude with the removal of the legacy IO
     interface itself in 4.21, with the rest of the drivers (me, Omar)

   - Removal of the DAC960 driver. The SCSI tree will introduce two
     replacement drivers for this (Hannes)"

* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
  block: setup bounce bio_sets properly
  blkcg: reassociate bios when make_request() is called recursively
  blkcg: fix edge case for blk_get_rl() under memory pressure
  nvme-fabrics: move controller options matching to fabrics
  nvme-rdma: always have a valid trsvcid
  mtip32xx: fully switch to the generic DMA API
  rsxx: switch to the generic DMA API
  umem: switch to the generic DMA API
  sx8: switch to the generic DMA API
  sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
  skd: switch to the generic DMA API
  ubd: remove use of blk_rq_map_sg
  nvme-pci: remove duplicate check
  drivers/block: Remove DAC960 driver
  nvme-pci: fix hot removal during error handling
  nvmet-fcloop: suppress a compiler warning
  nvme-core: make implicit seed truncation explicit
  nvmet-fc: fix kernel-doc headers
  nvme-fc: rework the request initialization code
  nvme-fc: introduce struct nvme_fcp_op_w_sgl
  ...
2018-10-22 17:46:08 +01:00
Linus Torvalds
5289851171 arm64 updates for 4.20:
- Core mmu_gather changes which allow tracking the levels of page-table
   being cleared together with the arm64 low-level flushing routines
 
 - Support for the new ARMv8.5 PSTATE.SSBS bit which can be used to
   mitigate Spectre-v4 dynamically without trapping to EL3 firmware
 
 - Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack
 
 - Optimise emulation of MRS instructions to ID_* registers on ARMv8.4
 
 - Support for Common Not Private (CnP) translations allowing threads of
   the same CPU to share the TLB entries
 
 - Accelerated crc32 routines
 
 - Move swapper_pg_dir to the rodata section
 
 - Trap WFI instruction executed in user space
 
 - ARM erratum 1188874 workaround (arch_timer)
 
 - Miscellaneous fixes and clean-ups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAlvKGdEACgkQa9axLQDI
 XvGSQBAAiOH6aQABL4TB7c5KIc7C+Unjm6QCFCoaeGWoHuemnM6cFJ7RQsi0GqnP
 dVEX5V/FKfmeTWO5g24Ah+MbTm3Bt6+81gywAmi1rrHhmCaCIPjT7xDqy/WsLlvt
 7WtgegSGvQ7DIMj2dbfFav6+ra67qAiYZTc46jvuynVl6DrE3BCiyTDbXAWt2nzP
 Xf3un4AHRbg3UEMUZTLqU5q4z0tbM6rEAZru8O0UOTnD2q7uttUqW3Ab7fpuEkkj
 lEVrMWD3h8SJg+Df9CbXmCNOjh4VhwBwDb5LgO8vA/AcyV/YLEF5b2OUAk/28qwo
 0GBwjqRyI4+YQ9LPg41MhGzrlnta0HCdYoeNLgLQZiDcUkuSfGhoA+MNZNOR8B08
 sCWF7F6f8UIQm8KMMBiYYdlVyUYgHLsWE/1+CyeLV0oIoWT5k3c+Xe3pho9KpVb0
 Co04TqMlqalry0sbevHz5c55H7iWIjB1Tpo3SxM105dVJVibXRPXkz+WZ5iPO+xa
 ex2j1kjNdA/AUzrSCZ5lh22zhg0WsfwD++E5meAaJMxieim8FeZDRga43rowJ0BA
 zMbSNB/+NDFZ9EhC40VaUfKk8Tkgiug9J5swv0+v7hy1QLDyydHhbOecTuIueauM
 6taiT2Iuov5yFng1eonYj4htvouVF4WOhPGthFPJMOcrB9mLMhs=
 =3Mc8
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "Apart from some new arm64 features and clean-ups, this also contains
  the core mmu_gather changes for tracking the levels of the page table
  being cleared and a minor update to the generic
  compat_sys_sigaltstack() introducing COMPAT_SIGMINSKSZ.

  Summary:

   - Core mmu_gather changes which allow tracking the levels of
     page-table being cleared together with the arm64 low-level flushing
     routines

   - Support for the new ARMv8.5 PSTATE.SSBS bit which can be used to
     mitigate Spectre-v4 dynamically without trapping to EL3 firmware

   - Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack

   - Optimise emulation of MRS instructions to ID_* registers on ARMv8.4

   - Support for Common Not Private (CnP) translations allowing threads
     of the same CPU to share the TLB entries

   - Accelerated crc32 routines

   - Move swapper_pg_dir to the rodata section

   - Trap WFI instruction executed in user space

   - ARM erratum 1188874 workaround (arch_timer)

   - Miscellaneous fixes and clean-ups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits)
  arm64: KVM: Guests can skip __install_bp_hardening_cb()s HYP work
  arm64: cpufeature: Trap CTR_EL0 access only where it is necessary
  arm64: cpufeature: Fix handling of CTR_EL0.IDC field
  arm64: cpufeature: ctr: Fix cpu capability check for late CPUs
  Documentation/arm64: HugeTLB page implementation
  arm64: mm: Use __pa_symbol() for set_swapper_pgd()
  arm64: Add silicon-errata.txt entry for ARM erratum 1188873
  Revert "arm64: uaccess: implement unsafe accessors"
  arm64: mm: Drop the unused cpu parameter
  MAINTAINERS: fix bad sdei paths
  arm64: mm: Use #ifdef for the __PAGETABLE_P?D_FOLDED defines
  arm64: Fix typo in a comment in arch/arm64/mm/kasan_init.c
  arm64: xen: Use existing helper to check interrupt status
  arm64: Use daifflag_restore after bp_hardening
  arm64: daifflags: Use irqflags functions for daifflags
  arm64: arch_timer: avoid unused function warning
  arm64: Trap WFI executed in userspace
  arm64: docs: Document SSBS HWCAP
  arm64: docs: Fix typos in ELF hwcaps
  arm64/kprobes: remove an extra semicolon in arch_prepare_kprobe
  ...
2018-10-22 17:30:06 +01:00
Matthew Wilcox
3a08cd52c3 radix tree: Remove multiorder support
All users have now been converted to the XArray.  Removing the support
reduces code size and ensures new users will use the XArray instead.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:48 -04:00
Matthew Wilcox
a283348629 page cache: Finish XArray conversion
With no more radix tree API users left, we can drop the GFP flags
and use xa_init() instead of INIT_RADIX_TREE().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:44 -04:00
Matthew Wilcox
7f4446eefe shmem: Comment fixups
Remove the last mentions of radix tree from various comments.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:41 -04:00
Matthew Wilcox
ef3038a573 memfd: Convert memfd_tag_pins to XArray
Switch to a batch-processing model like memfd_wait_for_pins() and
use the xa_state previously set up by memfd_wait_for_pins().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
2018-10-21 10:46:41 -04:00
Matthew Wilcox
2313216f86 memfd: Convert memfd_wait_for_pins to XArray
Simplify the locking by taking the spinlock while we walk the tree on
the assumption that many acquires and releases of the lock will be worse
than holding the lock while we process an entire batch of pages.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
2018-10-21 10:46:40 -04:00
Matthew Wilcox
7ae3424fb4 shmem: Convert shmem_partial_swap_usage to XArray
Simpler code because the xarray takes care of things like the limit and
dereferencing the slot.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:40 -04:00
Matthew Wilcox
c121d3bb71 shmem: Convert shmem_free_swap to XArray
Since we are conditionally storing NULL in the XArray, we do not need
to allocate memory and the GFP flags will be unused.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:40 -04:00
Matthew Wilcox
7b8d046fba shmem: Convert shmem_alloc_hugepage to XArray
xa_find() is a slightly easier API to use than
radix_tree_gang_lookup_slot() because it contains its own RCU locking.
This commit removes the last user of radix_tree_gang_lookup_slot()
so remove the function too.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:40 -04:00
Matthew Wilcox
552446a416 shmem: Convert shmem_add_to_page_cache to XArray
We can use xas_find_conflict() instead of radix_tree_gang_lookup_slot()
to find any conflicting entry and combine the three paths through this
function into one.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:40 -04:00
Matthew Wilcox
e21a29552f shmem: Convert find_swap_entry to XArray
This is a 1:1 conversion.  The major part of this patch is converting
the test framework from userspace to kernel space and mirroring the
algorithm now used in find_swap_entry().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:39 -04:00
Matthew Wilcox
a12831bf42 shmem: Convert shmem_confirm_swap to XArray
xa_load has its own RCU locking, so we can eliminate it here.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:39 -04:00
Matthew Wilcox
62f945b6a7 shmem: Convert shmem_radix_tree_replace to XArray
Rename shmem_radix_tree_replace() to shmem_replace_entry() and
convert it to use the XArray API.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:39 -04:00
Matthew Wilcox
10bbd23585 pagevec: Use xa_mark_t
Removes sparse warnings.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:39 -04:00
Matthew Wilcox
67891ffff2 mm: Convert is_page_cache_freeable to XArray
This is just a variable rename and comment change.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:38 -04:00
Matthew Wilcox
85b392dbac mm: Convert khugepaged_scan_shmem to XArray
Slightly shorter and easier to read code.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:38 -04:00
Matthew Wilcox
77da9389b9 mm: Convert collapse_shmem to XArray
I found another victim of the radix tree being hard to use.  Because
there was no call to radix_tree_preload(), khugepaged was allocating
radix_tree_nodes using GFP_ATOMIC.

I also converted a local_irq_save()/restore() pair to
disable()/enable().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:38 -04:00
Matthew Wilcox
aa5dc07f70 mm: Convert huge_memory to XArray
Quite a straightforward conversion.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:38 -04:00
Matthew Wilcox
89eb946a74 mm: Convert page migration to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:38 -04:00
Matthew Wilcox
560d454bae mm: Convert __do_page_cache_readahead to XArray
This one is trivial.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:37 -04:00
Matthew Wilcox
4e17ec250f mm: Convert delete_from_swap_cache to XArray
Both callers of __delete_from_swap_cache have the swp_entry_t already,
so pass that in to make constructing the XA_STATE easier.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:37 -04:00
Matthew Wilcox
8d93b41c09 mm: Convert add_to_swap_cache to XArray
Combine __add_to_swap_cache and add_to_swap_cache into one function
since there is no more need to preload.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:37 -04:00
Matthew Wilcox
69b6c1319b mm: Convert truncate to XArray
This is essentially xa_cmpxchg() with the locking handled above us,
and it doesn't have to handle replacing a NULL entry.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:37 -04:00
Matthew Wilcox
a97e7904c0 mm: Convert workingset to XArray
We construct an XA_STATE and use it to delete the node with
xas_store() rather than adding a special function for this unique
use case.  Includes a test that simulates this usage for the
test suite.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:36 -04:00
Matthew Wilcox
ff9c745b81 mm: Convert page-writeback to XArray
Includes moving mapping_tagged() to fs.h as a static inline, and
changing it to return bool.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:36 -04:00
Matthew Wilcox
8fa8e538e4 page cache: Convert filemap_range_has_page to XArray
Instead of calling find_get_pages_range() and putting any reference,
use xas_find() to iterate over any entries in the range, skipping the
shadow/swap entries.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:36 -04:00
Matthew Wilcox
22ecdb4f8b page cache: Remove stray radix comment
Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:36 -04:00
Matthew Wilcox
ef8e5717db page cache: Convert delete_batch to XArray
Rename the function from page_cache_tree_delete_batch to just
page_cache_delete_batch.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:36 -04:00
Matthew Wilcox
070e807c69 page cache: Convert filemap_map_pages to XArray
Slight change of strategy here; if we have trouble getting hold of a
page for whatever reason (eg a compound page is split underneath us),
don't spin to stabilise the page, just continue the iteration, like we
would if we failed to trylock the page.  Since this is a speculative
optimisation, it feels like we should allow the process to take an extra
fault if it turns out to need this page instead of spending time to pin
down a page it may not need.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:35 -04:00
Matthew Wilcox
c1901cd33c page cache: Convert find_get_entries_tag to XArray
Slightly shorter and simpler code.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:35 -04:00
Matthew Wilcox
a6906972fe page cache; Convert find_get_pages_range_tag to XArray
The 'end' parameter of the xas_for_each iterator avoids a useless
iteration at the end of the range.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:35 -04:00
Matthew Wilcox
3ece58a270 page cache: Convert find_get_pages_contig to XArray
There's no direct replacement for radix_tree_for_each_contig()
in the XArray API as it's an unusual thing to do.  Instead,
open-code a loop using xas_next().  This removes the only user of
radix_tree_for_each_contig() so delete the iterator from the API and
the test suite code for it.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:34 -04:00
Matthew Wilcox
fd1b3cee2a page cache: Convert find_get_pages_range to XArray
The 'end' parameter of the xas_for_each iterator avoids a useless
iteration at the end of the range.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:34 -04:00
Matthew Wilcox
f280bf092d page cache: Convert find_get_entries to XArray
Slightly shorter and simpler code.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:34 -04:00
Matthew Wilcox
4c7472c0df page cache: Convert find_get_entry to XArray
Slightly shorter and simpler code.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:34 -04:00
Matthew Wilcox
5c024e6a4e page cache: Convert page deletion to XArray
The code is slightly shorter and simpler.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:34 -04:00
Matthew Wilcox
74d609585d page cache: Add and replace pages using the XArray
Use the XArray APIs to add and replace pages in the page cache.  This
removes two uses of the radix tree preload API and is significantly
shorter code.  It also removes the last user of __radix_tree_create()
outside radix-tree.c itself, so make it static.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:33 -04:00
Matthew Wilcox
0d3f929666 page cache: Convert hole search to XArray
The page cache offers the ability to search for a miss in the previous or
next N locations.  Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array.  This should be more efficient as it does not have to start the
lookup from the top for each index.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:33 -04:00
Matthew Wilcox
01959dfe77 xarray: Define struct xa_node
This is a direct replacement for struct radix_tree_node.  A couple of
struct members have changed name, so convert those.  Use a #define so
that radix tree users continue to work without change.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-10-21 10:45:56 -04:00
Linus Torvalds
eb66ae0308 mremap: properly flush TLB before releasing the page
Jann Horn points out that our TLB flushing was subtly wrong for the
mremap() case.  What makes mremap() special is that we don't follow the
usual "add page to list of pages to be freed, then flush tlb, and then
free pages".  No, mremap() obviously just _moves_ the page from one page
table location to another.

That matters, because mremap() thus doesn't directly control the
lifetime of the moved page with a freelist: instead, the lifetime of the
page is controlled by the page table locking, that serializes access to
the entry.

As a result, we need to flush the TLB not just before releasing the lock
for the source location (to avoid any concurrent accesses to the entry),
but also before we release the destination page table lock (to avoid the
TLB being flushed after somebody else has already done something to that
page).

This also makes the whole "need_flush" logic unnecessary, since we now
always end up flushing the TLB for every valid entry.

Reported-and-tested-by: Jann Horn <jannh@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Tested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18 11:30:52 +02:00
Bartosz Golaszewski
59c3f82ad1 mm: move is_kernel_rodata() to asm-generic/sections.h
Export this routine so that we can use it later in devm_kstrdup_const()
and devm_kfree().

Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-16 12:53:27 +02:00
Jérôme Glisse
bfba8e5cf2 mm/thp: fix call to mmu_notifier in set_pmd_migration_entry() v2
Inside set_pmd_migration_entry() we are holding page table locks and thus
we can not sleep so we can not call invalidate_range_start/end()

So remove call to mmu_notifier_invalidate_range_start/end() because they
are call inside the function calling set_pmd_migration_entry() (see
try_to_unmap_one()).

Link: http://lkml.kernel.org/r/20181012181056.7864-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 09:31:02 +02:00
Jann Horn
7aa867dd89 mm/mmap.c: don't clobber partially overlapping VMA with MAP_FIXED_NOREPLACE
Daniel Micay reports that attempting to use MAP_FIXED_NOREPLACE in an
application causes that application to randomly crash.  The existing check
for handling MAP_FIXED_NOREPLACE looks up the first VMA that either
overlaps or follows the requested region, and then bails out if that VMA
overlaps *the start* of the requested region.  It does not bail out if the
VMA only overlaps another part of the requested region.

Fix it by checking that the found VMA only starts at or after the end of
the requested region, in which case there is no overlap.

Test case:

user@debian:~$ cat mmap_fixed_simple.c
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000
#endif

int main(void) {
  char *p;

  errno = 0;
  p = mmap((void*)0x10001000, 0x4000, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0);
  printf("p1=%p err=%m\n", p);

  errno = 0;
  p = mmap((void*)0x10000000, 0x2000, PROT_READ,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0);
  printf("p2=%p err=%m\n", p);

  char cmd[100];
  sprintf(cmd, "cat /proc/%d/maps", getpid());
  system(cmd);

  return 0;
}
user@debian:~$ gcc -o mmap_fixed_simple mmap_fixed_simple.c
user@debian:~$ ./mmap_fixed_simple
p1=0x10001000 err=Success
p2=0x10000000 err=Success
10000000-10002000 r--p 00000000 00:00 0
10002000-10005000 ---p 00000000 00:00 0
564a9a06f000-564a9a070000 r-xp 00000000 fe:01 264004
  /home/user/mmap_fixed_simple
564a9a26f000-564a9a270000 r--p 00000000 fe:01 264004
  /home/user/mmap_fixed_simple
564a9a270000-564a9a271000 rw-p 00001000 fe:01 264004
  /home/user/mmap_fixed_simple
564a9a54a000-564a9a56b000 rw-p 00000000 00:00 0                          [heap]
7f8eba447000-7f8eba5dc000 r-xp 00000000 fe:01 405885
  /lib/x86_64-linux-gnu/libc-2.24.so
7f8eba5dc000-7f8eba7dc000 ---p 00195000 fe:01 405885
  /lib/x86_64-linux-gnu/libc-2.24.so
7f8eba7dc000-7f8eba7e0000 r--p 00195000 fe:01 405885
  /lib/x86_64-linux-gnu/libc-2.24.so
7f8eba7e0000-7f8eba7e2000 rw-p 00199000 fe:01 405885
  /lib/x86_64-linux-gnu/libc-2.24.so
7f8eba7e2000-7f8eba7e6000 rw-p 00000000 00:00 0
7f8eba7e6000-7f8eba809000 r-xp 00000000 fe:01 405876
  /lib/x86_64-linux-gnu/ld-2.24.so
7f8eba9e9000-7f8eba9eb000 rw-p 00000000 00:00 0
7f8ebaa06000-7f8ebaa09000 rw-p 00000000 00:00 0
7f8ebaa09000-7f8ebaa0a000 r--p 00023000 fe:01 405876
  /lib/x86_64-linux-gnu/ld-2.24.so
7f8ebaa0a000-7f8ebaa0b000 rw-p 00024000 fe:01 405876
  /lib/x86_64-linux-gnu/ld-2.24.so
7f8ebaa0b000-7f8ebaa0c000 rw-p 00000000 00:00 0
7ffcc99fa000-7ffcc9a1b000 rw-p 00000000 00:00 0                          [stack]
7ffcc9b44000-7ffcc9b47000 r--p 00000000 00:00 0                          [vvar]
7ffcc9b47000-7ffcc9b49000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
  [vsyscall]
user@debian:~$ uname -a
Linux debian 4.19.0-rc6+ #181 SMP Wed Oct 3 23:43:42 CEST 2018 x86_64 GNU/Linux
user@debian:~$

As you can see, the first page of the mapping at 0x10001000 was clobbered.

Link: http://lkml.kernel.org/r/20181010152736.99475-1-jannh@google.com
Fixes: a4ff8e8620 ("mm: introduce MAP_FIXED_NOREPLACE")
Signed-off-by: Jann Horn <jannh@google.com>
Reported-by: Daniel Micay <danielmicay@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 09:31:02 +02:00
Greg Kroah-Hartman
a22dd3629e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Ingo writes:
  "scheduler fix:

   Cleanup of dead code left over from the recent sched/numa fixes."

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm, sched/numa: Remove remaining traces of NUMA rate-limiting
2018-10-11 12:27:47 +02:00
Peter Zijlstra
a31acd3ee8 x86/mm: Page size aware flush_tlb_mm_range()
Use the new tlb_get_unmap_shift() to determine the stride of the
INVLPG loop.

Cc: Nick Piggin <npiggin@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-10-09 16:51:11 +02:00
Peter Zijlstra
a5b966ae42 Merge branch 'tlb/asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into x86/mm
Pull in the generic mmu_gather changes from the ARM64 tree such that we
can put x86 specific things on top as well.
2018-10-09 16:50:43 +02:00
Srikar Dronamraju
e054637597 mm, sched/numa: Remove remaining traces of NUMA rate-limiting
Remove the leftover pglist_data::numabalancing_migrate_lock and its
initialization, we stopped using this lock with:

  efaffc5e40 ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration")

[ mingo: Rewrote the changelog. ]

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-09 08:30:51 +02:00
Mike Rapoport
6685b35736 percpu: stop leaking bitmap metadata blocks
The commit ca460b3c96 ("percpu: introduce bitmap metadata blocks")
introduced bitmap metadata blocks. These metadata blocks are allocated
whenever a new chunk is created, but they are never freed. Fix it.

Fixes: ca460b3c96 ("percpu: introduce bitmap metadata blocks")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2018-10-07 14:50:12 -07:00
Greg Kroah-Hartman
091a1eaa0e Merge branch 'akpm'
* akpm:
  mm: madvise(MADV_DODUMP): allow hugetlbfs pages
  ocfs2: fix locking for res->tracking and dlm->tracking_list
  mm/vmscan.c: fix int overflow in callers of do_shrink_slab()
  mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
  mm/vmstat.c: fix outdated vmstat_text
  proc: restrict kernel stack dumps to root
  mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes
  mm/migrate.c: split only transparent huge pages when allocation fails
  ipc/shm.c: use ERR_CAST() for shm_lock() error return
  mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl
  mm, thp: fix mlocking THP page with migration enabled
  ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
  hugetlb: take PMD sharing into account when flushing tlb/caches
  mm: migration: fix migration of huge PMD shared pages
2018-10-05 16:33:03 -07:00
Daniel Black
d41aa52523 mm: madvise(MADV_DODUMP): allow hugetlbfs pages
Reproducer, assuming 2M of hugetlbfs available:

Hugetlbfs mounted, size=2M and option user=testuser

  # mount | grep ^hugetlbfs
  hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
  # sysctl vm.nr_hugepages=1
  vm.nr_hugepages = 1
  # grep Huge /proc/meminfo
  AnonHugePages:         0 kB
  ShmemHugePages:        0 kB
  HugePages_Total:       1
  HugePages_Free:        1
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  Hugetlb:            2048 kB

Code:

  #include <sys/mman.h>
  #include <stddef.h>
  #define SIZE 2*1024*1024
  int main()
  {
    void *ptr;
    ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
    madvise(ptr, SIZE, MADV_DONTDUMP);
    madvise(ptr, SIZE, MADV_DODUMP);
  }

Compile and strace:

  mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
  madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
  madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

hugetlbfs pages have VM_DONTEXPAND in the VmFlags driver pages based on
author testing with analysis from Florian Weimer[1].

The inclusion of VM_DONTEXPAND into the VM_SPECIAL defination was a
consequence of the large useage of VM_DONTEXPAND in device drivers.

A consequence of [2] is that VM_DONTEXPAND marked pages are unable to be
marked DODUMP.

A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
memory for a while and later request that madvise(MADV_DODUMP) on the same
memory.  We correct this omission by allowing madvice(MADV_DODUMP) on
hugetlbfs pages.

[1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
[2] commit 0103bd16fb ("mm: prepare VM_DONTDUMP for using in drivers")

Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
Link: https://lists.launchpad.net/maria-discuss/msg05245.html
Fixes: 0103bd16fb ("mm: prepare VM_DONTDUMP for using in drivers")
Reported-by: Kenneth Penza <kpenza@gmail.com>
Signed-off-by: Daniel Black <daniel@linux.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Kirill Tkhai
b8e57efa2c mm/vmscan.c: fix int overflow in callers of do_shrink_slab()
do_shrink_slab() returns unsigned long value, and the placing into int
variable cuts high bytes off.  Then we compare ret and 0xfffffffe (since
SHRINK_EMPTY is converted to ret type).

Thus a large number of objects returned by do_shrink_slab() may be
interpreted as SHRINK_EMPTY, if low bytes of their value are equal to
0xfffffffe.  Fix that by declaration ret as unsigned long in these
functions.

Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Jann Horn
58bc4c34d2 mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
5dd0b16cda ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
the kernel unconditional to reduce #ifdef soup, but (either to avoid
showing dummy zero counters to userspace, or because that code was missed)
didn't update the vmstat_array, meaning that all following counters would
be shown with incorrect values.

This only affects kernel builds with
CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
Fixes: 5dd0b16cda ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Jann Horn
28e2c4bb99 mm/vmstat.c: fix outdated vmstat_text
7a9cdebdcc ("mm: get rid of vmacache_flush_all() entirely") removed the
VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
entry in vmstat_text.  This causes an out-of-bounds access in
vmstat_show().

Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
is probably very rare.

Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
Fixes: 7a9cdebdcc ("mm: get rid of vmacache_flush_all() entirely")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Anshuman Khandual
e6112fc300 mm/migrate.c: split only transparent huge pages when allocation fails
split_huge_page_to_list() fails on HugeTLB pages.  I was experimenting
with moving 32MB contig HugeTLB pages on arm64 (with a debug patch
applied) and hit the following stack trace when the kernel crashed.

[ 3732.462797] Call trace:
[ 3732.462835]  split_huge_page_to_list+0x3b0/0x858
[ 3732.462913]  migrate_pages+0x728/0xc20
[ 3732.462999]  soft_offline_page+0x448/0x8b0
[ 3732.463097]  __arm64_sys_madvise+0x724/0x850
[ 3732.463197]  el0_svc_handler+0x74/0x110
[ 3732.463297]  el0_svc+0x8/0xc
[ 3732.463347] Code: d1000400 f90b0e60 f2fbd5a2 a94982a1 (f9000420)

When unmap_and_move[_huge_page]() fails due to lack of memory, the
splitting should happen only for transparent huge pages not for HugeTLB
pages.  PageTransHuge() returns true for both THP and HugeTLB pages.
Hence the conditonal check should test PagesHuge() flag to make sure that
given pages is not a HugeTLB one.

Link: http://lkml.kernel.org/r/1537798495-4996-1-git-send-email-anshuman.khandual@arm.com
Fixes: 94723aafb9 ("mm: unclutter THP migration")
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
YueHaibing
5189686457 mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl
get_user_pages_fast() will return negative value if no pages were pinned,
then be converted to a unsigned, which is compared to zero, giving the
wrong result.

Link: http://lkml.kernel.org/r/20180921095015.26088-1-yuehaibing@huawei.com
Fixes: 09e35a4a1c ("mm/gup_benchmark: handle gup failures")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Kirill A. Shutemov
e125fe405a mm, thp: fix mlocking THP page with migration enabled
A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.

We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on border of VM_LOCKED VMA will be split into PTE table.

Introduction of THP migration breaks[1] the rules around mlocking THP
pages.  If we had a single PMD mapping of the page in mlocked VMA, the
page will get mlocked, regardless of PTE mappings of the page.

For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
remove_migration_pmd().

Anon THP pages can only be shared between processes via fork().  Mlocked
page can only be shared if parent mlocked it before forking, otherwise CoW
will be triggered on mlock().

For Anon-THP, we can fix the issue by munlocking the page on removing PTE
migration entry for the page.  PTEs for the page will always come after
mlocked PMD: rmap walks VMAs from oldest to newest.

Test-case:

	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <linux/mempolicy.h>
	#include <numaif.h>

	int main(void)
	{
	        unsigned long nodemask = 4;
	        void *addr;

		addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

	        if (fork()) {
			wait(NULL);
			return 0;
	        }

	        mlock(addr, 4UL << 10);
	        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
	                &nodemask, 4, MPOL_MF_MOVE);

	        return 0;
	}

[1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
Fixes: 616b837153 ("mm: thp: enable thp migration in generic path")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Mike Kravetz
dff11abe28 hugetlb: take PMD sharing into account when flushing tlb/caches
When fixing an issue with PMD sharing and migration, it was discovered via
code inspection that other callers of huge_pmd_unshare potentially have an
issue with cache and tlb flushing.

Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst
case ranges for mmu notifiers.  Ensure that this range is flushed if
huge_pmd_unshare succeeds and unmaps a PUD_SUZE area.

Link: http://lkml.kernel.org/r/20180823205917.16297-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Mike Kravetz
017b1660df mm: migration: fix migration of huge PMD shared pages
The page migration code employs try_to_unmap() to try and unmap the source
page.  This is accomplished by using rmap_walk to find all vmas where the
page is mapped.  This search stops when page mapcount is zero.  For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings.  Shared mappings are tracked via the reference count of the PMD
page.  Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.

This problem can result is data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors.  DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages.  A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page.  After this, flush caches and TLB.

mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked.  Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.

Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
  Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes: 39dde65c99 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Ingo Molnar
c0554d2d3d Merge branch 'linus' into x86/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-04 08:23:03 +02:00
Mel Gorman
efaffc5e40 mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
Rate limiting of page migrations due to automatic NUMA balancing was
introduced to mitigate the worst-case scenario of migrating at high
frequency due to false sharing or slowly ping-ponging between nodes.
Since then, a lot of effort was spent on correctly identifying these
pages and avoiding unnecessary migrations and the safety net may no longer
be required.

Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
avoids spreading STREAM tasks wide prematurely. However, once the task
was properly placed, it delayed migrating the memory due to rate limiting.
Increasing the limit fixed the problem for him.

Currently, the limit is hard-coded and does not account for the real
capabilities of the hardware. Even if an estimate was attempted, it would
not properly account for the number of memory controllers and it could
not account for the amount of bandwidth used for normal accesses. Rather
than fudging, this patch simply eliminates the rate limiting.

However, Jirka reports that a STREAM configuration using multiple
processes achieved similar performance to 4.16. In local tests, this patch
improved performance of STREAM relative to the baseline but it is somewhat
machine-dependent. Most workloads show little or not performance difference
implying that there is not a heavily reliance on the throttling mechanism
and it is safe to remove.

STREAM on 2-socket machine
                         4.19.0-rc5             4.19.0-rc5
                         numab-v1r1       noratelimit-v1r1
MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 11:31:14 +02:00
Srikar Dronamraju
7534612123 mm/migrate: Use spin_trylock() while resetting rate limit
Since this spinlock will only serialize the migrate rate limiting,
convert the spin_lock() to a spin_trylock(). If another thread is updating, this
task can move on.

Specjbb2005 results (8 warehouses)
Higher bops are better

2 Socket - 2  Node Haswell - X86
JVMS  Prev    Current  %Change
4     205332  198512   -3.32145
1     319785  313559   -1.94693

2 Socket - 4 Node Power8 - PowerNV
JVMS  Prev    Current  %Change
8     74912   74761.9  -0.200368
1     206585  214874   4.01239

2 Socket - 2  Node Power9 - PowerNV
JVMS  Prev    Current  %Change
4     189162  180536   -4.56011
1     213760  210281   -1.62753

4 Socket - 4  Node Power7 - PowerVM
JVMS  Prev     Current  %Change
8     58736.8  56511.4  -3.78877
1     105419   104899   -0.49327

Avoiding stretching of window intervals may be the reason for the
regression. Also code now uses READ_ONCE/WRITE_ONCE. That may
also be hurting performance to some extent.

Some events stats before and after applying the patch.

perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
Event                     Before          After
cs                        14,285,708      13,818,546
migrations                1,180,621       1,149,960
faults                    339,114         385,583
cache-misses              55,205,631,894  55,259,546,768
sched:sched_move_numa     843             2,257
sched:sched_stick_numa    6               9
sched:sched_swap_numa     219             512
migrate:mm_migrate_pages  365             2,225

vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
Event                   Before  After
numa_hint_faults        26907   72692
numa_hint_faults_local  24279   62270
numa_hit                239771  238762
numa_huge_pte_updates   0       48
numa_interleave         68      75
numa_local              239688  238676
numa_other              83      86
numa_pages_migrated     363     2225
numa_pte_updates        27415   98557

perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
Event                     Before          After
cs                        3,202,779       3,173,490
migrations                37,186          36,966
faults                    106,076         108,776
cache-misses              12,024,873,744  12,200,075,320
sched:sched_move_numa     931             1,264
sched:sched_stick_numa    0               0
sched:sched_swap_numa     1               0
migrate:mm_migrate_pages  637             899

vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
Event                   Before  After
numa_hint_faults        17409   21109
numa_hint_faults_local  14367   17120
numa_hit                73953   72934
numa_huge_pte_updates   20      42
numa_interleave         25      33
numa_local              73892   72866
numa_other              61      68
numa_pages_migrated     668     915
numa_pte_updates        27276   42326

perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
Event                     Before       After
cs                        8,474,013    8,312,022
migrations                254,934      231,705
faults                    320,506      310,242
cache-misses              110,580,458  402,324,573
sched:sched_move_numa     725          193
sched:sched_stick_numa    0            0
sched:sched_swap_numa     7            3
migrate:mm_migrate_pages  145          93

vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
Event                   Before  After
numa_hint_faults        22797   11838
numa_hint_faults_local  21539   11216
numa_hit                89308   90689
numa_huge_pte_updates   0       0
numa_interleave         865     1579
numa_local              88955   89634
numa_other              353     1055
numa_pages_migrated     149     92
numa_pte_updates        22930   12109

perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
Event                     Before     After
cs                        2,195,628  2,170,481
migrations                11,179     10,126
faults                    149,656    160,962
cache-misses              8,117,515  10,834,845
sched:sched_move_numa     49         10
sched:sched_stick_numa    0          0
sched:sched_swap_numa     0          0
migrate:mm_migrate_pages  5          2

vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
Event                   Before  After
numa_hint_faults        3577    403
numa_hint_faults_local  3476    358
numa_hit                26142   25898
numa_huge_pte_updates   0       0
numa_interleave         358     207
numa_local              26042   25860
numa_other              100     38
numa_pages_migrated     5       2
numa_pte_updates        3587    400

perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
Event                     Before           After
cs                        100,602,296      110,339,633
migrations                4,135,630        4,139,812
faults                    789,256          863,622
cache-misses              226,160,621,058  231,838,045,660
sched:sched_move_numa     1,366            2,196
sched:sched_stick_numa    16               33
sched:sched_swap_numa     374              544
migrate:mm_migrate_pages  1,350            2,469

vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
Event                   Before  After
numa_hint_faults        47857   85748
numa_hint_faults_local  39768   66831
numa_hit                240165  242213
numa_huge_pte_updates   0       0
numa_interleave         0       0
numa_local              240165  242211
numa_other              0       2
numa_pages_migrated     1224    2376
numa_pte_updates        48354   86233

perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
Event                     Before          After
cs                        58,515,496      59,331,057
migrations                564,845         552,019
faults                    245,807         266,586
cache-misses              73,603,757,976  73,796,312,990
sched:sched_move_numa     996             981
sched:sched_stick_numa    10              54
sched:sched_swap_numa     193             286
migrate:mm_migrate_pages  646             713

vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
Event                   Before  After
numa_hint_faults        13422   14807
numa_hint_faults_local  5619    5738
numa_hit                36118   36230
numa_huge_pte_updates   0       0
numa_interleave         0       0
numa_local              36116   36228
numa_other              2       2
numa_pages_migrated     616     703
numa_pte_updates        13374   14742

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-6-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 09:42:26 +02:00
Jens Axboe
c0aac682fa This is the 4.19-rc6 release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
 aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
 8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
 VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
 j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
 IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
 gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
 UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
 D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
 IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
 aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
 C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
 =dMl4
 -----END PGP SIGNATURE-----

Merge tag 'v4.19-rc6' into for-4.20/block

Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
   they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v4.19-rc6': (780 commits)
  Linux 4.19-rc6
  MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
  cpufreq: qcom-kryo: Fix section annotations
  perf/core: Add sanity check to deal with pinned event failure
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  selftests/powerpc: Fix Makefiles for headers_install change
  blk-mq: I/O and timer unplugs are inverted in blktrace
  dax: Fix deadlock in dax_lock_mapping_entry()
  x86/boot: Fix kexec booting failure in the SEV bit detection code
  bcache: add separate workqueue for journal_write to avoid deadlock
  drm/amd/display: Fix Edid emulation for linux
  drm/amd/display: Fix Vega10 lightup on S3 resume
  drm/amdgpu: Fix vce work queue was not cancelled when suspend
  Revert "drm/panel: Add device_link from panel device to DRM device"
  xen/blkfront: When purging persistent grants, keep them in the buffer
  clocksource/drivers/timer-atmel-pit: Properly handle error cases
  block: fix deadline elevator drain for zoned block devices
  ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
  drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01 08:58:57 -06:00
Matthew Wilcox
3159f943aa xarray: Replace exceptional entries
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries.  This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry).  It is also a change in emphasis; exceptional entries are
intimidating and different.  As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-09-29 22:47:49 -04:00
Dennis Zhou (Facebook)
74b7c02a9b blkcg: associate a blkg for pages being evicted by swap
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify the way
swap tags bios to include the blkg. Writeback will be tackle in the next
patch.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:09 -06:00
Roman Gushchin
172b06c32b mm: slowly shrink slabs with a relatively small number of objects
9092c71bb7 ("mm: use sc->priority for slab shrink targets") changed the
way that the target slab pressure is calculated and made it
priority-based:

    delta = freeable >> priority;
    delta *= 4;
    do_div(delta, shrinker->seeks);

The problem is that on a default priority (which is 12) no pressure is
applied at all, if the number of potentially reclaimable objects is less
than 4096 (1<<12).

This causes the last objects on slab caches of no longer used cgroups to
(almost) never get reclaimed.  It's obviously a waste of memory.

It can be especially painful, if these stale objects are holding a
reference to a dying cgroup.  Slab LRU lists are reparented on memcg
offlining, but corresponding objects are still holding a reference to the
dying cgroup.  If we don't scan these objects, the dying cgroup can't go
away.  Most likely, the parent cgroup hasn't any directly charged objects,
only remaining objects from dying children cgroups.  So it can easily hold
a reference to hundreds of dying cgroups.

If there are no big spikes in memory pressure, and new memory cgroups are
created and destroyed periodically, this causes the number of dying
cgroups grow steadily, causing a slow-ish and hard-to-detect memory
"leak".  It's not a real leak, as the memory can be eventually reclaimed,
but it could not happen in a real life at all.  I've seen hosts with a
steadily climbing number of dying cgroups, which doesn't show any signs of
a decline in months, despite the host is loaded with a production
workload.

It is an obvious waste of memory, and to prevent it, let's apply a minimal
pressure even on small shrinker lists.  E.g.  if there are freeable
objects, let's scan at least min(freeable, scan_batch) objects.

This fix significantly improves a chance of a dying cgroup to be
reclaimed, and together with some previous patches stops the steady growth
of the dying cgroups number on some of our hosts.

Link: http://lkml.kernel.org/r/20180905230759.12236-1-guro@fb.com
Fixes: 9092c71bb7 ("mm: use sc->priority for slab shrink targets")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:11 +02:00
Joel Fernandes (Google)
b45d71fb89 mm: shmem.c: Correctly annotate new inodes for lockdep
Directories and inodes don't necessarily need to be in the same lockdep
class.  For ex, hugetlbfs splits them out too to prevent false positives
in lockdep.  Annotate correctly after new inode creation.  If its a
directory inode, it will be put into a different class.

This should fix a lockdep splat reported by syzbot:

> ======================================================
> WARNING: possible circular locking dependency detected
> 4.18.0-rc8-next-20180810+ #36 Not tainted
> ------------------------------------------------------
> syz-executor900/4483 is trying to acquire lock:
> 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock
> include/linux/fs.h:765 [inline]
> 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at:
> shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
>
> but task is already holding lock:
> 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
> drivers/staging/android/ashmem.c:448
>
> which lock already depends on the new lock.
>
> -> #2 (ashmem_mutex){+.+.}:
>        __mutex_lock_common kernel/locking/mutex.c:925 [inline]
>        __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
>        mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
>        ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
>        call_mmap include/linux/fs.h:1844 [inline]
>        mmap_region+0xf27/0x1c50 mm/mmap.c:1762
>        do_mmap+0xa10/0x1220 mm/mmap.c:1535
>        do_mmap_pgoff include/linux/mm.h:2298 [inline]
>        vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
>        ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
>        __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
>        __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
>        __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
>        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #1 (&mm->mmap_sem){++++}:
>        __might_fault+0x155/0x1e0 mm/memory.c:4568
>        _copy_to_user+0x30/0x110 lib/usercopy.c:25
>        copy_to_user include/linux/uaccess.h:155 [inline]
>        filldir+0x1ea/0x3a0 fs/readdir.c:196
>        dir_emit_dot include/linux/fs.h:3464 [inline]
>        dir_emit_dots include/linux/fs.h:3475 [inline]
>        dcache_readdir+0x13a/0x620 fs/libfs.c:193
>        iterate_dir+0x48b/0x5d0 fs/readdir.c:51
>        __do_sys_getdents fs/readdir.c:231 [inline]
>        __se_sys_getdents fs/readdir.c:212 [inline]
>        __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
>        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #0 (&sb->s_type->i_mutex_key#9){++++}:
>        lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
>        down_write+0x8f/0x130 kernel/locking/rwsem.c:70
>        inode_lock include/linux/fs.h:765 [inline]
>        shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
>        ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
>        ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
>        vfs_ioctl fs/ioctl.c:46 [inline]
>        file_ioctl fs/ioctl.c:501 [inline]
>        do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
>        ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
>        __do_sys_ioctl fs/ioctl.c:709 [inline]
>        __se_sys_ioctl fs/ioctl.c:707 [inline]
>        __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
>        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> other info that might help us debug this:
>
> Chain exists of:
>   &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(ashmem_mutex);
>                                lock(&mm->mmap_sem);
>                                lock(ashmem_mutex);
>   lock(&sb->s_type->i_mutex_key#9);
>
>  *** DEADLOCK ***
>
> 1 lock held by syz-executor900/4483:
>  #0: 0000000025208078 (ashmem_mutex){+.+.}, at:
> ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448

Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Suggested-by: NeilBrown <neilb@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:11 +02:00
Pasha Tatashin
889c695d41 mm: disable deferred struct page for 32-bit arches
Deferred struct page init is needed only on systems with large amount of
physical memory to improve boot performance.  32-bit systems do not
benefit from this feature.

Jiri reported a problem where deferred struct pages do not work well with
x86-32:

[    0.035162] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[    0.035725] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[    0.036269] Initializing CPU#0
[    0.036513] Initializing HighMem for node 0 (00036ffe:0007ffe0)
[    0.038459] page:f6780000 is uninitialized and poisoned
[    0.038460] raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
[    0.039509] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
[    0.040038] ------------[ cut here ]------------
[    0.040399] kernel BUG at include/linux/page-flags.h:293!
[    0.040823] invalid opcode: 0000 [#1] SMP PTI
[    0.041166] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc1_pt_jiri #9
[    0.041694] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
[    0.042496] EIP: free_highmem_page+0x64/0x80
[    0.042839] Code: 13 46 d8 c1 e8 18 5d 83 e0 03 8d 04 c0 c1 e0 06 ff 80 ec 5f 44 d8 c3 8d b4 26 00 00 00 00 ba 08 65 28 d8 89 d8 e8 fc 71 02 00 <0f> 0b 8d 76 00 8d bc 27 00 00 00 00 ba d0 b1 26 d8 89 d8 e8 e4 71
[    0.044338] EAX: 0000003c EBX: f6780000 ECX: 00000000 EDX: d856cbe8
[    0.044868] ESI: 0007ffe0 EDI: d838df20 EBP: d838df00 ESP: d838defc
[    0.045372] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210086
[    0.045913] CR0: 80050033 CR2: 00000000 CR3: 18556000 CR4: 00040690
[    0.046413] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[    0.046913] DR6: fffe0ff0 DR7: 00000400
[    0.047220] Call Trace:
[    0.047419]  add_highpages_with_active_regions+0xbd/0x10d
[    0.047854]  set_highmem_pages_init+0x5b/0x71
[    0.048202]  mem_init+0x2b/0x1e8
[    0.048460]  start_kernel+0x1d2/0x425
[    0.048757]  i386_start_kernel+0x93/0x97
[    0.049073]  startup_32_smp+0x164/0x168
[    0.049379] Modules linked in:
[    0.049626] ---[ end trace 337949378db0abbb ]---

We free highmem pages before their struct pages are initialized:

mem_init()
 set_highmem_pages_init()
  add_highpages_with_active_regions()
   free_highmem_page()
    .. Access uninitialized struct page here..

Because there is no reason to have this feature on 32-bit systems, just
disable it.

Link: http://lkml.kernel.org/r/20180831150506.31246-1-pavel.tatashin@microsoft.com
Fixes: 2e3ca40f03 ("mm: relax deferred struct page requirements")
Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reported-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:11 +02:00
Linus Torvalds
48751b562b overlayfs fixes for 4.19-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW5qpOgAKCRDh3BK/laaZ
 PDCQAQCIKLg0aLeWOkfUO76mBjlp5srKgJfrqpFoyuozD6l2fQEAl/W2x9NOduV+
 PK4sCYMT8SpI0hMrbv9P4zZ683kmaA8=
 =RnZU
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "This fixes a regression in the recent file stacking update, reported
  and fixed by Amir Goldstein. The fix is fairly trivial, but involves
  adding a fadvise() f_op and the associated churn in the vfs. As
  discussed on -fsdevel, there are other possible uses for this method,
  than allowing proper stacking for overlays.

  And there's one other fix for a syzkaller detected oops"

* tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix oopses in ovl_fill_super() failure paths
  ovl: add ovl_fadvise()
  vfs: implement readahead(2) using POSIX_FADV_WILLNEED
  vfs: add the fadvise() file operation
  Documentation/filesystems: update documentation of file_operations
  ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  ovl: respect FIEMAP_FLAG_SYNC flag
2018-09-13 19:21:40 -10:00
Linus Torvalds
7a9cdebdcc mm: get rid of vmacache_flush_all() entirely
Jann Horn points out that the vmacache_flush_all() function is not only
potentially expensive, it's buggy too.  It also happens to be entirely
unnecessary, because the sequence number overflow case can be avoided by
simply making the sequence number be 64-bit.  That doesn't even grow the
data structures in question, because the other adjacent fields are
already 64-bit.

So simplify the whole thing by just making the sequence number overflow
case go away entirely, which gets rid of all the complications and makes
the code faster too.  Win-win.

[ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
  also just goes away entirely with this ]

Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-13 15:18:04 -10:00
Igor Stoppa
0b59c25f91 mm: percpu: remove unnecessary unlikely()
WARN_ON() already contains an unlikely(), so it's not necessary to
wrap it into another.

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: zijun_hu <zijun_hu@htc.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2018-09-12 21:23:23 -07:00
Eric W. Biederman
079b22dc9b signal: Use SEND_SIG_PRIV not SEND_SIG_FORCED with SIGKILL and SIGSTOP
Now that siginfo is never allocated for SIGKILL and SIGSTOP there is
no difference between SEND_SIG_PRIV and SEND_SIG_FORCED for SIGKILL
and SIGSTOP.  This makes SEND_SIG_FORCED unnecessary and redundant in
the presence of SIGKILL and SIGSTOP.  Therefore change users of
SEND_SIG_FORCED that are sending SIGKILL or SIGSTOP to use
SEND_SIG_PRIV instead.

This removes the last users of SEND_SIG_FORCED.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11 21:19:35 +02:00
Peter Zijlstra
196d9d8bb7 mm/memory: Move mmu_gather and TLB invalidation code into its own file
In preparation for maintaining the mmu_gather code as its own entity,
move the implementation out of memory.c and into its own file.

Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-09-07 15:19:25 +01:00
Linus Torvalds
ca16eb342e for-linus-20180906
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAluRkywQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm8uEAC8vBFb5tzZ2dOeRbGQ6LaPTToBmRrLtOcP
 kDRnfZIw0raNStOpn1dkGLz8IOSjwOGftx9Q4pJed25vynTEq5lYmmLVUlJQ6cJ7
 oNpYiCdPxJvbKz5fChGG2nHHa1RLer1d728NZtkeZU/ChPmw56EO5ORghE7zPG7K
 Z/0qHYsgwS427o8pUDsymmt6I62IJGrjzqJdC0pqBy6RePQWtlwkmtd7CIgFiffY
 tDnk6RSwcihnIalMMLvFXeGf6cSaZvuH4oK1QNdfojAyS8kWeA6gHtjRS8UcuuUY
 t6o+hU0vki8bghoNoI40RrLgAmV91BVv1/Voo79dQvDWAigyie51HwFFkqdWzJxJ
 g4MCZYpys26w/VUGBFCku0hiRIAhZFO8Sun5zbVCJpyt8hTXF0RrG3CpwmCF7Lc0
 m+h8tJanEMCesfYMztTD31L1BOFhJeOgBJr4a5QURy0LbIvC0V52IKiOQ0475E8E
 H10rQaRw/7Am+mZugedMUGMgYD/eN33NQoRuTWZdck/58big2SU78zGpR/GqTmy3
 w9v2I8ksBTivzEayBV0G4Z5Gxu7QYA7NMsO5RS/wuGfUX8D/1QtQU9Ejh5TESbek
 R3WUyhXJJ2S+DWTUlmX7TgPxYxG3sXatQbSAgFJiucxyIRdpdqfeoXmOHvPrWZEq
 O3VDm0D6pw==
 =qhv7
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180906' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Small collection of fixes that should go into this release. This
  contains:

   - Small series that fixes a race between blkcg teardown and writeback
     (Dennis Zhou)

   - Fix disallowing invalid block size settings from the nbd ioctl (me)

   - BFQ fix for a use-after-free on last release of a bfqg (Konstantin
     Khlebnikov)

   - Fix for the "don't warn for flush" fix (Mikulas)"

* tag 'for-linus-20180906' of git://git.kernel.dk/linux-block:
  block: bfq: swap puts in bfqg_and_blkg_put
  block: don't warn when doing fsync on read-only devices
  nbd: don't allow invalid blocksize settings
  blkcg: use tryget logic when associating a blkg with a bio
  blkcg: delay blkg destruction until after writeback has finished
  Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
2018-09-06 14:01:15 -07:00
Dave Jiang
62ec0d8c4f mm: fix BUG_ON() in vmf_insert_pfn_pud() from VM_MIXEDMAP removal
It looks like I missed the PUD path when doing VM_MIXEDMAP removal.
This can be triggered by:
1. Boot with memmap=4G!8G
2. build ndctl with destructive flag on
3. make TESTS=device-dax check

[  +0.000675] kernel BUG at mm/huge_memory.c:824!

Applying the same change that was applied to vmf_insert_pfn_pmd() in the
original patch.

Link: http://lkml.kernel.org/r/153565957352.35524.1005746906902065126.stgit@djiang5-desk3.ch.intel.com
Fixes: e1fb4a0864 ("dax: remove VM_MIXEDMAP for fsdax and device dax")
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Aneesh Kumar K.V
464c7ffbcb mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.
When scanning for movable pages, filter out Hugetlb pages if hugepage
migration is not supported.  Without this we hit infinte loop in
__offline_pages() where we do

	pfn = scan_movable_pages(start_pfn, end_pfn);
	if (pfn) { /* We have movable pages */
		ret = do_migrate_range(pfn, end_pfn);
		goto repeat;
	}

Fix this by checking hugepage_migration_supported both in
has_unmovable_pages which is the primary backoff mechanism for page
offlining and for consistency reasons also into scan_movable_pages
because it doesn't make any sense to return a pfn to non-migrateable
huge page.

This issue was revealed by, but not caused by 72b39cfc4d ("mm,
memory_hotplug: do not fail offlining too early").

Link: http://lkml.kernel.org/r/20180824063314.21981-1-aneesh.kumar@linux.ibm.com
Fixes: 72b39cfc4d ("mm, memory_hotplug: do not fail offlining too early")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Haren Myneni <haren@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Andrew Morton
04b8e94607 mm/util.c: improve kvfree() kerneldoc
Scooped from an email from Matthew.

Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Vincent Whitchurch
b353756b2b kmemleak: always register debugfs file
If kmemleak built in to the kernel, but is disabled by default, the
debugfs file is never registered.  Because of this, it is not possible
to find out if the kernel is built with kmemleak support by checking for
the presence of this file.  To allow this, always register the file.

After this patch, if the file doesn't exist, kmemleak is not available
in the kernel.  If writing "scan" or any other value than "clear" to
this file results in EBUSY, then kmemleak is available but is disabled
by default and can be activated via the kernel command line.

Catalin: "that's also consistent with a late disabling of kmemleak when
the debugfs entry sticks around."

Link: http://lkml.kernel.org/r/20180824131220.19176-1-vincent.whitchurch@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Tetsuo Handa
79cc81057e mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm().
Commit 93065ac753 ("mm, oom: distinguish blockable mode for mmu
notifiers") has added an ability to skip over vmas with blockable mmu
notifiers. This however didn't call tlb_finish_mmu as it should.

As a result inc_tlb_flush_pending has been called without its pairing
dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush
even though this is not really needed.  This alone is not harmful and it
seems there shouldn't be any such callers for oom victims at all but
there is no real reason to skip tlb_finish_mmu on early skip either so
call it.

[mhocko@suse.com: new changelog]
Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Johannes Weiner
3100dab2aa mm: memcontrol: print proper OOM header when no eligible victim left
When the memcg OOM killer runs out of killable tasks, it currently
prints a WARN with no further OOM context.  This has caused some user
confusion.

Warnings indicate a kernel problem.  In a reported case, however, the
situation was triggered by a nonsensical memcg configuration (hard limit
set to 0).  But without any VM context this wasn't obvious from the
report, and it took some back and forth on the mailing list to identify
what is actually a trivial issue.

Handle this OOM condition like we handle it in the global OOM killer:
dump the full OOM context and tell the user we ran out of tasks.

This way the user can identify misconfigurations easily by themselves
and rectify the problem - without having to go through the hassle of
running into an obscure but unsettling warning, finding the appropriate
kernel mailing list and waiting for a kernel developer to remote-analyze
that the memcg configuration caused this.

If users cannot make sense of why the OOM killer was triggered or why it
failed, they will still report it to the mailing list, we know that from
experience.  So in case there is an actual kernel bug causing this,
kernel developers will very likely hear about it.

Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Will Deacon
a6d60245d6 asm-generic/tlb: Track which levels of the page tables have been cleared
It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-09-04 11:08:27 +01:00
Jann Horn
9da3f2b740 x86/fault: BUG() when uaccess helpers fault on kernel addresses
There have been multiple kernel vulnerabilities that permitted userspace to
pass completely unchecked pointers through to userspace accessors:

 - the waitid() bug - commit 96ca579a1e ("waitid(): Add missing
   access_ok() checks")
 - the sg/bsg read/write APIs
 - the infiniband read/write APIs

These don't happen all that often, but when they do happen, it is hard to
test for them properly; and it is probably also hard to discover them with
fuzzing. Even when an unmapped kernel address is supplied to such buggy
code, it just returns -EFAULT instead of doing a proper BUG() or at least
WARN().

Try to make such misbehaving code a bit more visible by refusing to do a
fixup in the pagefault handler code when a userspace accessor causes a #PF
on a kernel address and the current context isn't whitelisted.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
2018-09-03 15:12:09 +02:00
Dennis Zhou (Facebook)
59b57717ff blkcg: delay blkg destruction until after writeback has finished
Currently, blkcg destruction relies on a sequence of events:
  1. Destruction starts. blkcg_css_offline() is called and blkgs
     release their reference to the blkcg. This immediately destroys
     the cgwbs (writeback).
  2. With blkgs giving up their reference, the blkcg ref count should
     become zero and eventually call blkcg_css_free() which finally
     frees the blkcg.

Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction
related to writeback.

The new process for blkcg cleanup is as follows:
  1. Destruction starts. blkcg_css_offline() is called which offlines
     writeback. Blkg destruction is delayed on the cgwb_refcnt count to
     avoid punting potentially large amounts of outstanding writeback
     to root while maintaining any ongoing policies. Here, the base
     cgwb_refcnt is put back.
  2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
     and handles destruction of blkgs. This is where the css reference
     held by each blkg is released.
  3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
     This finally frees the blkg.

It seems in the past blk-throttle didn't do the most understandable
things with taking data from a blkg while associating with current. So,
the simplification and unification of what blk-throttle is doing caused
this.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:56 -06:00
Amir Goldstein
3d8f761531 vfs: implement readahead(2) using POSIX_FADV_WILLNEED
The implementation of readahead(2) syscall is identical to that of
fadvise64(POSIX_FADV_WILLNEED) with a few exceptions:
1. readahead(2) returns -EINVAL for !mapping->a_ops and fadvise64()
   ignores the request and returns 0.
2. fadvise64() checks for integer overflow corner case
3. fadvise64() calls the optional filesystem fadvise() file operation

Unite the two implementations by calling vfs_fadvise() from readahead(2)
syscall. Check the !mapping->a_ops in readahead(2) syscall to preserve
documented syscall ABI behaviour.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d1d04ef857 ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-08-30 20:01:32 +02:00
Amir Goldstein
45cd0faae3 vfs: add the fadvise() file operation
This is going to be used by overlayfs and possibly useful
for other filesystems.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-08-30 17:08:35 +02:00
Mukesh Ojha
13ba17bee1 notifier: Remove notifier header file wherever not used
The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
2018-08-30 12:56:40 +02:00
Linus Torvalds
2923b27e54 libnvdimm-for-4.19_dax-memory-failure
* memory_failure() gets confused by dev_pagemap backed mappings. The
   recovery code has specific enabling for several possible page states
   that needs new enabling to handle poison in dax mappings. Teach
   memory_failure() about ZONE_DEVICE pages.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
 OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
 75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
 5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
 BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
 3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
 F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
 T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
 MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
 oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
 /kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
 nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
 =Ftop
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm memory-failure update from Dave Jiang:
 "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
  backed mappings. The recovery code has specific enabling for several
  possible page states and needs new enabling to handle poison in dax
  mappings.

  In order to support reliable reverse mapping of user space addresses:

   1/ Add new locking in the memory_failure() rmap path to prevent races
      that would typically be handled by the page lock.

   2/ Since dev_pagemap pages are hidden from the page allocator and the
      "compound page" accounting machinery, add a mechanism to determine
      the size of the mapping that encompasses a given poisoned pfn.

   3/ Given pmem errors can be repaired, change the speculatively
      accessed poison protection, mce_unmap_kpfn(), to be reversible and
      otherwise allow ongoing access from the kernel.

  A side effect of this enabling is that MADV_HWPOISON becomes usable
  for dax mappings, however the primary motivation is to allow the
  system to survive userspace consumption of hardware-poison via dax.
  Specifically the current behavior is:

     mce: Uncorrected hardware memory error in user-access at af34214200
     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
     mce: [Hardware Error]: Machine check events logged
     {1}[Hardware Error]: event severity: corrected
     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
     [..]
     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
     mce: Memory error not recovered
     <reboot>

  ...and with these changes:

     Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
     Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
     Memory failure: 0x20cb00: recovery action for dax page: Recovered

  Given all the cross dependencies I propose taking this through
  nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
  folks"

* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, pmem: Restore page attributes when clearing errors
  x86/memory_failure: Introduce {set, clear}_mce_nospec()
  x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  filesystem-dax: Introduce dax_lock_mapping_entry()
  mm, memory_failure: Collect mapping size in collect_procs()
  mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  mm, dev_pagemap: Do not clear ->mapping on final put
  mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
  filesystem-dax: Set page->index
  device-dax: Set page->index
  device-dax: Enable page_mapping()
  device-dax: Convert to vmf_insert_mixed and vm_fault_t
2018-08-25 18:43:59 -07:00
Linus Torvalds
1b2de5d039 mm/cow: don't bother write protecting already write-protected pages
This is not normally noticeable, but repeated forks are unnecessarily
expensive because they repeatedly dirty the parent page tables during
the page table copy operation.

It's trivial to just avoid write protecting the page table entry if it
was already not writable.

This patch was inspired by

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

which points to an ancient "waste time re-doing fork" issue in the
presence of lots of signals.

That bug was fixed by Eric Biederman's signal handling series
culminating in commit c3ad2c3b02 ("signal: Don't restart fork when
signals come in"), but the unnecessary work for repeated forks is still
work just fixing, particularly since the fix is trivial.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-25 13:15:03 -07:00
Linus Torvalds
33e17876ea Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of MM

 - various misc fixes and tweaks

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  mm: Change return type int to vm_fault_t for fault handlers
  lib/fonts: convert comments to utf-8
  s390: ebcdic: convert comments to UTF-8
  treewide: convert ISO_8859-1 text comments to utf-8
  drivers/gpu/drm/gma500/: change return type to vm_fault_t
  docs/core-api: mm-api: add section about GFP flags
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: split memory management API to a separate file
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: kill trailing whitespace in kernel-api.rst
  mm/util: add kernel-doc for kvfree
  mm/util: make strndup_user description a kernel-doc comment
  fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
  treewide: correct "differenciate" and "instanciate" typos
  fs/afs: use new return type vm_fault_t
  drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
  mm: soft-offline: close the race against page allocation
  mm: fix race on soft-offlining free huge pages
  namei: allow restricted O_CREAT of FIFOs and regular files
  hfs: prevent crash on exit from failed search
  ...
2018-08-23 19:20:12 -07:00
Souptick Joarder
2b74030354 mm: Change return type int to vm_fault_t for fault handlers
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type.  As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.

The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.

vmf_error() is the newly introduce inline function in 4.17-rc6.

[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:44 -07:00
Mike Rapoport
ff4dc77293 mm/util: add kernel-doc for kvfree
Link: http://lkml.kernel.org/r/1532626360-16650-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Mike Rapoport
b86181f1ad mm/util: make strndup_user description a kernel-doc comment
Patch series "memory management documentation updates", v3.

Here are several updates to the mm documentation.

Aside from really minor changes in the first three patches, the updates
are:

* move the documentation of kstrdup and friends to "String Manipulation"
  section
* split memory management API into a separate .rst file
* adjust formating of the GFP flags description and include it in the
  reference documentation.

This patch (of 7):

The description of the strndup_user function misses '*' character at the
beginning of the comment to be proper kernel-doc.  Add the missing
character.

Link: http://lkml.kernel.org/r/1532626360-16650-2-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Naoya Horiguchi
d4ae9916ea mm: soft-offline: close the race against page allocation
A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
allocate a page that was just freed on the way of soft-offline.  This is
undesirable because soft-offline (which is about corrected error) is
less aggressive than hard-offline (which is about uncorrected error),
and we can make soft-offline fail and keep using the page for good
reason like "system is busy."

Two main changes of this patch are:

- setting migrate type of the target page to MIGRATE_ISOLATE. As done
  in free_unref_page_commit(), this makes kernel bypass pcplist when
  freeing the page. So we can assume that the page is in freelist just
  after put_page() returns,

- setting PG_hwpoison on free page under zone->lock which protects
  freelists, so this allows us to avoid setting PG_hwpoison on a page
  that is decided to be allocated soon.

[akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Naoya Horiguchi
6bc9b56433 mm: fix race on soft-offlining free huge pages
Patch series "mm: soft-offline: fix race against page allocation".

Xishi recently reported the issue about race on reusing the target pages
of soft offlining.  Discussion and analysis showed that we need make
sure that setting PG_hwpoison should be done in the right place under
zone->lock for soft offline.  1/2 handles free hugepage's case, and 2/2
hanldes free buddy page's case.

This patch (of 2):

There's a race condition between soft offline and hugetlb_fault which
causes unexpected process killing and/or hugetlb allocation failure.

The process killing is caused by the following flow:

  CPU 0               CPU 1              CPU 2

  soft offline
    get_any_page
    // find the hugetlb is free
                      mmap a hugetlb file
                      page fault
                        ...
                          hugetlb_fault
                            hugetlb_no_page
                              alloc_huge_page
                              // succeed
      soft_offline_free_page
      // set hwpoison flag
                                         mmap the hugetlb file
                                         page fault
                                           ...
                                             hugetlb_fault
                                               hugetlb_no_page
                                                 find_lock_page
                                                   return VM_FAULT_HWPOISON
                                           mm_fault_error
                                             do_sigbus
                                             // kill the process

The hugetlb allocation failure comes from the following flow:

  CPU 0                          CPU 1

                                 mmap a hugetlb file
                                 // reserve all free page but don't fault-in
  soft offline
    get_any_page
    // find the hugetlb is free
      soft_offline_free_page
      // set hwpoison flag
        dissolve_free_huge_page
        // fail because all free hugepages are reserved
                                 page fault
                                   ...
                                     hugetlb_fault
                                       hugetlb_no_page
                                         alloc_huge_page
                                           ...
                                             dequeue_huge_page_node_exact
                                             // ignore hwpoisoned hugepage
                                             // and finally fail due to no-mem

The root cause of this is that current soft-offline code is written based
on an assumption that PageHWPoison flag should be set at first to avoid
accessing the corrupted data.  This makes sense for memory_failure() or
hard offline, but does not for soft offline because soft offline is about
corrected (not uncorrected) error and is safe from data lost.  This patch
changes soft offline semantics where it sets PageHWPoison flag only after
containment of the error page completes successfully.

Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Nicholas Piggin
fd1102f0aa mm: mmu_notifier fix for tlb_end_vma
The generic tlb_end_vma does not call invalidate_range mmu notifier, and
it resets resets the mmu_gather range, which means the notifier won't be
called on part of the range in case of an unmap that spans multiple
vmas.

ARM64 seems to be the only arch I could see that has notifiers and uses
the generic tlb_end_vma.  I have not actually tested it.

[ Catalin and Will point out that ARM64 currently only uses the
  notifiers for KVM, which doesn't use the ->invalidate_range()
  callback right now, so it's a bug, but one that happens to
  not affect them.  So not necessary for stable.  - Linus ]

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 11:55:58 -07:00
Peter Zijlstra
d86564a2f0 mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
Jann reported that x86 was missing required TLB invalidates when he
hit the !*batch slow path in tlb_remove_table().

This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
invalidates, the PowerPC-hash where this code originated and the
Sparc-hash where this was subsequently used did not need that. ARM
which later used this put an explicit TLB invalidate in their
__p*_free_tlb() functions, and PowerPC-radix followed that example.

But when we hooked up x86 we failed to consider this. Fix this by
(optionally) hooking tlb_remove_table() into the TLB invalidate code.

NOTE: s390 was also needing something like this and might now
      be able to use the generic code again.

[ Modified to be on top of Nick's cleanups, which simplified this patch
  now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]

Fixes: 9e52fc2b50 ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 11:55:58 -07:00
Peter Zijlstra
a6f572084f mm/tlb: Remove tlb_remove_table() non-concurrent condition
Will noted that only checking mm_users is incorrect; we should also
check mm_count in order to cover CPUs that have a lazy reference to
this mm (and could do speculative TLB operations).

If removing this turns out to be a performance issue, we can
re-instate a more complete check, but in tlb_table_flush() eliding the
call_rcu_sched().

Fixes: 2672391169 ("mm, powerpc: move the RCU page-table freeing into generic code")
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 11:55:58 -07:00
Nicholas Piggin
db7ddef301 mm: move tlb_table_flush to tlb_flush_mmu_free
There is no need to call this from tlb_flush_mmu_tlbonly, it logically
belongs with tlb_flush_mmu_free.  This makes future fixes simpler.

[ This was originally done to allow code consolidation for the
  mmu_notifier fix, but it also ends up helping simplify the
  HAVE_RCU_TABLE_INVALIDATE fix.    - Linus ]

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 11:53:24 -07:00
Peter Zijlstra
52a288c736 x86/mm/tlb: Revert the recent lazy TLB patches
Revert commits:

  95b0e6357d x86/mm/tlb: Always use lazy TLB mode
  64482aafe5 x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
  ac03158969 x86/mm/tlb: Make lazy TLB mode lazier
  61d0beb579 x86/mm/tlb: Restructure switch_mm_irqs_off()
  2ff6ddf19c x86/mm/tlb: Leave lazy TLB mode at page table free time

In order to simplify the TLB invalidate fixes for x86 and unify the
parts that need backporting.  We'll try again later.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 18:22:04 -07:00
Nick Desaulniers
815f0ddb34 include/linux/compiler*.h: make compiler-*.h mutually exclusive
Commit cafa0010cd ("Raise the minimum required gcc version to 4.6")
recently exposed a brittle part of the build for supporting non-gcc
compilers.

Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and
__GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't
added compiler specific checks for __clang__ or __INTEL_COMPILER.

This is brittle, as they happened to get compatibility by posing as a
certain version of GCC.  This broke when upgrading the minimal version
of GCC required to build the kernel, to a version above what ICC and
Clang claim to be.

Rather than always including compiler-gcc.h then undefining or
redefining macros in compiler-intel.h or compiler-clang.h, let's
separate out the compiler specific macro definitions into mutually
exclusive headers, do more proper compiler detection, and keep shared
definitions in compiler_types.h.

Fixes: cafa0010cd ("Raise the minimum required gcc version to 4.6")
Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Eli Friedman <efriedma@codeaurora.org>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 17:31:34 -07:00
Linus Torvalds
cd9b44f907 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - procfs updates

 - various misc things

 - more y2038 fixes

 - get_maintainer updates

 - lib/ updates

 - checkpatch updates

 - various epoll updates

 - autofs updates

 - hfsplus

 - some reiserfs work

 - fatfs updates

 - signal.c cleanups

 - ipc/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
  ipc/util.c: update return value of ipc_getref from int to bool
  ipc/util.c: further variable name cleanups
  ipc: simplify ipc initialization
  ipc: get rid of ids->tables_initialized hack
  lib/rhashtable: guarantee initial hashtable allocation
  lib/rhashtable: simplify bucket_table_alloc()
  ipc: drop ipc_lock()
  ipc/util.c: correct comment in ipc_obtain_object_check
  ipc: rename ipcctl_pre_down_nolock()
  ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
  ipc: reorganize initialization of kern_ipc_perm.seq
  ipc: compute kern_ipc_perm.id under the ipc lock
  init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
  fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
  adfs: use timespec64 for time conversion
  kernel/sysctl.c: fix typos in comments
  drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
  fork: don't copy inconsistent signal handler state to child
  signal: make get_signal() return bool
  signal: make sigkill_pending() return bool
  ...
2018-08-22 12:34:08 -07:00
Anna-Maria Gleixner
060288a732 bdi: use irqsave variant of refcount_dec_and_lock()
The irqsave variant of refcount_dec_and_lock handles irqsave/restore when
taking/releasing the spin lock.  With this variant the call of
local_irq_save/restore is no longer required.

[bigeasy@linutronix.de: s@atomic_dec_and_lock@refcount_dec_and_lock@g]
Link: http://lkml.kernel.org/r/20180703200141.28415-5-bigeasy@linutronix.de
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Sebastian Andrzej Siewior
e58dd0de5e bdi: use refcount_t for reference counting instead atomic_t
refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter.  This permits avoiding
accidental refcounter overflows that might lead to use-after-free
situations.

Link: http://lkml.kernel.org/r/20180703200141.28415-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Dennis Zhou (Facebook)
7e8a6304d5 /proc/meminfo: add percpu populated pages count
Currently, percpu memory only exposes allocation and utilization
information via debugfs.  This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters.  This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory.  Let's make it easier for someone to
identify how much memory is being used.

This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use.  This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level.  Metadata is excluded.  I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata.  It also makes this calculation light.

Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Roman Gushchin
3d8b38eb81 mm, oom: introduce memory.oom.group
For some workloads an intervention from the OOM killer can be painful.
Killing a random task can bring the workload into an inconsistent state.

Historically, there are two common solutions for this
problem:
1) enabling panic_on_oom,
2) using a userspace daemon to monitor OOMs and kill
   all outstanding processes.

Both approaches have their downsides: rebooting on each OOM is an obvious
waste of capacity, and handling all in userspace is tricky and requires a
userspace agent, which will monitor all cgroups for OOMs.

In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
the necessity of enabling panic_on_oom.  Also, it can simplify the cgroup
management for userspace applications.

This commit introduces a new knob for cgroup v2 memory controller:
memory.oom.group.  The knob determines whether the cgroup should be
treated as an indivisible workload by the OOM killer.  If set, all tasks
belonging to the cgroup or to its descendants (if the memory cgroup is not
a leaf cgroup) are killed together or not at all.

To determine which cgroup has to be killed, we do traverse the cgroup
hierarchy from the victim task's cgroup up to the OOMing cgroup (or root)
and looking for the highest-level cgroup with memory.oom.group set.

Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
an exception and are never killed.

This patch doesn't change the OOM victim selection algorithm.

Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Roman Gushchin
5989ad7b5e mm, oom: refactor oom_kill_process()
Patch series "introduce memory.oom.group", v2.

This is a tiny implementation of cgroup-aware OOM killer, which adds an
ability to kill a cgroup as a single unit and so guarantee the integrity
of the workload.

Although it has only a limited functionality in comparison to what now
resides in the mm tree (it doesn't change the victim task selection
algorithm, doesn't look at memory stas on cgroup level, etc), it's also
much simpler and more straightforward.  So, hopefully, we can avoid having
long debates here, as we had with the full implementation.

As it doesn't prevent any futher development, and implements an useful and
complete feature, it looks as a sane way forward.

This patch (of 2):

oom_kill_process() consists of two logical parts: the first one is
responsible for considering task's children as a potential victim and
printing the debug information.  The second half is responsible for
sending SIGKILL to all tasks sharing the mm struct with the given victim.

This commit splits oom_kill_process() with an intention to re-use the the
second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks belonging to the
victim cgroup.  We don't need to print the debug information for the each
task, as well as play with task selection (considering task's children),
so we can't use the existing oom_kill_process().

Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com
Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Oscar Salvador
03e85f9d5f mm/page_alloc: Introduce free_area_init_core_hotplug
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core().  But there is
some code that we do not really need to run when we are coming from such
path.

free_area_init_core() performs the following actions:

1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
   when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
   memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone

When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.

Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages.  For the same reason, action #3 results
always in manages_pages being 0.

Action #5 and #6 are performed later on when onlining the pages:
 online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
 online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

This patch does two things:

First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:

1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data

These two functions are: pgdat_init_internals() and zone_init_internals().

The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():

Currently, we call free_area_init_node() from the memhotplug path.  In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.

Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.

Actually, we re-set these values to 0 later on with the calls to:

reset_node_managed_pages()
reset_node_present_pages()

The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:

online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.

[osalvador@techadventures.net: fix section usage]
  Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
  Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Oscar Salvador
0188dc98ad mm/page_alloc: inline function to handle CONFIG_DEFERRED_STRUCT_PAGE_INIT
Let us move the code between CONFIG_DEFERRED_STRUCT_PAGE_INIT to an inline
function.  Not having an ifdef in the function makes the code more
readable.

Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Pavel Tatashin
7cc2a9596d mm: remove __paginginit
__paginginit is the same thing as __meminit except for platforms without
sparsemem, there it is defined as __init.

Remove __paginginit and use __meminit.  Use __ref in one single function
that merges __meminit and __init sections: setup_usemap().

Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Pavel Tatashin
c1093b746c mm: access zone->node via zone_to_nid() and zone_set_nid()
zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
have inline functions to access this field in order to avoid ifdef's in c
files.

Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Oscar Salvador
ace1db3976 mm/page_alloc.c: move ifdefery out of free_area_init_core
Patch series "Refactor free_area_init_core and add
free_area_init_core_hotplug", v6.

This patchset does three things:

 1) Clean up/refactor free_area_init_core/free_area_init_node
    by moving the ifdefery out of the functions.
 2) Move the pgdat/zone initialization in free_area_init_core to its
    own function.
 3) Introduce free_area_init_core_hotplug, a small subset of
    free_area_init_core, which is only called from memhotlug code path. In this
    way, we have:

    free_area_init_core: called during early initialization
    free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)

This patch (of 5):

Moving the #ifdefs out of the function makes it easier to follow.

Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Shakeel Butt
8de7ecc648 memcg: reduce memcg tree traversals for stats collection
Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
to collect the stats while cgroup-v2's memory_stat_show traverses the
memcg tree thrice.  On a large machine, a couple thousand memcgs is very
normal and if the churn is high and memcgs stick around during to several
reasons, tens of thousands of nodes in memcg tree can exist.  This patch
has refactored and shared the stat collection code between cgroup-v1 and
cgroup-v2 and has reduced the tree traversal to just one.

I ran a simple benchmark which reads the root_mem_cgroup's stat file
1000 times in the presense of 2500 memcgs on cgroup-v1. The results are:

Without the patch:
$ time ./read-root-stat-1000-times

real    0m1.663s
user    0m0.000s
sys     0m1.660s

With the patch:
$ time ./read-root-stat-1000-times

real    0m0.468s
user    0m0.000s
sys     0m0.467s

Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Bruce Merry <bmerry@ska.ac.za>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Jiang Biao
1c4c3b99c0 mm: fix page_freeze_refs and page_unfreeze_refs in comments
page_freeze_refs/page_unfreeze_refs have already been relplaced by
page_ref_freeze/page_ref_unfreeze , but they are not modified in the
comments.

Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Kees Cook
8c9a134cae mm: clarify CONFIG_PAGE_POISONING and usage
The Kconfig text for CONFIG_PAGE_POISONING doesn't mention that it has to
be enabled explicitly.  This updates the documentation for that and adds a
note about CONFIG_PAGE_POISONING to the "page_poison" command line docs.
While here, change description of CONFIG_PAGE_POISONING_ZERO too, as it's
not "random" data, but rather the fixed debugging value that would be used
when not zeroing.  Additionally removes a stray "bool" in the Kconfig.

Link: http://lkml.kernel.org/r/20180725223832.GA43733@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Andrew Morton
a670468f5e mm: zero out the vma in vma_init()
Rather than in vm_area_alloc().  To ensure that the various oddball
stack-based vmas are in a good state.  Some of the callers were zeroing
them out, others were not.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Mike Rapoport
a3bf6ce366 mm/mempool.c: add missing parameter description
The kernel-doc for mempool_init function is missing the description of the
pool parameter.  Add it.

Link: http://lkml.kernel.org/r/1532336274-26228-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Michal Hocko
431f42fdfd mm/oom_kill.c: clean up oom_reap_task_mm()
Andrew has noticed some inconsistencies in oom_reap_task_mm.  Notably

 - Undocumented return value.

 - comment "failed to reap part..." is misleading - sounds like it's
   referring to something which happened in the past, is in fact
   referring to something which might happen in the future.

 - fails to call trace_finish_task_reaping() in one case

 - code duplication.

 - Increases mmap_sem hold time a little by moving
   trace_finish_task_reaping() inside the locked region.  So sue me ;)

 - Sharing the finish: path means that the trace event won't
   distinguish between the two sources of finishing.

Add a short explanation for the return value and fix the rest by
reorganizing the function a bit to have unified function exit paths.

Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Rodrigo Freire
c3b78b11ef mm, oom: describe task memory unit, larger PID pad
The default page memory unit of OOM task dump events might not be
intuitive and potentially misleading for the non-initiated when debugging
OOM events: These are pages and not kBs.  Add a small printk prior to the
task dump informing that the memory units are actually memory _pages_.

Also extends PID field to align on up to 7 characters.
Reference https://lkml.org/lkml/2018/7/3/1201

Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com
Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Michal Hocko
af5679fbc6 mm, oom: remove oom_lock from oom_reaper
oom_reaper used to rely on the oom_lock since e2fe14564d ("oom_reaper:
close race with exiting task").  We do not really need the lock anymore
though.  2129258024 ("mm: oom: let oom_reap_task and exit_mmap run
concurrently") has removed serialization with the exit path based on the
mm reference count and so we do not really rely on the oom_lock anymore.

Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
to prevent from races when the page allocator didn't manage to get the
freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
on and move on to another victim.  Although this is possible in principle
let's wait for it to actually happen in real life before we make the
locking more complex again.

Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
oom_reap_task_mm).  The reaper serializes with exit_mmap by mmap_sem +
MMF_OOM_SKIP flag.  There is no synchronization with out_of_memory path
now.

[mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
  Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Michal Hocko
93065ac753 mm, oom: distinguish blockable mode for mmu notifiers
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep.  That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.

We can do much better though.  Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.  Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range.  Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.

This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false.  This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.

I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that.  The first
part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode.  A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom.  This can be done e.g.  after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small.  Then we are looking for a proper process tear down.

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
c2343d2761 mm/swapfile.c: put_swap_page: share more between huge/normal code path
In this patch, locking related code is shared between huge/normal code
path in put_swap_page() to reduce code duplication. The `free_entries == 0`
case is merged into the more general `free_entries != SWAPFILE_CLUSTER`
case, because the new locking method makes it easy.

The added lines is same as the removed lines.  But the code size is
increased when CONFIG_TRANSPARENT_HUGEPAGE=n.

		text	   data	    bss	    dec	    hex	filename
base:	       24123	   2004	    340	  26467	   6763	mm/swapfile.o
unified:       24485	   2004	    340	  26829	   68cd	mm/swapfile.o

Dig on step deeper with `size -A mm/swapfile.o` for base and unified
kernel and compare the result, yields,

  -.text                                17723      0
  +.text                                17835      0
  -.orc_unwind_ip                        1380      0
  +.orc_unwind_ip                        1480      0
  -.orc_unwind                           2070      0
  +.orc_unwind                           2220      0
  -Total                                26686
  +Total                                27048

The total difference is the same.  The text segment difference is much
smaller: 112.  More difference comes from the ORC unwinder segments:
(1480 + 2220) - (1380 + 2070) = 250.  If the frame pointer unwinder is
used, this costs nothing.

Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
b32d5f32b9 mm/swapfile.c: add __swap_entry_free_locked()
The part of __swap_entry_free() with lock held is separated into a new
function __swap_entry_free_locked().  Because we want to reuse that
piece of code in some other places.

Just mechanical code refactoring, there is no any functional change in
this function.

Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
5d5e8f1954 mm, swap, get_swap_pages: use entry_size instead of cluster in parameter
As suggested by Matthew Wilcox, it is better to use "int entry_size"
instead of "bool cluster" as parameter to specify whether to operate for
huge or normal swap entries.  Because this improve the flexibility to
support other swap entry size.  And Dave Hansen thinks that this
improves code readability too.

So in this patch, the "bool cluster" parameter of get_swap_pages() is
replaced by "int entry_size".

And nr_swap_entries() trick is used to reduce the binary size when
!CONFIG_TRANSPARENT_HUGE_PAGE.

       text	   data	    bss	    dec	    hex	filename
base  24215	   2028	    340	  26583	   67d7	mm/swapfile.o
head  24123	   2004	    340	  26467	   6763	mm/swapfile.o

Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
a448f2d07f mm/swapfile.c: unify normal/huge code path in put_swap_page()
In this patch, the normal/huge code path in put_swap_page() and several
helper functions are unified to avoid duplicated code, bugs, etc.  and
make it easier to review the code.

The removed lines are more than added lines.  And the binary size is
kept exactly same when CONFIG_TRANSPARENT_HUGEPAGE=n.

Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
33ee011e56 mm/swapfile.c: unify normal/huge code path in swap_page_trans_huge_swapped()
As suggested by Dave, we should unify the code path for normal and huge
swap support if possible to avoid duplicated code, bugs, etc.  and make
it easier to review code.

In this patch, the normal/huge code path in
swap_page_trans_huge_swapped() is unified, the added and removed lines
are same.  And the binary size is kept almost same when
CONFIG_TRANSPARENT_HUGEPAGE=n.

		 text	   data	    bss	    dec	    hex	filename
base:		24179	   2028	    340	  26547	   67b3	mm/swapfile.o
unified:	24215	   2028	    340	  26583	   67d7	mm/swapfile.o

Link: http://lkml.kernel.org/r/20180720071845.17920-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Huang Ying
afa4711ef1 mm/swapfile.c: use swap_count() in swap_page_trans_huge_swapped()
In swap_page_trans_huge_swapped(), to identify whether there's any page
table mapping for a 4k sized swap entry, "si->swap_map[i] !=
SWAP_HAS_CACHE" is used.  This works correctly now, because all users of
the function will only call it after checking SWAP_HAS_CACHE.  But as
pointed out by Daniel, it is better to use "swap_count(map[i])" here,
because it works for "map[i] == 0" case too.

And this makes the implementation more consistent between normal and
huge swap entry.

Link: http://lkml.kernel.org/r/20180720071845.17920-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Huang Ying
fe5266d5d5 mm/swapfile.c: replace some #ifdef with IS_ENABLED()
In mm/swapfile.c, THP (Transparent Huge Page) swap specific code is
enclosed by #ifdef CONFIG_THP_SWAP/#endif to avoid code dilating when
THP isn't enabled.  But #ifdef/#endif in .c file hurt the code
readability, so Dave suggested to use IS_ENABLED(CONFIG_THP_SWAP)
instead and let compiler to do the dirty job for us.  This has potential
to remove some duplicated code too.  From output of `size`,

		text	   data	    bss	    dec	    hex	filename
THP=y:         26269	   2076	    340	  28685	   700d	mm/swapfile.o
ifdef/endif:   24115	   2028	    340	  26483	   6773	mm/swapfile.o
IS_ENABLED:    24179	   2028	    340	  26547	   67b3	mm/swapfile.o

IS_ENABLED() based solution works quite well, almost as good as that of
#ifdef/#endif.  And from the diffstat, the removed lines are more than
added lines.

One #ifdef for split_swap_cluster() is kept.  Because it is a public
function with a stub implementation for CONFIG_THP_SWAP=n in swap.h.

Link: http://lkml.kernel.org/r/20180720071845.17920-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Huang Ying
59d98bf3c2 mm: swap: add comments to lock_cluster_or_swap_info()
Patch series "swap: THP optimizing refactoring", v4.

Now the THP (Transparent Huge Page) swap optimizing is implemented in the
way like below,

  #ifdef CONFIG_THP_SWAP
  huge_function(...)
  {
  }
  #else
  normal_function(...)
  {
  }
  #endif

  general_function(...)
  {
  	if (huge)
  		return thp_function(...);
	else
  		return normal_function(...);
  }

As pointed out by Dave Hansen, this will,

1. Create a new, wholly untested code path for huge page
2. Create two places to patch bugs
3. Are not reusing code when possible

This patchset is to address these problems via merging huge/normal code
path/functions if possible.

One concern is that this may cause code size to dilate when
!CONFIG_TRANSPARENT_HUGEPAGE.  The data shows that most refactoring will
only cause quite slight code size increase.

This patch (of 8):

To improve code readability.

Link: http://lkml.kernel.org/r/20180720071845.17920-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Kirill Tkhai
8df4a44cc4 mm: check shrinker is memcg-aware in register_shrinker_prepared()
There is a sad BUG introduced in patch adding SHRINKER_REGISTERING.
shrinker_idr business is only for memcg-aware shrinkers.  Only such type
of shrinkers have id and they must be finaly installed via idr_replace()
in this function.  For !memcg-aware shrinkers we never initialize
shrinker->id field.

But there are all types of shrinkers passed to idr_replace(), and every
!memcg-aware shrinker with random ID (most probably, its id is 0)
replaces memcg-aware shrinker pointed by the ID in IDR.

This patch fixes the problem.

Link: http://lkml.kernel.org/r/8ff8a793-8211-713a-4ed9-d6e52390c2fc@virtuozzo.com
Fixes: 7e010df53c "mm: use special value SHRINKER_REGISTERING instead of list_empty() check"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reported-by: <syzbot+d5f648a1bfe15678786b@syzkaller.appspotmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <syzkaller-bugs@googlegroups.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Linus Torvalds
0214f46b3a Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.

  This set of changes is split into several parts:

   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.

   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ...
2018-08-21 13:47:29 -07:00
Colin Ian King
1e92641929 mm/hmm.c: remove unused variables align_start and align_end
Variables align_start and align_end are being assigned but are never
used hence they are redundant and can be removed.

Cleans up clang warnings:
  warning: variable 'align_start' set but not used [-Wunused-but-set-variable]
  warning: variable 'align_size' set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:33 -07:00
David Rientjes
ddbf369c0a mm, vmacache: hash addresses based on pmd
When perf profiling a wide variety of different workloads, it was found
that vmacache_find() had higher than expected cost: up to 0.08% of cpu
utilization in some cases.  This was found to rival other core VM
functions such as alloc_pages_vma() with thp enabled and default
mempolicy, and the conditionals in __get_vma_policy().

VMACACHE_HASH() determines which of the four per-task_struct slots a vma
is cached for a particular address.  This currently depends on the pfn,
so pfn 5212 occupies a different vmacache slot than its neighboring pfn
5213.

vmacache_find() iterates through all four of current's vmacache slots
when looking up an address.  Hashing based on pfn, an address has
~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
about 25%, *if* the vma is cached.

This patch hashes an address by its pmd instead of pte to optimize for
workloads with good spatial locality.  This results in a higher
probability of vmas being cached in the first slot that is checked:
normally ~70% on the same workloads instead of 25%.

[rientjes@google.com: various updates]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807231532290.109445@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Sebastian Andrzej Siewior
6b51e88199 mm/list_lru: introduce list_lru_shrink_walk_irq()
Provide list_lru_shrink_walk_irq() and let it behave like
list_lru_walk_one() except that it locks the spinlock with
spin_lock_irq().  This is used by scan_shadow_nodes() because its lock
nests within the i_pages lock which is acquired with IRQ.  This change
allows to use proper locking promitives instead hand crafted
lock_irq_disable() plus spin_lock().

There is no EXPORT_SYMBOL provided because the current user is in-kernel
only.

Add list_lru_shrink_walk_irq() which acquires the spinlock with the
proper locking primitives.

Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Sebastian Andrzej Siewior
6e018968f8 mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
__list_lru_walk_one() is invoked with struct list_lru *lru, int nid as
the first two argument.  Those two are only used to retrieve struct
list_lru_node.  Since this is already done by the caller of the function
for the locking, we can pass struct list_lru_node* directly and avoid
the dance around it.

Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Sebastian Andrzej Siewior
6cfe57a96b mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
Move the locking inside __list_lru_walk_one() to its caller.  This is a
preparation step in order to introduce list_lru_walk_one_irq() which
does spin_lock_irq() instead of spin_lock() for the locking.

Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Sebastian Andrzej Siewior
87a5ffc163 mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user".

This series removes the local_irq_disable() around
list_lru_shrink_walk() (as used by mm/workingset) by adding
list_lru_shrink_walk_irq().

Vladimir Davydov preferred this over `irq' argument which I added to
struct list_lru.

The initial post (of this series) received a Reviewed-by tag by Vladimir
Davydov which I added to each patch of the series.  The series applies
on top of akpm's tree which has Kirill's shrink_slab series and does not
clash with it (akpm asked me to wait a week or so and repost it then).

I tested the code paths by triggering the OOM-killer via memory over
commit and lockdep did not complain (nor did I see any warnings).

This patch (of 4):

list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the
memcg_idx parameter.  The same can be achieved by list_lru_walk_one() and
passing NULL as memcg argument which then gets converted into -1.  This is
a preparation step when the spin_lock() function is lifted to the caller
of __list_lru_walk_one().  Invoke list_lru_walk_one() instead
__list_lru_walk_one() when possible.

Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Huang Ying
14fef28414 mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable
to optimize swapping for THP (Transparent Huge Page) without basic
swapping support.

In original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y,
split_swap_cluster() will not be built because it is in swapfile.c, but
it will be called in huge_memory.c.  This doesn't trigger a build error
in practice because the call site is enclosed by PageSwapCache(), which
is defined to be constant 0 when CONFIG_SWAP=n.  But this is fragile and
should be fixed.

The comments are fixed too to reflect the latest progress.

Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com
Fixes: 38d8b4e6bd ("mm, THP, swap: delay splitting THP during swap out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
2a3cb8baef mm/sparse: delete old sparse_init and enable new one
Rename new_sparse_init() to sparse_init() which enables it.  Delete old
sparse_init() and all the code that became obsolete with.

[pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
  Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
85c77f7913 mm/sparse: add new sparse_init_nid() and sparse_init()
sparse_init() requires to temporary allocate two large buffers: usemap_map
and map_map.  Baoquan He has identified that these buffers are so large
that Linux is not bootable on small memory machines, such as a kdump boot.
The buffers are especially large when CONFIG_X86_5LEVEL is set, as they
are scaled to the maximum physical memory size.

Baoquan provided a fix, which reduces these sizes of these buffers, but it
is much better to get rid of them entirely.

Add a new way to initialize sparse memory: sparse_init_nid(), which only
operates within one memory node, and thus allocates memory either in large
contiguous block or allocates section by section.  This eliminates the
need for use of temporary buffers.

For simplified bisecting and review temporarly call sparse_init()
new_sparse_init(), the new interface is going to be enabled as well as old
code removed in the next patch.

Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
afda57bc13 mm/sparse: move buffer init/fini to the common place
Now that both variants of sparse memory use the same buffers to populate
memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the
common place.

Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
e131c06b14 mm/sparse: use the new sparse buffer functions in non-vmemmap
non-vmemmap sparse also allocated large contiguous chunk of memory, and if
fails falls back to smaller allocations.  Use the same functions to
allocate buffer as the vmemmap-sparse

Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
35fd1eb1e8 mm/sparse: abstract sparse buffer allocations
Patch series "sparse_init rewrite", v6.

In sparse_init() we allocate two large buffers to temporary hold usemap
and memmap for the whole machine.  However, we can avoid doing that if
we changed sparse_init() to operated on per-node bases instead of doing
it on the whole machine beforehand.

As shown by Baoquan
  http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com

The buffers are large enough to cause machine stop to boot on small
memory systems.

Another benefit of these changes is that they also obsolete
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.

This patch (of 5):

When struct pages are allocated for sparse-vmemmap VA layout, we first try
to allocate one large buffer, and than if that fails allocate struct pages
for each section as we go.

The code that allocates buffer is uses global variables and is spread
across several call sites.

Cleanup the code by introducing three functions to handle the global
buffer:

sparse_buffer_init()	initialize the buffer
sparse_buffer_fini()	free the remaining part of the buffer
sparse_buffer_alloc()	alloc from the buffer, and if buffer is empty
return NULL

Define these functions in sparse.c instead of sparse-vmemmap.c because
later we will use them for non-vmemmap sparse allocations as well.

[akpm@linux-foundation.org: use PTR_ALIGN()]
[akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Cannon Matthews
330d6e489a mm/hugetlb.c: don't zero 1GiB bootmem pages
When using 1GiB pages during early boot, use the new
memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it.
Zeroing out hundreds or thousands of GiB in a single core memset() call
is very slow, and can make early boot last upwards of 20-30 minutes on
multi TiB machines.

The memory does not need to be zero'd as the hugetlb pages are always
zero'd on page fault.

Tested: Booted with ~3800 1G pages, and it booted successfully in
roughly the same amount of time as with 0, as opposed to the 25+ minutes
it would take before.

Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com
Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Aaron Lu
d8a759b570 mm, page_alloc: double zone's batchsize
To improve page allocator's performance for order-0 pages, each CPU has
a Per-CPU-Pageset(PCP) per zone.  Whenever an order-0 page is needed,
PCP will be checked first before asking pages from Buddy.  When PCP is
used up, a batch of pages will be fetched from Buddy to improve
performance and the size of batch can affect performance.

zone's batch size gets doubled last time by commit ba56e91c9401("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
then, CPU has envolved a lot and CPU's cache sizes also increased.

Dave Hansen is concerned the current batch size doesn't fit well with
modern hardware and suggested me to do two things: first, use a page
allocator intensive benchmark, e.g.  will-it-scale/page_fault1 to find
out how performance changes with different batch sizes on various
machines and then choose a new default batch size; second, see how this
new batch size work with other workloads.

In the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
In this phase's test data, two candidates: 63 and 127 are chosen.

In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests are tested to see how these two
candidates work with these workloads and decides a new default according
to their results.

Most test results are flat.  will-it-scale/page_fault2 process mode has
10%-18% performance increase on 4-sockets Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
drop.  Further analysis showed that, with a larger pcp->batch and thus
larger pcp->high(the relationship of pcp->high=6 * pcp->batch is
maintained in this patch), zone lock contention shifted to LRU add side
lock contention and that caused performance drop.  This performance drop
might be mitigated by others' work on optimizing LRU lock.

Another downside of increasing pcp->batch is, when PCP is used up and need
to fetch a batch of pages from Buddy, since batch is increased, that time
can be longer than before.  My understanding is, this doesn't affect
slowpath where direct reclaim and compaction dominates.  For fastpath,
throughput is a win(according to will-it-scale/page_fault1) but worst
latency can be larger now.

Overall, I think double the batch size from 31 to 63 is relatively safe
and provide good performance boost for high-core-count systems.

The two phase's test results are listed below(all tests are done with THP
disabled).

Phase one(will-it-scale/page_fault1) test results:

Skylake-EX: increased batch size has a good effect on zone->lock
contention, though LRU contention will rise at the same time and
limited the final performance increase.

batch   score     change   zone_contention   lru_contention   total_contention
 31   15345900    +0.00%       64%                 8%           72%
 53   17903847   +16.67%       32%                38%           70%
 63   17992886   +17.25%       24%                45%           69%
 73   18022825   +17.44%       10%                61%           71%
119   18023401   +17.45%        4%                66%           70%
127   18029012   +17.48%        3%                66%           69%
137   18036075   +17.53%        4%                66%           70%
165   18035964   +17.53%        2%                67%           69%
188   18101105   +17.95%        2%                67%           69%
223   18130951   +18.15%        2%                67%           69%
255   18118898   +18.07%        2%                67%           69%
267   18101559   +17.96%        2%                67%           69%
299   18160468   +18.34%        2%                68%           70%
320   18139845   +18.21%        2%                67%           69%
393   18160869   +18.34%        2%                68%           70%
424   18170999   +18.41%        2%                68%           70%
458   18144868   +18.24%        2%                68%           70%
467   18142366   +18.22%        2%                68%           70%
498   18154549   +18.30%        1%                68%           69%
511   18134525   +18.17%        1%                69%           70%

Broadwell-EX: similar pattern as Skylake-EX.

batch   score     change   zone_contention   lru_contention   total_contention
 31   16703983    +0.00%       67%                 7%           74%
 53   18195393    +8.93%       43%                28%           71%
 63   18288885    +9.49%       38%                33%           71%
 73   18344329    +9.82%       35%                37%           72%
119   18535529   +10.96%       24%                46%           70%
127   18513596   +10.83%       23%                48%           71%
137   18514327   +10.84%       23%                48%           71%
165   18511840   +10.82%       22%                49%           71%
188   18593478   +11.31%       17%                53%           70%
223   18601667   +11.36%       17%                52%           69%
255   18774825   +12.40%       12%                58%           70%
267   18754781   +12.28%        9%                60%           69%
299   18892265   +13.10%        7%                63%           70%
320   18873812   +12.99%        8%                62%           70%
393   18891174   +13.09%        6%                64%           70%
424   18975108   +13.60%        6%                64%           70%
458   18932364   +13.34%        8%                62%           70%
467   18960891   +13.51%        5%                65%           70%
498   18944526   +13.41%        5%                64%           69%
511   18960839   +13.51%        5%                64%           69%

Skylake-EP: although increased batch reduced zone->lock contention, but
the effect is not as good as EX: zone->lock contention is still as high as
20% with a very high batch value instead of 1% on Skylake-EX or 5% on
Broadwell-EX.  Also, total_contention actually decreased with a higher
batch but that doesn't translate to performance increase.

batch   score    change   zone_contention   lru_contention   total_contention
 31   9554867    +0.00%       66%                 3%           69%
 53   9855486    +3.15%       63%                 3%           66%
 63   9980145    +4.45%       62%                 4%           66%
 73   10092774   +5.63%       62%                 5%           67%
119   10310061   +7.90%       45%                19%           64%
127   10342019   +8.24%       42%                19%           61%
137   10358182   +8.41%       42%                21%           63%
165   10397060   +8.81%       37%                24%           61%
188   10341808   +8.24%       34%                26%           60%
223   10349135   +8.31%       31%                27%           58%
255   10327189   +8.08%       28%                29%           57%
267   10344204   +8.26%       27%                29%           56%
299   10325043   +8.06%       25%                30%           55%
320   10310325   +7.91%       25%                31%           56%
393   10293274   +7.73%       21%                31%           52%
424   10311099   +7.91%       21%                32%           53%
458   10321375   +8.02%       21%                32%           53%
467   10303881   +7.84%       21%                32%           53%
498   10332462   +8.14%       20%                33%           53%
511   10325016   +8.06%       20%                32%           52%

Broadwell-EP: zone->lock and lru lock had an agreement to make sure
performance doesn't increase and they successfully managed to keep total
contention at 70%.

batch   score    change   zone_contention   lru_contention   total_contention
 31   10121178   +0.00%       19%                50%           69%
 53   10142366   +0.21%        6%                63%           69%
 63   10117984   -0.03%       11%                58%           69%
 73   10123330   +0.02%        7%                63%           70%
119   10108791   -0.12%        2%                67%           69%
127   10166074   +0.44%        3%                66%           69%
137   10141574   +0.20%        3%                66%           69%
165   10154499   +0.33%        2%                68%           70%
188   10124921   +0.04%        2%                67%           69%
223   10137399   +0.16%        2%                67%           69%
255   10143289   +0.22%        0%                68%           68%
267   10123535   +0.02%        1%                68%           69%
299   10140952   +0.20%        0%                68%           68%
320   10163170   +0.41%        0%                68%           68%
393   10000633   -1.19%        0%                69%           69%
424   10087998   -0.33%        0%                69%           69%
458   10187116   +0.65%        0%                69%           69%
467   10146790   +0.25%        0%                69%           69%
498   10197958   +0.76%        0%                69%           69%
511   10152326   +0.31%        0%                69%           69%

Haswell-EP: similar to Broadwell-EP.

batch   score   change   zone_contention   lru_contention   total_contention
 31   10442205   +0.00%       14%                48%           62%
 53   10442255   +0.00%        5%                57%           62%
 63   10452059   +0.09%        6%                57%           63%
 73   10482349   +0.38%        5%                59%           64%
119   10454644   +0.12%        3%                60%           63%
127   10431514   -0.10%        3%                59%           62%
137   10423785   -0.18%        3%                60%           63%
165   10481216   +0.37%        2%                61%           63%
188   10448755   +0.06%        2%                61%           63%
223   10467144   +0.24%        2%                61%           63%
255   10480215   +0.36%        2%                61%           63%
267   10484279   +0.40%        2%                61%           63%
299   10466450   +0.23%        2%                61%           63%
320   10452578   +0.10%        2%                61%           63%
393   10499678   +0.55%        1%                62%           63%
424   10481454   +0.38%        1%                62%           63%
458   10473562   +0.30%        1%                62%           63%
467   10484269   +0.40%        0%                62%           62%
498   10505599   +0.61%        0%                62%           62%
511   10483395   +0.39%        0%                62%           62%

Westmere-EP: contention is pretty small so not interesting.  Note too high
a batch value could hurt performance.

batch   score   change   zone_contention   lru_contention   total_contention
 31   4831523   +0.00%        2%                 3%            5%
 53   4834086   +0.05%        2%                 4%            6%
 63   4834262   +0.06%        2%                 3%            5%
 73   4832851   +0.03%        2%                 4%            6%
119   4830534   -0.02%        1%                 3%            4%
127   4827461   -0.08%        1%                 4%            5%
137   4827459   -0.08%        1%                 3%            4%
165   4820534   -0.23%        0%                 4%            4%
188   4817947   -0.28%        0%                 3%            3%
223   4809671   -0.45%        0%                 3%            3%
255   4802463   -0.60%        0%                 4%            4%
267   4801634   -0.62%        0%                 3%            3%
299   4798047   -0.69%        0%                 3%            3%
320   4793084   -0.80%        0%                 3%            3%
393   4785877   -0.94%        0%                 3%            3%
424   4782911   -1.01%        0%                 3%            3%
458   4779346   -1.08%        0%                 3%            3%
467   4780306   -1.06%        0%                 3%            3%
498   4780589   -1.05%        0%                 3%            3%
511   4773724   -1.20%        0%                 3%            3%

Skylake-Desktop: similar to Westmere-EP, nothing interesting.

batch   score   change   zone_contention   lru_contention   total_contention
 31   3906608   +0.00%        2%                 3%            5%
 53   3940164   +0.86%        2%                 3%            5%
 63   3937289   +0.79%        2%                 3%            5%
 73   3940201   +0.86%        2%                 3%            5%
119   3933240   +0.68%        2%                 3%            5%
127   3930514   +0.61%        2%                 4%            6%
137   3938639   +0.82%        0%                 3%            3%
165   3908755   +0.05%        0%                 3%            3%
188   3905621   -0.03%        0%                 3%            3%
223   3903015   -0.09%        0%                 4%            4%
255   3889480   -0.44%        0%                 3%            3%
267   3891669   -0.38%        0%                 4%            4%
299   3898728   -0.20%        0%                 4%            4%
320   3894547   -0.31%        0%                 4%            4%
393   3875137   -0.81%        0%                 4%            4%
424   3874521   -0.82%        0%                 3%            3%
458   3880432   -0.67%        0%                 4%            4%
467   3888715   -0.46%        0%                 3%            3%
498   3888633   -0.46%        0%                 4%            4%
511   3875305   -0.80%        0%                 5%            5%

Haswell-Desktop: zone->lock is pretty low as other desktops, though lru
contention is higher than other desktops.

batch   score   change   zone_contention   lru_contention   total_contention
 31   3511158   +0.00%        2%                 5%            7%
 53   3555445   +1.26%        2%                 6%            8%
 63   3561082   +1.42%        2%                 6%            8%
 73   3547218   +1.03%        2%                 6%            8%
119   3571319   +1.71%        1%                 7%            8%
127   3549375   +1.09%        0%                 6%            6%
137   3560233   +1.40%        0%                 6%            6%
165   3555176   +1.25%        2%                 6%            8%
188   3551501   +1.15%        0%                 8%            8%
223   3531462   +0.58%        0%                 7%            7%
255   3570400   +1.69%        0%                 7%            7%
267   3532235   +0.60%        1%                 8%            9%
299   3562326   +1.46%        0%                 6%            6%
320   3553569   +1.21%        0%                 8%            8%
393   3539519   +0.81%        0%                 7%            7%
424   3549271   +1.09%        0%                 8%            8%
458   3528885   +0.50%        0%                 8%            8%
467   3526554   +0.44%        0%                 7%            7%
498   3525302   +0.40%        0%                 9%            9%
511   3527556   +0.47%        0%                 8%            8%

Sandybridge-Desktop: the 0% contention isn't accurate but caused by
dropped fractional part. Since multiple contention path's contentions
are all under 1% here, with some arithmetic operations like add, the
final deviation could be as large as 3%.

batch   score   change   zone_contention   lru_contention   total_contention
 31   1744495   +0.00%        0%                 0%            0%
 53   1755341   +0.62%        0%                 0%            0%
 63   1758469   +0.80%        0%                 0%            0%
 73   1759626   +0.87%        0%                 0%            0%
119   1770417   +1.49%        0%                 0%            0%
127   1768252   +1.36%        0%                 0%            0%
137   1767848   +1.34%        0%                 0%            0%
165   1765088   +1.18%        0%                 0%            0%
188   1766918   +1.29%        0%                 0%            0%
223   1767866   +1.34%        0%                 0%            0%
255   1768074   +1.35%        0%                 0%            0%
267   1763187   +1.07%        0%                 0%            0%
299   1765620   +1.21%        0%                 0%            0%
320   1767603   +1.32%        0%                 0%            0%
393   1764612   +1.15%        0%                 0%            0%
424   1758476   +0.80%        0%                 0%            0%
458   1758593   +0.81%        0%                 0%            0%
467   1757915   +0.77%        0%                 0%            0%
498   1753363   +0.51%        0%                 0%            0%
511   1755548   +0.63%        0%                 0%            0%

Phase two test results:
Note: all percent change is against base(batch=31).

ebizzy.throughput (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
lkp-sb02          98937          98937    +0.0%       99030 +0.1%

kbuild.buildtime (less is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%

netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
lk-bdw-ep2       365764        365933  +0.0%        365951 +0.1%
lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
lkp-sb02          29990         30137  +0.5%         30103 +0.4%

oltp.transactions (higer is better)

machine         batch=31      batch=63             batch=127
lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%

pigz.throughput (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%

will-it-scale.malloc1.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
lkp-sb02          589286       589284 -0.0%         588101 -0.2%

will-it-scale.malloc1.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%

will-it-scale.malloc2.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%

will-it-scale.malloc2.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%

will-it-scale.page_fault2.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
lkp-sb02          842616        837844  -0.6%         833921  -1.0%

will-it-scale.page_fault2.threads

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
lkp-sb02          750626        748615    -0.3%      746621    -0.5%

will-it-scale.page_fault3.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%

will-it-scale.page_fault3.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%

will-it-scale.read1.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
lkp-skl-d01     10969487      10972650     +0.0%    no data
lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%

will-it-scale.read1.threads (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%

will-it-scale.write1.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%

will-it-scale.write1.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
lkp-skl-d01       8346174       8235695     -1.3%     no data
lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%

vm-scalability.anon-r-rand.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
lkp-sb02          570572        567854     -0.5%     568449     -0.4%

vm-scalability.anon-r-rand-mt.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
lkp-sb02          567137        567346     +0.0%     566144     -0.2%

vm-scalability.lru-file-mmap-read.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%

vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
lkp-sb02          258939        258787  -0.1%        258199  -0.3%

Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Michal Hocko
a195d3f5b7 mm/oom_kill.c: document oom_lock
Add comments describing oom_lock's scope.

Requested-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Mike Kravetz
40d18ebffb mm/hugetlb: remove gigantic page support for HIGHMEM
This reverts ee8f248d26 ("hugetlb: add phys addr to struct
huge_bootmem_page").

At one time powerpc used this field and supporting code.  However that
was removed with commit 79cc38ded1 ("powerpc/mm/hugetlb: Add support
for reserving gigantic huge pages via kernel command line").

There are no users of this field and supporting code, so remove it.

Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Michal Hocko
9bfe5ded05 mm, oom: remove sleep from under oom_lock
Tetsuo has pointed out that since 27ae357fa8 ("mm, oom: fix concurrent
munlock and oom reaper unmap, v3") we have a strong synchronization
between the oom_killer and victim's exiting because both have to take
the oom_lock.  Therefore the original heuristic to sleep for a short
time in out_of_memory doesn't serve the original purpose.

Moreover Tetsuo has noticed that the short sleep can be more harmful
than actually useful.  Hammering the system with many processes can lead
to a starvation when the task holding the oom_lock can block for a long
time (minutes) and block any further progress because the oom_reaper
depends on the oom_lock as well.

Drop the short sleep from out_of_memory when we hold the lock.  Keep the
sleep when the trylock fails to throttle the concurrent OOM paths a bit.
This should be solved in a more reasonable way (e.g.  sleep proportional
to the time spent in the active reclaiming etc.) but this is much more
complex thing to achieve.  This is a quick fixup to remove a stale code.

Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Marek Szyprowski
6518202970 mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so
convert gfp_mask parameter to boolean no_warn parameter.

This will help to avoid giving false feeling that this function supports
standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer,
what has already been an issue: see commit dd65a941f6 ("arm64:
dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").

Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michał Nazarewicz <mina86@mina86.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Rik van Riel
50c150f262 Revert "mm: always flush VMA ranges affected by zap_page_range"
There was a bug in Linux that could cause madvise (and mprotect?) system
calls to return to userspace without the TLB having been flushed for all
the pages involved.

This could happen when multiple threads of a process made simultaneous
madvise and/or mprotect calls.

This was noticed in the summer of 2017, at which time two solutions
were created:

  56236a5955 ("mm: refactor TLB gathering API")
  99baac21e4 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
and
  4647706ebe ("mm: always flush VMA ranges affected by zap_page_range")

We need only one of these solutions, and the former appears to be a
little more efficient than the latter, so revert that one.

This reverts 4647706ebe ("mm: always flush VMA ranges affected by
zap_page_range")

Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He
c98aff6493 mm/sparse: optimize memmap allocation during sparse_init()
In sparse_init(), two temporary pointer arrays, usemap_map and map_map
are allocated with the size of NR_MEM_SECTIONS.  They are used to store
each memory section's usemap and mem map if marked as present.  With the
help of these two arrays, continuous memory chunk is allocated for
usemap and memmap for memory sections on one node.  This avoids too many
memory fragmentations.  Like below diagram, '1' indicates the present
memory section, '0' means absent one.  The number 'n' could be much
smaller than NR_MEM_SECTIONS on most of systems.

  |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
  -------------------------------------------------------
   0 1 2 3         4 5         i   i+1     n-1   n

If we fail to populate the page tables to map one section's memmap, its
->section_mem_map will be cleared finally to indicate that it's not
present.  After use, these two arrays will be released at the end of
sparse_init().

In 4-level paging mode, each array costs 4M which can be ignorable.
While in 5-level paging, they costs 256M each, 512M altogether.  Kdump
kernel Usually only reserves very few memory, e.g 256M.  So, even thouth
they are temporarily allocated, still not acceptable.

In fact, there's no need to allocate them with the size of
NR_MEM_SECTIONS.  Since the ->section_mem_map clearing has been deferred
to the last, the number of present memory sections are kept the same
during sparse_init() until we finally clear out the memory section's
->section_mem_map if its usemap or memmap is not correctly handled.
Thus in the middle whenever for_each_present_section_nr() loop is taken,
the i-th present memory section is always the same one.

Here only allocate usemap_map and map_map with the size of
'nr_present_sections'.  For the i-th present memory section, install its
usemap and memmap to usemap_map[i] and mam_map[i] during allocation.
Then in the last for_each_present_section_nr() loop which clears the
failed memory section's ->section_mem_map, fetch usemap and memmap from
usemap_map[] and map_map[] array and set them into mem_section[]
accordingly.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He
9258631b33 mm/sparse.c: add a new parameter 'data_unit_size' for alloc_usemap_and_memmap
It's used to pass the size of map data unit into
alloc_usemap_and_memmap, and is preparation for next patch.

Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He
07a34a8c36 mm/sparsemem.c: defer the ms->section_mem_map clearing
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, system
will allocate one continuous memory chunk for mem maps on one node and
populate the relevant page tables to map memory section one by one.  If
fail to populate for a certain mem section, print warning and its
->section_mem_map will be cleared to cancel the marking of being
present.  Like this, the number of mem sections marked as present could
become less during sparse_init() execution.

Here just defer the ms->section_mem_map clearing if failed to populate
its page tables until the last for_each_present_section_nr() loop.  This
is in preparation for later optimizing the mem map allocation.

[akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He
f2fc10e0b3 mm/sparse.c: add a static variable nr_present_sections
Patch series "mm/sparse: Optimize memmap allocation during
sparse_init()", v6.

In sparse_init(), two temporary pointer arrays, usemap_map and map_map
are allocated with the size of NR_MEM_SECTIONS.  They are used to store
each memory section's usemap and mem map if marked as present.  In
5-level paging mode, this will cost 512M memory though they will be
released at the end of sparse_init().  System with few memory, like
kdump kernel which usually only has about 256M, will fail to boot
because of allocation failure if CONFIG_X86_5LEVEL=y.

In this patchset, optimize the memmap allocation code to only use
usemap_map and map_map with the size of nr_present_sections.  This makes
kdump kernel boot up with normal crashkernel='' setting when
CONFIG_X86_5LEVEL=y.

This patch (of 5):

nr_present_sections is used to record how many memory sections are
marked as present during system boot up, and will be used in the later
patch.

Link: http://lkml.kernel.org/r/20180228032657.32385-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
7e010df53c mm: use special value SHRINKER_REGISTERING instead of list_empty() check
The patch introduces a special value SHRINKER_REGISTERING to use instead
of list_empty() to differ a registering shrinker from unregistered
shrinker.  Why we need that at all?

Shrinker registration is split in two parts.  The first one is
prealloc_shrinker(), which allocates shrinker memory and reserves ID in
shrinker_idr.  This function can fail.  The second is
register_shrinker_prepared(), and it finalizes the registration.  This
function actually makes shrinker available to be used from
shrink_slab(), and it can't fail.

One shrinker may be based on more then one LRU lists.  So, we never
clear the bit in memcg shrinker maps, when (one of) corresponding LRU
list becomes empty, since other LRU lists may be not empty.  See
superblock shrinker for example: it is based on two LRU lists:
s_inode_lru and s_dentry_lru.  We do not want to clear shrinker bit,
when there are no inodes in s_inode_lru, as s_dentry_lru may contain
dentries.

Instead of that, we use special algorithm to detect shrinkers having no
elements at all its LRU lists, and this is made in shrink_slab_memcg().
See the comment in this function for the details.

Also, in shrink_slab_memcg() we clear shrinker bit in the map, when we
meet unregistered shrinker (bit is set, while there is no a shrinker in
IDR).  Otherwise, we would have done that at the moment of shrinker
unregistration for all memcgs (and this looks worse, since iteration
over all memcg may take much time).  Also this would have imposed
restrictions on shrinker unregistration order for its users: they would
have had to guarantee, there are no new elements after
unregister_shrinker() (otherwise, a new added element would have set a
bit).

So, if we meet a set bit in map and no shrinker in IDR when we're
iterating over the map in shrink_slab_memcg(), this means the
corresponding shrinker is unregistered, and we must clear the bit.

Another case is shrinker registration.  We want two things there:

1) do_shrink_slab() can be called only for completely registered
   shrinkers;

2) shrinker internal lists may be populated in any order with
   register_shrinker_prepared() (let's talk on the example with sb).  Both
   of:

  a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
    memcg_set_shrinker_bit();                               [cpu0]
    ...
    register_shrinker_prepared();                           [cpu1]

  and

  b)register_shrinker_prepared();                           [cpu0]
    ...
    list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
    memcg_set_shrinker_bit();                               [cpu1]

   are legitimate.  We don't want to impose restriction here and to
   force people to use only (b) variant.  We don't want to force people to
   care, there is no elements in LRU lists before the shrinker is
   completely registered.  Internal users of LRU lists and shrinker code
   are two different subsystems, and they have to be closed in themselves
   each other.

In (a) case we have the bit set before shrinker is completely
registered.  We don't want do_shrink_slab() is called at this moment, so
we have to detect such the registering shrinkers.

Before this patch list_empty() (shrinker is not linked to the list)
check was used for that.  So, in (a) there could be a bit set, but we
don't call do_shrink_slab() unless shrinker is linked to the list.  It's
just an indicator, I just overloaded linking to the list.

This was not the best solution, since it's better not to touch the
shrinker memory from shrink_slab_memcg() before it's completely
registered (this also will be useful in the future to make shrink_slab()
completely lockless).

So, this patch introduces better way to detect registering shrinker,
which allows not to dereference shrinker memory.  It's just a ~0UL
value, which we insert into the IDR during ID allocation.  After
shrinker is ready to be used, we insert actual shrinker pointer in the
IDR, and it becomes available to shrink_slab_memcg().

We can't use NULL instead of this new value for this purpose as:
shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
and we don't want the function sees NULL and clears the bit, otherwise
(a) won't work.

This is the only thing the patch makes: the better way to detect
registering shrinker.  Nothing else this patch makes.

Also this gives a better assembler, but it's minor side of the patch:

Before:
  callq  <idr_find>
  mov    %rax,%r15
  test   %rax,%rax
  je     <shrink_slab_memcg+0x1d5>
  mov    0x20(%rax),%rax
  lea    0x20(%r15),%rdx
  cmp    %rax,%rdx
  je     <shrink_slab_memcg+0xbd>
  mov    0x8(%rsp),%edx
  mov    %r15,%rsi
  lea    0x10(%rsp),%rdi
  callq  <do_shrink_slab>

After:
  callq  <idr_find>
  mov    %rax,%r15
  lea    -0x1(%rax),%rax
  cmp    $0xfffffffffffffffd,%rax
  ja     <shrink_slab_memcg+0x1cd>
  mov    0x8(%rsp),%edx
  mov    %r15,%rsi
  lea    0x10(%rsp),%rdi
  callq  ffffffff810cefd0 <do_shrink_slab>

[ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
  Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
ac7fb3ad27 mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab()
In case of shrink_slab_memcg() we do not zero nid, when shrinker is not
numa-aware.  This is not a real problem, since currently all memcg-aware
shrinkers are numa-aware too (we have two: super_block shrinker and
workingset shrinker), but something may change in the future.

Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
f90280d6b7 mm/vmscan.c: clear shrinker bit if there are no objects related to memcg
To avoid further unneed calls of do_shrink_slab() for shrinkers, which
already do not have any charged objects in a memcg, their bits have to
be cleared.

This patch introduces a lockless mechanism to do that without races
without parallel list lru add.  After do_shrink_slab() returns
SHRINK_EMPTY the first time, we clear the bit and call it once again.
Then we restore the bit, if the new return value is different.

Note, that single smp_mb__after_atomic() in shrink_slab_memcg() covers
two situations:

1)list_lru_add()     shrink_slab_memcg
    list_add_tail()    for_each_set_bit() <--- read bit
                         do_shrink_slab() <--- missed list update (no barrier)
    <MB>                 <MB>
    set_bit()            do_shrink_slab() <--- seen list update

This situation, when the first do_shrink_slab() sees set bit, but it
doesn't see list update (i.e., race with the first element queueing), is
rare.  So we don't add <MB> before the first call of do_shrink_slab()
instead of this to do not slow down generic case.  Also, it's need the
second call as seen in below in (2).

2)list_lru_add()      shrink_slab_memcg()
    list_add_tail()     ...
    set_bit()           ...
  ...                   for_each_set_bit()
  do_shrink_slab()        do_shrink_slab()
    clear_bit()           ...
  ...                     ...
  list_lru_add()          ...
    list_add_tail()       clear_bit()
    <MB>                  <MB>
    set_bit()             do_shrink_slab()

The barriers guarantee that the second do_shrink_slab() in the right
side task sees list update if really cleared the bit.  This case is
drawn in the code comment.

[Results/performance of the patchset]

After the whole patchset applied the below test shows signify increase
of performance:

  $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
  $mkdir /sys/fs/cgroup/memory/ct
  $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
      $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
			    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
			    mkdir -p s/$i; mount -t tmpfs $i s/$i;
			    touch s/$i/file; done

Then, 5 sequential calls of drop caches:

  $time echo 3 > /proc/sys/vm/drop_caches

1)Before:
  0.00user 13.78system 0:13.78elapsed 99%CPU
  0.00user 5.59system 0:05.60elapsed 99%CPU
  0.00user 5.48system 0:05.48elapsed 99%CPU
  0.00user 8.35system 0:08.35elapsed 99%CPU
  0.00user 8.34system 0:08.35elapsed 99%CPU

2)After
  0.00user 1.10system 0:01.10elapsed 99%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU

The results show the performance increases at least in 548 times.

Shakeel Butt tested this patchset with fork-bomb on his configuration:

 > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
 > file containing few KiBs on corresponding mount. Then in a separate
 > memcg of 200 MiB limit ran a fork-bomb.
 >
 > I ran the "perf record -ag -- sleep 60" and below are the results:
 >
 > Without the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
 > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
 > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
 > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
 > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
 > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 >
 > With the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
 > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
 > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
 > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
 > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
 > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
 > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
9b996468cf mm: add SHRINK_EMPTY shrinker methods return value
We need to distinguish the situations when shrinker has very small
amount of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all.  Currently, in
the both of these cases, shrinker::count_objects() returns 0.

The patch introduces new SHRINK_EMPTY return value, which will be used
for "no objects at all" case.  It's is a refactoring mostly, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in further.

Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Vladimir Davydov
aeed1d325d mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
The patch makes shrink_slab() be called for root_mem_cgroup in the same
way as it's called for the rest of cgroups.  This simplifies the logic
and improves the readability.

[ktkhai@virtuozzo.com: wrote changelog]
Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
b0dedc49a2 mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()
Using the preparations made in previous patches, in case of memcg
shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
bitmap.  To do that, we separate iterations over memcg-aware and
!memcg-aware shrinkers, and memcg-aware shrinkers are chosen via
for_each_set_bit() from the bitmap.  In case of big nodes, having many
isolated environments, this gives significant performance growth.  See
next patches for the details.

Note that the patch does not respect to empty memcg shrinkers, since we
never clear the bitmap bits after we set it once.  Their shrinkers will
be called again, with no shrinked objects as result.  This functionality
is provided by next patches.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
fae91d6d8b mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance
Introduce set_shrinker_bit() function to set shrinker-related bit in
memcg shrinker bitmap, and set the bit after the first item is added and
in case of reparenting destroyed memcg's items.

This will allow next patch to make shrinkers be called only, in case of
they have charged objects at the moment, and to improve shrink_slab()
performance.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
dfd2f10ccf mm/memcontrol.c: export mem_cgroup_is_root()
This will be used in next patch.

Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
3b82c4dcc2 mm/list_lru.c: pass lru argument to memcg_drain_list_lru_node()
This is just refactoring to allow next patches to have lru pointer in
memcg_drain_list_lru_node().

Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
9bec5c35bf mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node()
This is just refactoring to allow the next patches to have dst_memcg
pointer in memcg_drain_list_lru_node().

Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
44bd4a4759 mm/list_lru.c: add memcg argument to list_lru_from_kmem()
This is just refactoring to allow the next patches to have memcg pointer
in list_lru_from_kmem().

Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
c92e8e10ca fs: propagate shrinker::id to list_lru
Add list_lru::shrinker_id field and populate it by registered shrinker
id.

This will be used to set correct bit in memcg shrinkers map by lru code
in next patches, after there appeared the first related to memcg element
in list_lru.

Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
39887653aa mm/workingset.c: refactor workingset_init()
Use prealloc_shrinker()/register_shrinker_prepared() instead of
register_shrinker().  This will be used in next patch.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
0a4465d340 mm, memcg: assign memcg-aware shrinkers bitmap to memcg
Imagine a big node with many cpus, memory cgroups and containers.  Let
we have 200 containers, every container has 10 mounts, and 10 cgroups.
All container tasks don't touch foreign containers mounts.  If there is
intensive pages write, and global reclaim happens, a writing task has to
iterate over all memcgs to shrink slab, before it's able to go to
shrink_page_list().

Iteration over all the memcg slabs is very expensive: the task has to
visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
2000 memcgs, the total calls are 2000 * 2000 = 4000000.

So, the shrinker makes 4 million do_shrink_slab() calls just to try to
isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcg via
shrink_page_list().  I've observed a node spending almost 100% in
kernel, making useless iteration over already shrinked slab.

This patch adds bitmap of memcg-aware shrinkers to memcg.  The size of
the bitmap depends on bitmap_nr_ids, and during memcg life it's
maintained to be enough to fit bitmap_nr_ids shrinkers.  Every bit in
the map is related to corresponding shrinker id.

Next patches will maintain set bit only for really charged memcg.  This
will allow shrink_slab() to increase its performance in significant way.
See the last patch for the numbers.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
[ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
  Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
b05706f100 mm/memcontrol.c: move up for_each_mem_cgroup{, _tree} defines
Next patch requires these defines are above their current position, so
here they are moved to declarations.

Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
b4c2b231c3 mm: assign id to every memcg-aware shrinker
Introduce shrinker::id number, which is used to enumerate memcg-aware
shrinkers.  The number start from 0, and the code tries to maintain it
as small as possible.

This will be used to represent a memcg-aware shrinkers in memcg
shrinkers map.

Since all memcg-aware shrinkers are based on list_lru, which is
per-memcg in case of !CONFIG_MEMCG_KMEM only, the new functionality will
be under this config option.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
84c07d11aa mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB
Introduce new config option, which is used to replace repeating
CONFIG_MEMCG && !CONFIG_SLOB pattern.  Next patches add a little more
memcg+kmem related code, so let's keep the defines more clearly.

Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
e0295238e5 mm/list_lru.c: combine code under the same define
Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8.

This patcheset solves the problem with slow shrink_slab() occuring on
the machines having many shrinkers and memory cgroups (i.e., with many
containers).  The problem is complexity of shrink_slab() is O(n^2) and
it grows too fast with the growth of containers numbers.

Let us have 200 containers, and every container has 10 mounts and 10
cgroups.  All container tasks are isolated, and they don't touch foreign
containers mounts.

In case of global reclaim, a task has to iterate all over the memcgs and
to call all the memcg-aware shrinkers for all of them.  This means, the
task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since
there are 2000 memcgs, the total calls of do_shrink_slab() are 2000 *
2000 = 4000000.

4 million calls are not a number operations, which can takes 1 cpu
cycle.  E.g., super_cache_count() accesses at least two lists, and makes
arifmetical calculations.  Even, if there are no charged objects, we do
these calculations, and replaces cpu caches by read memory.  I observed
nodes spending almost 100% time in kernel, in case of intensive writing
and global reclaim.  The writer consumes pages fast, but it's need to
shrink_slab() before the reclaimer reached shrink pages function (and
frees SWAP_CLUSTER_MAX pages).  Even if there is no writing, the
iterations just waste the time, and slows reclaim down.

Let's see the small test below:

  $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
  $mkdir /sys/fs/cgroup/memory/ct
  $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
  $for i in `seq 0 4000`;
          do mkdir /sys/fs/cgroup/memory/ct/$i;
          echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
          mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
  done

Then, let's see drop caches time (5 sequential calls):

  $time echo 3 > /proc/sys/vm/drop_caches

  0.00user 13.78system 0:13.78elapsed 99%CPU
  0.00user 5.59system 0:05.60elapsed 99%CPU
  0.00user 5.48system 0:05.48elapsed 99%CPU
  0.00user 8.35system 0:08.35elapsed 99%CPU
  0.00user 8.34system 0:08.35elapsed 99%CPU

The last four calls don't actually shrink anything.  So, the iterations
over slab shrinkers take 5.48 seconds.  Not so good for scalability.

The patchset solves the problem by making shrink_slab() of O(n)
complexity.  There are following functional actions:

1) Assign id to every registered memcg-aware shrinker.

2) Maintain per-memcgroup bitmap of memcg-aware shrinkers, and set a
   shrinker-related bit after the first element is added to lru list
   (also, when removed child memcg elements are reparanted).

3) Split memcg-aware shrinkers and !memcg-aware shrinkers, and call a
   shrinker if its bit is set in memcg's shrinker bitmap.  (Also, there is
   a functionality to clear the bit, after last element is shrinked).

This gives significant performance increase.  The result after patchset
is applied:

  $time echo 3 > /proc/sys/vm/drop_caches

  0.00user 1.10system 0:01.10elapsed 99%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU

The results show the performance increases at least in 548 times.

So, the patchset makes shrink_slab() of less complexity and improves the
performance in such types of load I pointed.  This will give a profit in
case of !global reclaim case, since there also will be less
do_shrink_slab() calls.

This patch (of 17):

These two pairs of blocks of code are under the same #ifdef #else
#endif.

Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Roman Gushchin <guro@fb.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Waiman Long <longman@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Mike Rapoport
a36aab890c mm/memblock.c: replace u64 with phys_addr_t where appropriate
Most functions in memblock already use phys_addr_t to represent a
physical address with __memblock_free_late() being an exception.

This patch replaces u64 with phys_addr_t in __memblock_free_late() and
switches several format strings from %llx to %pa to avoid casting from
phys_addr_t to u64.

Link: http://lkml.kernel.org/r/1530637506-1256-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Oscar Salvador
4e40987f12 mm/sparse.c: make sparse_init_one_section void and remove check
sparse_init_one_section() is being called from two sites: sparse_init()
and sparse_add_one_section().  The former calls it from a
for_each_present_section_nr() loop, and the latter marks the section as
present before calling it.  This means that when
sparse_init_one_section() gets called, we already know that the section
is present.  So there is no point to double check that in the function.

This removes the check and makes the function void.

[ross.zwisler@linux.intel.com: fix error path in sparse_add_one_section]
  Link: http://lkml.kernel.org/r/20180706190658.6873-1-ross.zwisler@linux.intel.com
[ross.zwisler@linux.intel.com: simplification suggested by Oscar]
  Link: http://lkml.kernel.org/r/20180706223358.742-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20180702154325.12196-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Michal Hocko
29ef680ae7 memcg, oom: move out_of_memory back to the charge path
Commit 3812c8c8f3 ("mm: memcg: do not trap chargers with full
callstack on OOM") has changed the ENOMEM semantic of memcg charges.
Rather than invoking the oom killer from the charging context it delays
the oom killer to the page fault path (pagefault_out_of_memory).  This
in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
the corresponding memcg hits the hard limit and the memcg is is OOM.
This is behavior is inconsistent with !memcg case where the oom killer
is invoked from the allocation context and the allocator keeps retrying
until it succeeds.

The difference in the behavior is user visible.  mmap(MAP_POPULATE)
might result in not fully populated ranges while the mmap return code
doesn't tell that to the userspace.  Random syscalls might fail with
ENOMEM etc.

The primary motivation of the different memcg oom semantic was the
deadlock avoidance.  Things have changed since then, though.  We have an
async oom teardown by the oom reaper now and so we do not have to rely
on the victim to tear down its memory anymore.  Therefore we can return
to the original semantic as long as the memcg oom killer is not handed
over to the users space.

There is still one thing to be careful about here though.  If the oom
killer is not able to make any forward progress - e.g.  because there is
no eligible task to kill - then we have to bail out of the charge path
to prevent from same class of deadlocks.  We have basically two options
here.  Either we fail the charge with ENOMEM or force the charge and
allow overcharge.  The first option has been considered more harmful
than useful because rare inconsistencies in the ENOMEM behavior is hard
to test for and error prone.  Basically the same reason why the page
allocator doesn't fail allocations under such conditions.  The later
might allow runaways but those should be really unlikely unless somebody
misconfigures the system.  E.g.  allowing to migrate tasks away from the
memcg to a different unlimited memcg with move_charge_at_immigrate
disabled.

Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Mike Rapoport
d39f8fb4b7 mm: make DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM
The deferred memory initialization relies on section definitions, e.g
PAGES_PER_SECTION, that are only available when CONFIG_SPARSEMEM=y on
most architectures.

Initially DEFERRED_STRUCT_PAGE_INIT depended on explicit
ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT configuration option, but since
the commit 2e3ca40f03 ("mm: relax deferred struct page
requirements") this requirement was relaxed and now it is possible to
enable DEFERRED_STRUCT_PAGE_INIT on architectures that support
DISCONTINGMEM and NO_BOOTMEM which causes build failures.

For instance, setting SMP=y and DEFERRED_STRUCT_PAGE_INIT=y on arc
causes the following build failure:

    CC      mm/page_alloc.o
  mm/page_alloc.c: In function 'update_defer_init':
  mm/page_alloc.c:321:14: error: 'PAGES_PER_SECTION'
  undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
        (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                ^~~~~~~~~~~~~~~~~
                USEC_PER_SEC
  mm/page_alloc.c:321:14: note: each undeclared identifier is reported only once for each function it appears in
  In file included from include/linux/cache.h:5:0,
                   from include/linux/printk.h:9,
                   from include/linux/kernel.h:14,
                   from include/asm-generic/bug.h:18,
                   from arch/arc/include/asm/bug.h:32,
                   from include/linux/bug.h:5,
                   from include/linux/mmdebug.h:5,
                   from include/linux/mm.h:9,
                   from mm/page_alloc.c:18:
  mm/page_alloc.c: In function 'deferred_grow_zone':
  mm/page_alloc.c:1624:52: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
    unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                                      ^
  include/uapi/linux/kernel.h:11:47: note: in definition of macro '__ALIGN_KERNEL_MASK'
   #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
                                                 ^~~~
  include/linux/kernel.h:58:22: note: in expansion of macro '__ALIGN_KERNEL'
   #define ALIGN(x, a)  __ALIGN_KERNEL((x), (a))
                        ^~~~~~~~~~~~~~
  mm/page_alloc.c:1624:34: note: in expansion of macro 'ALIGN'
    unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                    ^~~~~
  In file included from include/asm-generic/bug.h:18:0,
                   from arch/arc/include/asm/bug.h:32,
                   from include/linux/bug.h:5,
                   from include/linux/mmdebug.h:5,
                   from include/linux/mm.h:9,
                   from mm/page_alloc.c:18:
  mm/page_alloc.c: In function 'free_area_init_node':
  mm/page_alloc.c:6379:50: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                                    ^
  include/linux/kernel.h:812:22: note: in definition of macro '__typecheck'
     (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                        ^
  include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
    __builtin_choose_expr(__safe_cmp(x, y), \
                          ^~~~~~~~~~
  include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
   #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                             ^~~~~~~~~~~~~
  mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                               ^~~~~
  include/linux/kernel.h:836:2: error: first argument to '__builtin_choose_expr' not a constant
    __builtin_choose_expr(__safe_cmp(x, y), \
    ^
  include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
   #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                             ^~~~~~~~~~~~~
  mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                               ^~~~~
  scripts/Makefile.build:317: recipe for target 'mm/page_alloc.o' failed

Let's make the DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM
as the systems that support DISCONTIGMEM do not seem to have that huge
amounts of memory that would make DEFERRED_STRUCT_PAGE_INIT relevant.

Link: http://lkml.kernel.org/r/1530279308-24988-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00