- Add support for region alignment configuration and enforcement to
fix compatibility across architectures and PowerPC page size
configurations.
- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach in
the platform-memory subsystem before the platform will consider them
power-fail protected.
- Promote numa_map_to_online_node() to a cross-kernel generic facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Pick up some miscellaneous minor fixes, that missed v5.6-final,
including a some smatch reports in the ioctl path and some unit test
compilation fixups.
- Fixup some flexible-array declarations.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAl6LtIAACgkQHtKRamZ9
iAIwRA/8CLVVuQpgHQ1tqK4h8CZPrISFXh7wy7uhocEU2xrDh6iGVnLztmoLRr2k
5f8T9lRzreSAwIVL5DbGqP1pFncqIt9VMnKsFlaPMBGCBNR+hURY0iBCNjIT+jiq
BOzLd52MR2rqJxeXGTMUbWrBrbmuj4mZPdmGVuFFe7GFRpoaVpCgOo+296eWa/ot
gIOFUTonZY7STYjNvDok0TXCmiCFuJb+P+y5ldfCPShHvZhTiaF53jircja8vAjO
G5dt8ixBKUK0rXRc4SEQsQhAZNcAFHb6Gy5lg4C2QzhTF374xTc9usJZNWbIE9iM
5mipBYvjVuoY+XaCNZDkaRcJIy/jqB15O6l3QIWbZLGaK9m95YPp9LmkPFwd3JpO
e3rO24ML471DxqB9iWIiJCNcBBocLOlnd6qAQTpppWDpGNbudwXvfsmKHmKIScSE
x+IDCdscLmmm+WG2dLmLraWOVPu42xZFccoQCi4M3TTqfeB9pZ9XckFQ37zX62zG
5t+7Ek+t1W4QVt/JQYVKH03XT15sqUpVknvx0Hl4Y5TtbDOkFLkO8RN0/HyExDef
7iegS35kqTsM4EfZQ+9juKbI2JBAjHANcbj0V4dogqaRj6vr3akumBzUtuYqAofv
qU3s9skmLsEemOJC+ns2PT8vl5dyIoeDfH0r2XvGWxYqolMqJpA=
=sY4N
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and dax updates from Dan Williams:
"There were multiple touches outside of drivers/nvdimm/ this round to
add cross arch compatibility to the devm_memremap_pages() interface,
enhance numa information for persistent memory ranges, and add a
zero_page_range() dax operation.
This cycle I switched from the patchwork api to Konstantin's b4 script
for collecting tags (from x86, PowerPC, filesystem, and device-mapper
folks), and everything looks to have gone ok there. This has all
appeared in -next with no reported issues.
Summary:
- Add support for region alignment configuration and enforcement to
fix compatibility across architectures and PowerPC page size
configurations.
- Introduce 'zero_page_range' as a dax operation. This facilitates
filesystem-dax operation without a block-device.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
- Advertise a persistence-domain for of_pmem and papr_scm. The
persistence domain indicates where cpu-store cycles need to reach
in the platform-memory subsystem before the platform will consider
them power-fail protected.
- Promote numa_map_to_online_node() to a cross-kernel generic
facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Pick up some miscellaneous minor fixes, that missed v5.6-final,
including a some smatch reports in the ioctl path and some unit
test compilation fixups.
- Fixup some flexible-array declarations"
* tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
dax: Move mandatory ->zero_page_range() check in alloc_dax()
dax,iomap: Add helper dax_iomap_zero() to zero a range
dax: Use new dax zero page method for zeroing a page
dm,dax: Add dax zero_page_range operation
s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
dax, pmem: Add a dax operation zero_page_range
pmem: Add functions for reading/writing page to/from pmem
libnvdimm: Update persistence domain value for of_pmem and papr_scm device
tools/test/nvdimm: Fix out of tree build
libnvdimm/region: Fix build error
libnvdimm/region: Replace zero-length array with flexible-array member
libnvdimm/label: Replace zero-length array with flexible-array member
ACPI: NFIT: Replace zero-length array with flexible-array member
libnvdimm/region: Introduce an 'align' attribute
libnvdimm/region: Introduce NDD_LABELING
libnvdimm/namespace: Enforce memremap_compat_align()
libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
libnvdimm: Out of bounds read in __nd_ioctl()
acpi/nfit: improve bounds checking for 'func'
mm/memremap_pages: Introduce memremap_compat_align()
...
__get_user_pages_locked() will return 0 instead of -EINTR after commit
4426e945df ("mm/gup: allow VM_FAULT_RETRY for multiple times") which
added extra code to allow gup detect fatal signal faster.
Restore the original -EINTR behavior.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 4426e945df ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: syzbot+3be1a33f04dc782e9fd5@syzkaller.appspotmail.com
Signed-off-by: Hillf Danton <hdanton@sina.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's definitely incorrect to mark the lock as taken even if
down_read_killable() failed.
This wass overlooked when we switched from down_read() to
down_read_killable() because down_read() won't fail while
down_read_killable() could.
Fixes: 71335f37c5 ("mm/gup: allow to react to fatal signals")
Reported-by: syzbot+a8c70b7f3579fc0587dc@syzkaller.appspotmail.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lookup_node() uses gup to pin the page and get node information. It
checks against ret>=0 assuming the page will be filled in. However it's
also possible that gup will return zero, for example, when the thread is
quickly killed with a fatal signal. Teach lookup_node() to gracefully
return an error -EFAULT if it happens.
Meanwhile, initialize "page" to NULL to avoid potential risk of
exploiting the pointer.
Fixes: 4426e945df ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
filter_irq_stacks() can be used by other tools (e.g. KMSAN), so it needs
to be moved to a common location. lib/stackdepot.c seems a good place, as
filter_irq_stacks() is usually applied to the output of
stack_trace_save().
This patch has been previously mailed as part of KMSAN RFC patch series.
[glider@google.co: nds32: linker script: add SOFTIRQENTRY_TEXT\
Link: http://lkml.kernel.org/r/20200311121002.241430-1-glider@google.com
[glider@google.com: add IRQENTRY_TEXT and SOFTIRQENTRY_TEXT to linker script]
Link: http://lkml.kernel.org/r/20200311121124.243352-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Link: http://lkml.kernel.org/r/20200220141916.55455-3-glider@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...
Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed. Files like /proc/cpuinfo which never
disappear simply do not need such protection.
Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.
Enable "permanent" flag for
/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swaps
More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.
This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".
N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%
Benchmark source:
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;
int xxx = 0;
int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}
void f(int n)
{
glue(n % NR_CPUS);
while (*(volatile int *)&xxx == 0) {
}
for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}
int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}
N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);
for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}
std::vector<std::thread> T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}
auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> dt = t1 - t0;
std::cout << dt.count() << '
';
return 0;
}
P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.
[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously there was a check if 'size' is aligned to 'align' and if not
then it was aligned. This check was expensive as both branch and division
are expensive instructions in most architectures. 'ALIGN' function on
already aligned value will not change it, and as it is cheaper than branch
+ division it can be executed all the time and branch can be removed.
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200320173317.26408-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MAX_ZONELISTS is a compile time constant, so it should be compared using
BUILD_BUG_ON not BUG_ON.
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Link: http://lkml.kernel.org/r/20200228224617.11343-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The parameter of remap_pfn_range() @pfn passed from the caller is actually
a page-frame number converted by corresponding physical address of kernel
memory, the original comment is ambiguous that may mislead the users.
Meanwhile, there is an ambiguous typo "VMM" in the comment of
vm_area_struct. So fixing them will make the code more readable.
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at unpin_tag()()
warning: context imbalance in unpin_tag() - unexpected unlock
The root cause is the missing annotation at unpin_tag()
Add the missing __releases(bitlock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-14-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at pin_tag()()
warning: context imbalance in pin_tag() - wrong count at exit
The root cause is the missing annotation at pin_tag()
Add the missing __acquires(bitlock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-13-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at migrate_read_unlock()()
warning: context imbalance in migrate_read_unlock() - unexpected unlock
The root cause is the missing annotation at migrate_read_unlock()
Add the missing __releases(&zspage->lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-12-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at migrate_read_lock()()
warning: context imbalance in migrate_read_lock() - wrong count at exit
The root cause is the missing annotation at migrate_read_lock()
Add the missing __acquires(&zspage->lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-11-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at put_map()()
warning: context imbalance in put_map() - unexpected unlock
The root cause is the missing annotation at put_map()
Add the missing __releases(&object_map_lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-10-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at get_map()()
warning: context imbalance in get_map() - wrong count at exit
The root cause is the missing annotation at get_map()
Add the missing __acquires(&object_map_lock) annotation
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-9-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at queue_pages_pmd()
context imbalance in queue_pages_pmd() - unexpected unlock
The root cause is the missing annotation at queue_pages_pmd()
Add the missing __releases(ptl)
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-8-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at gather_surplus_pages()
warning: context imbalance in hugetlb_cow() - unexpected unlock
The root cause is the missing annotation at gather_surplus_pages()
Add the missing __must_hold(&hugetlb_lock)
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Link: http://lkml.kernel.org/r/20200214204741.94112-7-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sparse reports a warning at compact_lock_irqsave()
warning: context imbalance in compact_lock_irqsave() - wrong count at exit
The root cause is the missing annotation at compact_lock_irqsave()
Add the missing __acquires(lock) annotation.
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-6-jbi.octave@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The compressed cache for swap pages (zswap) currently needs from 1 to 3
extra kernel command line parameters in order to make it work: it has to
be enabled by adding a "zswap.enabled=1" command line parameter and if one
wants a different compressor or pool allocator than the default lzo / zbud
combination then these choices also need to be specified on the kernel
command line in additional parameters.
Using a different compressor and allocator for zswap is actually pretty
common as guides often recommend using the lz4 / z3fold pair instead of
the default one. In such case it is also necessary to remember to enable
the appropriate compression algorithm and pool allocator in the kernel
config manually.
Let's avoid the need for adding these kernel command line parameters and
automatically pull in the dependencies for the selected compressor
algorithm and pool allocator by adding an appropriate default switches to
Kconfig.
The default values for these options match what the code was using
previously as its defaults.
Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Link: http://lkml.kernel.org/r/20200202000112.456103-1-mail@maciej.szmigiero.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I recently build the RISC-V port with LLVM trunk, which has introduced a
new warning when casting from a pointer to an enum of a smaller size.
This patch simply casts to a long in the middle to stop the warning. I'd
be surprised this is the only one in the kernel, but it's the only one I
saw.
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi writes:
Currently, when truncating a shmem file, if the range is partly in a THP
(start or end is in the middle of THP), the pages actually will just get
cleared rather than being freed, unless the range covers the whole THP.
Even though all the subpages are truncated (randomly or sequentially), the
THP may still be kept in page cache.
This might be fine for some usecases which prefer preserving THP, but
balloon inflation is handled in base page size. So when using shmem THP
as memory backend, QEMU inflation actually doesn't work as expected since
it doesn't free memory. But the inflation usecase really needs to get the
memory freed. (Anonymous THP will also not get freed right away, but will
be freed eventually when all subpages are unmapped: whereas shmem THP
still stays in page cache.)
Split THP right away when doing partial hole punch, and if split fails
just clear the page so that read of the punched area will return zeroes.
Hugh Dickins adds:
Our earlier "team of pages" huge tmpfs implementation worked in the way
that Yang Shi proposes; and we have been using this patch to continue to
split the huge page when hole-punched or truncated, since converting over
to the compound page implementation. Although huge tmpfs gives out huge
pages when available, if the user specifically asks to truncate or punch a
hole (perhaps to free memory, perhaps to reduce the memcg charge), then
the filesystem should do so as best it can, splitting the huge page.
That is not always possible: any additional reference to the huge page
prevents split_huge_page() from succeeding, so the result can be flaky.
But in practice it works successfully enough that we've not seen any
problem from that.
Add shmem_punch_compound() to encapsulate the decision of when a split is
needed, and doing the split if so. Using this simplifies the flow in
shmem_undo_range(); and the first (trylock) pass does not need to do any
page clearing on failure, because the second pass will either succeed or
do that clearing. Following the example of zero_user_segment() when
clearing a partial page, add flush_dcache_page() and set_page_dirty() when
clearing a hole - though I'm not certain that either is needed.
But: split_huge_page() would be sure to fail if shmem_undo_range()'s
pagevec holds further references to the huge page. The easiest way to fix
that is for find_get_entries() to return early, as soon as it has put one
compound head or tail into the pagevec. At first this felt like a hack;
but on examination, this convention better suits all its callers - or will
do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
speedup by checking for compound pages there.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously 0 was assigned to variable 'error' but the variable was never
read before reassignemnt later. So the assignment can be removed.
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200301152832.24595-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Variables declared in a switch statement before any case statements cannot
be automatically initialized with compiler instrumentation (as they are
not part of any execution flow). With GCC's proposed automatic stack
variable initialization feature, this triggers a warning (and they don't
get initialized). Clang's automatic stack variable initialization (via
CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't
initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase, so
even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.
To avoid these problems, move such variables into the "case" where they're
used or lift them up into the main function body.
mm/shmem.c: In function `shmem_getpage_gfp':
mm/shmem.c:1816:10: warning: statement will never be executed [-Wswitch-unreachable]
1816 | loff_t i_size;
| ^~~~~~
[1] https://bugs.llvm.org/show_bug.cgi?id=44916
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alexander Potapenko <glider@google.com>
Link: http://lkml.kernel.org/r/20200220062312.69165-1-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use __pfn_to_section() API instead of open-coding for better code
readability.
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/1584345134-16671-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For now, distributions implement advanced udev rules to essentially
- Don't online any hotplugged memory (s390x)
- Online all memory to ZONE_NORMAL (e.g., most virt environments like
hyperv)
- Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
care of (e.g., bare metal, special virt environments)
In summary: All memory is usually onlined the same way, however, the
kernel always has to ask user space to come up with the same answer.
E.g., Hyper-V always waits for a memory block to get onlined before
continuing, otherwise it might end up adding memory faster than
onlining it, which can result in strange OOM situations. This waiting
slows down adding of a bigger amount of memory.
Let's allow to specify a default online_type, not just "online" and
"offline". This allows distributions to configure the default online_type
when booting up and be done with it.
We can now specify "offline", "online", "online_movable" and
"online_kernel" via
- "memhp_default_state=" on the kernel cmdline
- /sys/devices/system/memory/auto_online_blocks
just like we are able to specify for a single memory block via
/sys/devices/system/memory/memoryX/state
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Yumei Huang <yuhuang@redhat.com>
Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... and rename it to memhp_default_online_type. This is a preparation
for more detailed default online behavior.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Yumei Huang <yuhuang@redhat.com>
Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All in-tree users except the mm-core are gone. Let's drop the export.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Yumei Huang <yuhuang@redhat.com>
Link: http://lkml.kernel.org/r/20200317104942.11178-7-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional change.
[bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
And tell check_pfn_span() gating the porper alignment and size of hot
added memory region.
And also move the code comments from inside section_deactivate() to being
above it. The code comments are reasonable for the whole function, and
the moving makes code cleaner.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, to support subsection aligned memory region adding for pmem,
subsection map is added to track which subsection is present.
However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means
subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the
classic sparse, it's meaningless. Even worse, it may confuse people when
checking code related to the classic sparse.
About the classic sparse which doesn't support subsection hotplug, Dan
said it's more because the effort and maintenance burden outweighs the
benefit. Besides, the current 64 bit ARCHes all enable
SPARSEMEM_VMEMMAP_ENABLE by default.
Combining the above reasons, no need to provide subsection map and the
relevant handling for the classic sparse. Let's remove them.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Factor out the code which clear subsection map of one memory region from
section_deactivate() into clear_subsection_map().
And also add helper function is_subsection_map_empty() to check if the
current subsection map is empty or not.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.
Memory sub-section hotplug was added to fix the issue that nvdimm could be
mapped at non-section aligned starting address. A subsection map is added
into struct mem_section_usage to implement it.
However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means
subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the
classic sparse, subsection map is meaningless and confusing.
About the classic sparse which doesn't support subsection hotplug, Dan
said it's more because the effort and maintenance burden outweighs the
benefit. Besides, the current 64 bit ARCHes all enable
SPARSEMEM_VMEMMAP_ENABLE by default.
This patch (of 5):
Factor out the code that fills the subsection map from section_activate()
into fill_subsection_map(), this makes section_activate() cleaner and
easier to follow.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's drop the basically unused section stuff and simplify. The logic now
matches the logic in __remove_pages().
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Segher Boessenkool <segher@kernel.crashing.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: http://lkml.kernel.org/r/20200228095819.10750-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 52fb87c81f ("mm/memory_hotplug: cleanup __remove_pages()"), we
cleaned up __remove_pages(), and introduced a shorter variant to calculate
the number of pages to the next section boundary.
Turns out we can make this calculation easier to read. We always want to
have the number of pages (> 0) to the next section boundary, starting from
the current pfn.
We'll clean up __remove_pages() in a follow-up patch and directly make use
of this computation.
Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: http://lkml.kernel.org/r/20200228095819.10750-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 357b4da50a ("x86: respect memory size limiting via mem=
parameter") a global varialbe max_mem_size is added to store the value
parsed from 'mem= ', then checked when memory region is added. This truly
stops those DIMMs from being added into system memory during boot-time.
However, it also limits the later memory hotplug functionality. Any DIMM
can't be hotplugged any more if its region is beyond the max_mem_size. We
will get errors like:
[ 216.387164] acpi PNP0C80:02: add_memory failed
[ 216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error
[ 216.392187] acpi PNP0C80:02: Enumeration failure
This will cause issue in a known use case where 'mem=' is added to the
hypervisor. The memory that lies after 'mem=' boundary will be assigned
to KVM guests. After commit 357b4da50a merged, memory can't be extended
dynamically if system memory on hypervisor is not sufficient.
So fix it by also checking if it's during boot-time restricting to add
memory. Otherwise, skip the restriction.
And also add this use case to document of 'mem=' kernel parameter.
Fixes: 357b4da50a ("x86: respect memory size limiting via mem= parameter")
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit c5e79ef561 ("mm/memory_hotplug.c: don't allow to
online/offline memory blocks with holes") we disallow to offline any
memory with holes. As all boot memory is online and hotplugged memory
cannot contain holes, we never online memory with holes.
This present check can be dropped.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Link: http://lkml.kernel.org/r/20200127110424.5757-4-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add API to enable/disable writeprotect a vma range. Unlike mprotect, this
doesn't split/merge vmas.
[peterx@redhat.com:
- use the helper to find VMA;
- return -ENOENT if not found to match mcopy case;
- use the new MM_CP_UFFD_WP* flags for change_protection
- check against mmap_changing for failures
- replace find_dst_vma with vma_find_uffd]
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't collapse the huge PMD if there is any userfault write protected
small PTEs. The problem is that the write protection is in small page
granularity and there's no way to keep all these write protection
information if the small pages are going to be merged into a huge PMD.
The same thing needs to be considered for swap entries and migration
entries. So do the check as well disregarding khugepaged_max_ptes_swap.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For either swap and page migration, we all use the bit 2 of the entry to
identify whether this entry is uffd write-protected. It plays a similar
role as the existing soft dirty bit in swap entries but only for keeping
the uffd-wp tracking for a specific PTE/PMD.
Something special here is that when we want to recover the uffd-wp bit
from a swap/migration entry to the PTE bit we'll also need to take care of
the _PAGE_RW bit and make sure it's cleared, otherwise even with the
_PAGE_UFFD_WP bit we can't trap it at all.
In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
That can lead to data mismatch if the page that we are going to write
protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
also applies/removes the uffd-wp bit even for the swap entries.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
UFFD_EVENT_FORK support for uffd-wp should be already there, except that
we should clean the uffd-wp bit if uffd fork event is not enabled. Detect
that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked
by VM_UFFD_WP. Do this for both small PTEs and huge PMDs.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
change_protection() when used with uffd-wp and make sure the two new flags
are exclusively used. Then,
- For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
when a range of memory is write protected by uffd
- For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
_PAGE_RW when write protection is resolved from userspace
And use this new interface in mwriteprotect_range() to replace the old
MM_CP_DIRTY_ACCT.
Do this change for both PTEs and huge PMDs. Then we can start to identify
which PTE/PMD is write protected by general (e.g., COW or soft dirty
tracking), and which is for userfaultfd-wp.
Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
can be even more strict when detecting uffd-wp page faults in either
do_wp_page() or wp_huge_pmd().
After we're with _PAGE_UFFD_WP, a special case is when a page is both
protected by the general COW logic and also userfault-wp. Here the
userfault-wp will have higher priority and will be handled first. Only
after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
the general COW. These are the steps on what will happen with such a
page:
1. CPU accesses write protected shared page (so both protected by
general COW and uffd-wp), blocked by uffd-wp first because in
do_wp_page we'll handle uffd-wp first, so it has higher priority
than general COW.
2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
to remove the uffd-wp bit upon the PTE/PMD. However here we
still keep the write bit cleared. Notify the blocked CPU.
3. The blocked CPU resumes the page fault process with a fault
retry, during retry it'll notice it was not with the uffd-wp bit
this time but it is still write protected by general COW, then
it'll go though the COW path in the fault handler, copy the page,
apply write bit where necessary, and retry again.
4. The CPU will be able to access this page with write bit set.
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
change_protection() was used by either the NUMA or mprotect() code,
there's one parameter for each of the callers (dirty_accountable and
prot_numa). Further, these parameters are passed along the calls:
- change_protection_range()
- change_p4d_range()
- change_pud_range()
- change_pmd_range()
- ...
Now we introduce a flag for change_protect() and all these helpers to
replace these parameters. Then we can avoid passing multiple parameters
multiple times along the way.
More importantly, it'll greatly simplify the work if we want to introduce
any new parameters to change_protection(). In the follow up patches, a
new parameter for userfaultfd write protection will be introduced.
No functional change at all.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several cases write protection fault happens. It could be a
write to zero page, swaped page or userfault write protected page. When
the fault happens, there is no way to know if userfault write protect the
page before. Here we just blindly issue a userfault notification for vma
with VM_UFFD_WP regardless if app write protects it yet. Application
should be ready to handle such wp fault.
In the swapin case, always swapin as readonly. This will cause false
positive userfaults. We need to decide later if to eliminate them with a
flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).
hugetlbfs wouldn't need to worry about swapouts but and tmpfs would be
handled by a swap entry bit like anonymous memory.
The main problem with no easy solution to eliminate the false positives,
will be if/when userfaultfd is extended to real filesystem pagecache.
When the pagecache is freed by reclaim we can't leave the radix tree
pinned if the inode and in turn the radix tree is reclaimed as well.
The estimation is that full accuracy and lack of false positives could be
easily provided only to anonymous memory (as long as there's no fork or as
long as MADV_DONTFORK is used on the userfaultfd anonymous range) tmpfs
and hugetlbfs, it's most certainly worth to achieve it but in a later
incremental patch.
[peterx@redhat.com: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-3-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to keep ourselves from reporting pages that are just going to be
reused again in the case of heavy churn we can put a limit on how many
total pages we will process per pass. Doing this will allow the worker
thread to go into idle much more quickly so that we avoid competing with
other threads that might be allocating or freeing pages.
The logic added here will limit the worker thread to no more than one
sixteenth of the total free pages in a given area per list. Once that
limit is reached it will update the state so that at the end of the pass
we will reschedule the worker to try again in 2 seconds when the memory
churn has hopefully settled down.
Again this optimization doesn't show much of a benefit in the standard
case as the memory churn is minmal. However with page allocator shuffling
enabled the gain is quite noticeable. Below are the results with a THP
enabled version of the will-it-scale page_fault1 test showing the
improvement in iterations for 16 processes or threads.
Without:
tasks processes processes_idle threads threads_idle
16 8283274.75 0.17 5594261.00 38.15
With:
tasks processes processes_idle threads threads_idle
16 8767010.50 0.21 5791312.75 36.98
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than walking over the same pages again and again to get to the
pages that have yet to be reported we can save ourselves a significant
amount of time by simply rotating the list so that when we have a full
list of reported pages the head of the list is pointing to the next
non-reported page. Doing this should save us some significant time when
processing each free list.
This doesn't gain us much in the standard case as all of the non-reported
pages should be near the top of the list already. However in the case of
page shuffling this results in a noticeable improvement. Below are the
will-it-scale page_fault1 w/ THP numbers for 16 tasks with and without
this patch.
Without:
tasks processes processes_idle threads threads_idle
16 8093776.25 0.17 5393242.00 38.20
With:
tasks processes processes_idle threads threads_idle
16 8283274.75 0.17 5594261.00 38.15
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224708.29318.16862.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to pave the way for free page reporting in virtualized
environments we will need a way to get pages out of the free lists and
identify those pages after they have been returned. To accomplish this,
this patch adds the concept of a Reported Buddy, which is essentially
meant to just be the Uptodate flag used in conjunction with the Buddy page
type.
To prevent the reported pages from leaking outside of the buddy lists I
added a check to clear the PageReported bit in the del_page_from_free_list
function. As a result any reported page that is split, merged, or
allocated will have the flag cleared prior to the PageBuddy value being
cleared.
The process for reporting pages is fairly simple. Once we free a page
that meets the minimum order for page reporting we will schedule a worker
thread to start 2s or more in the future. That worker thread will begin
working from the lowest supported page reporting order up to MAX_ORDER - 1
pulling unreported pages from the free list and storing them in the
scatterlist.
When processing each individual free list it is necessary for the worker
thread to release the zone lock when it needs to stop and report the full
scatterlist of pages. To reduce the work of the next iteration the worker
thread will rotate the free list so that the first unreported page in the
free list becomes the first entry in the list.
It will then call a reporting function providing information on how many
entries are in the scatterlist. Once the function completes it will
return the pages to the free area from which they were allocated and start
over pulling more pages from the free areas until there are no longer
enough pages to report on to keep the worker busy, or we have processed as
many pages as were contained in the free area when we started processing
the list.
The worker thread will work in a round-robin fashion making its way though
each zone requesting reporting, and through each reportable free list
within that zone. Once all free areas within the zone have been processed
it will check to see if there have been any requests for reporting while
it was processing. If so it will reschedule the worker thread to start up
again in roughly 2s and exit.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are cases where we would benefit from avoiding having to go through
the allocation and free cycle to return an isolated page.
Examples for this might include page poisoning in which we isolate a page
and then put it back in the free list without ever having actually
allocated it.
This will enable us to also avoid notifiers for the future free page
reporting which will need to avoid retriggering page reporting when
returning pages that have been reported on.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to enable the use of the zone from the list manipulator functions
I will need access to the zone pointer. As it turns out most of the
accessors were always just being directly passed &zone->free_area[order]
anyway so it would make sense to just fold that into the function itself
and pass the zone and order as arguments instead of the free area.
In order to be able to reference the zone we need to move the declaration
of the functions down so that we have the zone defined before we define
the list manipulation functions. Since the functions are only used in the
file mm/page_alloc.c we can just move them there to reduce noise in the
header.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm / virtio: Provide support for free page reporting", v17.
This series provides an asynchronous means of reporting free guest pages
to a hypervisor so that the memory associated with those pages can be
dropped and reused by other processes and/or guests on the host. Using
this it is possible to avoid unnecessary I/O to disk and greatly improve
performance in the case of memory overcommit on the host.
When enabled we will be performing a scan of free memory every 2 seconds
while pages of sufficiently high order are being freed. In each pass at
least one sixteenth of each free list will be reported. By doing this we
avoid racing against other threads that may be causing a high amount of
memory churn.
The lowest page order currently scanned when reporting pages is
pageblock_order so that this feature will not interfere with the use of
Transparent Huge Pages in the case of virtualization.
Currently this is only in use by virtio-balloon however there is the hope
that at some point in the future other hypervisors might be able to make
use of it. In the virtio-balloon/QEMU implementation the hypervisor is
currently using MADV_DONTNEED to indicate to the host kernel that the page
is currently free. It will be zeroed and faulted back into the guest the
next time the page is accessed.
To track if a page is reported or not the Uptodate flag was repurposed and
used as a Reported flag for Buddy pages. We walk though the free list
isolating pages and adding them to the scatterlist until we either
encounter the end of the list or have processed at least one sixteenth of
the pages that were listed in nr_free prior to us starting. If we fill
the scatterlist before we reach the end of the list we rotate the list so
that the first unreported page we encounter is moved to the head of the
list as that is where we will resume after we have freed the reported
pages back into the tail of the list.
Below are the results from various benchmarks. I primarily focused on two
tests. The first is the will-it-scale/page_fault2 test, and the other is
a modified version of will-it-scale/page_fault1 that was enabled to use
THP. I did this as it allows for better visibility into different parts
of the memory subsystem. The guest is running with 32G for RAM on one
node of a E5-2630 v3. The host has had some features such as CPU turbo
disabled in the BIOS.
Test page_fault1 (THP) page_fault2
Name tasks Process Iter STDEV Process Iter STDEV
Baseline 1 1012402.50 0.14% 361855.25 0.81%
16 8827457.25 0.09% 3282347.00 0.34%
Patches Applied 1 1007897.00 0.23% 361887.00 0.26%
16 8784741.75 0.39% 3240669.25 0.48%
Patches Enabled 1 1010227.50 0.39% 359749.25 0.56%
16 8756219.00 0.24% 3226608.75 0.97%
Patches Enabled 1 1050982.00 4.26% 357966.25 0.14%
page shuffle 16 8672601.25 0.49% 3223177.75 0.40%
Patches enabled 1 1003238.00 0.22% 360211.00 0.22%
shuffle w/ RFC 16 8767010.50 0.32% 3199874.00 0.71%
The results above are for a baseline with a linux-next-20191219 kernel,
that kernel with this patch set applied but page reporting disabled in
virtio-balloon, the patches applied and page reporting fully enabled, the
patches enabled with page shuffling enabled, and the patches applied with
page shuffling enabled and an RFC patch that makes used of MADV_FREE in
QEMU. These results include the deviation seen between the average value
reported here versus the high and/or low value. I observed that during
the test memory usage for the first three tests never dropped whereas with
the patches fully enabled the VM would drop to using only a few GB of the
host's memory when switching from memhog to page fault tests.
Any of the overhead visible with this patch set enabled seems due to page
faults caused by accessing the reported pages and the host zeroing the
page before giving it back to the guest. This overhead is much more
visible when using THP than with standard 4K pages. In addition page
shuffling seemed to increase the amount of faults generated due to an
increase in memory churn. The overehad is reduced when using MADV_FREE as
we can avoid the extra zeroing of the pages when they are reintroduced to
the host, as can be seen when the RFC is applied with shuffling enabled.
The overall guest size is kept fairly small to only a few GB while the
test is running. If the host memory were oversubscribed this patch set
should result in a performance improvement as swapping memory in the host
can be avoided.
A brief history on the background of free page reporting can be found at:
https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/
This patch (of 9):
Move the head/tail adding logic out of the shuffle code and into the
__free_one_page function since ultimately that is where it is really
needed anyway. By doing this we should be able to reduce the overhead and
can consolidate all of the list addition bits in one spot.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some comments for MADV_FREE is revised and added to help people understand
the MADV_FREE code, especially the page flag, PG_swapbacked. This makes
page_is_file_cache() isn't consistent with its comments. So the function
is renamed to page_is_file_lru() to make them consistent again. All these
are put in one patch as one logical change.
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This updates get_user_pages()'s argument in ksm_test_exit()'s comment
Signed-off-by: Li Chen <chenli@uniontech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/30ac2417-f1c7-f337-0beb-df561295298c@uniontech.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e496cf3d78 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
notes that it should be reverted when the PowerPC problem was fixed. The
commit fixing the PowerPC problem (953c66c2b2) did not revert the
commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
oversight, so remove the Kconfig symbol and undo the work of commit
e496cf3d78.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The thp_fault_fallback and thp_file_fallback vmstats are incremented if
either the hugepage allocation fails through the page allocator or the
hugepage charge fails through mem cgroup.
This patch leaves this field untouched but adds two new fields,
thp_{fault,file}_fallback_charge, which is incremented only when the mem
cgroup charge fails.
This distinguishes between attempted hugepage allocations that fail due to
fragmentation (or low memory conditions) and those that fail due to mem
cgroup limits. That can be used to determine the impact of fragmentation
on the system by excluding faults that failed due to memcg usage.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jeremy Cline <jcline@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing thp_fault_fallback indicates when thp attempts to allocate a
hugepage but fails, or if the hugepage cannot be charged to the mem cgroup
hierarchy.
Extend this to shmem as well. Adds a new thp_file_fallback to complement
thp_file_alloc that gets incremented when a hugepage is attempted to be
allocated but fails, or if it cannot be charged to the mem cgroup
hierarchy.
Additionally, remove the check for CONFIG_TRANSPARENT_HUGE_PAGECACHE from
shmem_alloc_hugepage() since it is only called with this configuration
option.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jeremy Cline <jcline@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061421240.7412@chino.kir.corp.google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the migration code doesn't migrate PG_readahead flag.
Theoretically this would incur slight performance loss as the application
might have to ramp its readahead back up again. Even though such problem
happens, it might be hidden by something else since migration is typically
triggered by compaction and NUMA balancing, any of which should be more
noticeable.
Migrate the flag after end_page_writeback() since it may clear PG_reclaim
flag, which is the same bit as PG_readahead, for the new page.
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/1581640185-95731-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It can currently happen that we store the status of a page twice:
* Once we detect that it is already on the target node
* Once we moved a bunch of pages, and a page that's already on the
target node is contained in the current interval.
Let's simplify the code and always call do_move_pages_to_node() in case we
did not queue a page for migration. Note that pages that are already on
the target node are not added to the pagelist and are, therefore, ignored
by do_move_pages_to_node() - there is no functional change.
The status of such a page is now only stored once.
[david@redhat.com rephrase changelog]
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-5-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When pagelist is empty, it is not necessary to do the move and store.
Also it consolidate the empty list check in one place.
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-4-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Usually, do_move_pages_to_node() and store_status() are used in
combination. We have three similar call sites.
Let's provide a wrapper for both function calls -
move_pages_and_store_status - to make the calling code easier to maintain
and fix (as noted by Yang Shi, the return value handling of
do_move_pages_to_node() has a flaw).
[david@redhat.com rephrase changelog]
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-3-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "cleanup on do_pages_move()", v5.
The logic in do_pages_move() is a little mess for audience to read and has
some potential error on handling the return value. Especially there are
three calls on do_move_pages_to_node() and store_status() with almost the
same form.
This patch set tries to make the code a little friendly for audience by
consolidate the calls.
This patch (of 4):
At this point, we always have i >= start. If i == start, store_status()
will return 0. So we can drop the check for i > start.
[david@redhat.com rephrase changelog]
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200214003017.25558-2-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a typo in comment, fix it.
"exeeds" -> "exceeds"
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200404060136.10838-1-hqjagain@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 4e4a9eb921 ("mm/rmap.c: reuse mergeable
anon_vma as parent when fork").
In dup_mmap(), anon_vma_fork() is called for attaching anon_vma and
parameter 'tmp' (i.e., the new vma of child) has same ->vm_next and
->vm_prev as its parent vma. That causes the anon_vma used by parent been
mistakenly shared by child (In anon_vma_clone(), the code added by that
commit will do this reuse work).
Besides this issue, the design of reusing anon_vma from vma which has gone
through fork should be avoided ([1]). So, this patch reverts that commit
and maintains the consistent logic of reusing anon_vma for
fork/split/merge vma.
Reusing anon_vma within the process is fine. But if a vma has gone
through fork(), then that vma's anon_vma should not be shared with its
neighbor vma. As explained in [1], when vma gone through fork(), the
check for list_is_singular(vma->anon_vma_chain) will be false, and
don't share anon_vma.
With current issue, one example can clarify more. Parent process do
below two steps:
1. p_vma_1 is created and p_anon_vma_1 is prepared;
2. p_vma_2 is created and share p_anon_vma_1; (this is allowed,
becaues p_vma_1 didn't gothrough fork()); parent process do fork():
3. c_vma_1 is dup from p_vma_1, and has its own c_anon_vma_1
prepared; at this point, c_vma_1->anon_vma_chain has two items, one
for p_anon_vma_1 and one for c_anon_vma_1;
4. c_vma_2 is dup from p_vma_2, it is not allowed to share
c_anon_vma_1, because
c_vma_1->anon_vma_chain has two items.
[1] commit d0e9fe1758 ("Simplify and comment on anon_vma re-use for
anon_vma_prepare()") explains the test of "list_is_singular()".
Fixes: 4e4a9eb921 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork")
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1581150928-3214-3-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The root of the hierarchy cannot have high set, so we will never reclaim
based on it. This makes that clearer and avoids another entry.
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200312164137.GA1753625@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJehnToAAoJEAx081l5xIa+bYEP/3IW+bip83OSR/Ay/29qmeBh
FMZjz9G+jClVArea+8dlbmGohpQfkLuBiDBE1Ujxl9iqsm3STdIdbv9bHccqs2g8
mtptkZ5qKwuOi7NhcNG5E5vy60bEAbZ9/QtXok5nckega2sdP7cr+uzZgp/Zc/Vo
v9H8Wk6/l/MUF8agIXmgChpXII17lIyYbtbH5NV+PpsZMhAaAg2g4Z4vBP5Ue+Nc
myNcdzKLF3nq++gBfIZ4gzAAnnqN2eYFvkSdvRSdn9HuXcur1tQHjMwC/DJuk8h7
5dsaplrRLceMEqn6d61oWBJclPefXlkazvHzqNA9Zwr98yVev5h7tiT3BKNVTbKW
iPoXCt55fJosvXAsJxW4UgXZy7kMGZdZ8GmSlwmZsA0kJRvOuuvWChvu/ugwnIeR
DUWb5sa0Bn9aoczJ4Qq61O7CqtvhOf6NK24Jcc/HSk/iDbZ2tEnCPEXeCm0GibQ5
PAFLfE1fZUcEeZlOp+zbZ6ni6XbLL9LX2Dkum/3zEvhf1rdF+0692ZM4o9VwedAX
2TpE4kywhbYxhUq3MbyRzP3knu7pJYb0KCOfyg6Rqn/vCo17+PksRF+6XvzUVlzr
VtRYU87TVP5FqIw+e3yela2alP/oo4kEe37n536TcRgFtU7vItcCA5vLuDSOivjX
08B6Hy4QK2M0yKFuuAT5
=KO6E
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2020-04-03-1' of git://anongit.freedesktop.org/drm/drm
Pull drm hugepage support from Dave Airlie:
"This adds support for hugepages to TTM and has been tested with the
vmwgfx drivers, though I expect other drivers to start using it"
* tag 'drm-next-2020-04-03-1' of git://anongit.freedesktop.org/drm/drm:
drm/vmwgfx: Hook up the helpers to align buffer objects
drm/vmwgfx: Introduce a huge page aligning TTM range manager
drm: Add a drm_get_unmapped_area() helper
drm/vmwgfx: Support huge page faults
drm/ttm, drm/vmwgfx: Support huge TTM pagefaults
mm: Add vmf_insert_pfn_xxx_prot() for huge page-table entries
mm: Split huge pages on write-notify or COW
mm: Introduce vma_is_special_huge
fs: Constify vma argument to vma_is_dax
Pull cgroup updates from Tejun Heo:
- Christian extended clone3 so that processes can be spawned into
cgroups directly.
This is not only neat in terms of semantics but also avoids grabbing
the global cgroup_threadgroup_rwsem for migration.
- Daniel added !root xattr support to cgroupfs.
Userland already uses xattrs on cgroupfs for bookkeeping. This will
allow delegated cgroups to support such usages.
- Prateek tried to make cpuset hotplug handling synchronous but that
led to possible deadlock scenarios. Reverted.
- Other minor changes including release_agent_path handling cleanup.
* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1: Document the cpuset_v2_mode mount option
Revert "cpuset: Make cpuset hotplug synchronous"
cgroupfs: Support user xattrs
kernfs: Add option to enable user xattrs
kernfs: Add removed_size out param for simple_xattr_set
kernfs: kvmalloc xattr value instead of kmalloc
cgroup: Restructure release_agent_path handling
selftests/cgroup: add tests for cloning into cgroups
clone3: allow spawning processes into cgroups
cgroup: add cgroup_may_write() helper
cgroup: refactor fork helpers
cgroup: add cgroup_get_from_file() helper
cgroup: unify attach permission checking
cpuset: Make cpuset hotplug synchronous
cgroup.c: Use built-in RCU list checking
kselftest/cgroup: add cgroup destruction test
cgroup: Clean up css_set task traversal
- Promote numa_map_to_online_node() to a cross-kernel generic facility.
- Save x86 numa information to allow for node-id lookups for reserved
memory ranges, deploy that capability for the e820-pmem driver.
- Introduce phys_to_target_node() to facilitate drivers that want to
know resulting numa node if a given reserved address range was
onlined.
Huge page-table entries for TTM
In order to reduce CPU usage [1] and in theory TLB misses this patchset enables
huge- and giant page-table entries for TTM and TTM-enabled graphics drivers.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Hellstrom (VMware) <thomas_os@shipmail.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20200325073102.6129-1-thomas_os@shipmail.org
Pull percpu updates from Dennis Zhou:
"This is just a few documentation fixes for percpu refcount and bitmap
helpers that went in v5.6, and moving my emails to all be at korg"
* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: update copyright emails to dennis@kernel.org
include/bitmap.h: add new functions to documentation
include/bitmap.h: add missing parameter in docs
percpu_ref: Fix comment regarding percpu_ref_init flags
Merge updates from Andrew Morton:
"A large amount of MM, plenty more to come.
Subsystems affected by this patch series:
- tools
- kthread
- kbuild
- scripts
- ocfs2
- vfs
- mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
hugetlbfs, hugetlb"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
selftests/vm: fix map_hugetlb length used for testing read and write
mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
mm/hugetlb.c: clean code by removing unnecessary initialization
hugetlb_cgroup: add hugetlb_cgroup reservation docs
hugetlb_cgroup: add hugetlb_cgroup reservation tests
hugetlb: support file_region coalescing again
hugetlb_cgroup: support noreserve mappings
hugetlb_cgroup: add accounting for shared mappings
hugetlb: disable region_add file_region coalescing
hugetlb_cgroup: add reservation accounting for private mappings
mm/hugetlb_cgroup: fix hugetlb_cgroup migration
hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
hugetlb_cgroup: add hugetlb_cgroup reservation counter
hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
mm/memblock.c: remove redundant assignment to variable max_addr
mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
...
Pull exec/proc updates from Eric Biederman:
"This contains two significant pieces of work: the work to sort out
proc_flush_task, and the work to solve a deadlock between strace and
exec.
Fixing proc_flush_task so that it no longer requires a persistent
mount makes improvements to proc possible. The removal of the
persistent mount solves an old regression that that caused the hidepid
mount option to only work on remount not on mount. The regression was
found and reported by the Android folks. This further allows Alexey
Gladkov's work making proc mount options specific to an individual
mount of proc to move forward.
The work on exec starts solving a long standing issue with exec that
it takes mutexes of blocking userspace applications, which makes exec
extremely deadlock prone. For the moment this adds a second mutex with
a narrower scope that handles all of the easy cases. Which makes the
tricky cases easy to spot. With a little luck the code to solve those
deadlocks will be ready by next merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
signal: Extend exec_id to 64bits
pidfd: Use new infrastructure to fix deadlocks in execve
perf: Use new infrastructure to fix deadlocks in execve
proc: io_accounting: Use new infrastructure to fix deadlocks in execve
proc: Use new infrastructure to fix deadlocks in execve
kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
kernel: doc: remove outdated comment cred.c
mm: docs: Fix a comment in process_vm_rw_core
selftests/ptrace: add test cases for dead-locks
exec: Fix a deadlock in strace
exec: Add exec_update_mutex to replace cred_guard_mutex
exec: Move exec_mmap right after de_thread in flush_old_exec
exec: Move cleanup of posix timers on exec out of de_thread
exec: Factor unshare_sighand out of de_thread and call it separately
exec: Only compute current once in flush_old_exec
pid: Improve the comment about waiting in zap_pid_ns_processes
proc: Remove the now unnecessary internal mount of proc
uml: Create a private mount of proc for mconsole
uml: Don't consult current to find the proc_mnt in mconsole_proc
proc: Use a list of inodes to flush from proc
...
Commit f1e61557f0 ("mm: pack compound_dtor and compound_order into one
word in struct page") changed compound_dtor from a pointer to an array
index in order to pack it. To check if page has the hugeltbfs
compound_dtor, we can just compare the index directly without fetching the
function pointer. Said commit did that with PageHuge() and we can do the
same with PageHeadHuge() to make the code a bit smaller and faster.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Neha Agarwal <nehaagarwal@google.com>
Link: http://lkml.kernel.org/r/20200311172440.6988-1-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously variable 'check_addr' was initialized, but was not read later
before reassigning. So the initialization can be removed.
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Link: http://lkml.kernel.org/r/20200303212354.25226-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An earlier patch in this series disabled file_region coalescing in order
to hang the hugetlb_cgroup uncharge info on the file_region entries.
This patch re-adds support for coalescing of file_region entries.
Essentially everytime we add an entry, we call a recursive function that
tries to coalesce the added region with the regions next to it. The worst
case call depth for this function is 3: one to coalesce with the region
next to it, one to coalesce to the region prev, and one to reach the base
case.
This is an important performance optimization as private mappings add
their entries page by page, and we could incur big performance costs for
large mappings with lots of file_region entries in their resv_map.
[almasrymina@google.com: fix CONFIG_CGROUP_HUGETLB ifdefs]
Link: http://lkml.kernel.org/r/20200214204544.231482-1-almasrymina@google.com
[almasrymina@google.com: remove check_coalesce_bug debug code]
Link: http://lkml.kernel.org/r/20200219233610.13808-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-7-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support MAP_NORESERVE accounting as part of the new counter.
For each hugepage allocation, at allocation time we check if there is a
reservation for this allocation or not. If there is a reservation for
this allocation, then this allocation was charged at reservation time, and
we don't re-account it. If there is no reserevation for this allocation,
we charge the appropriate hugetlb_cgroup.
The hugetlb_cgroup to uncharge for this allocation is stored in
page[3].private. We use new APIs added in an earlier patch to set this
pointer.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-6-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.
After a call to region_chg, we charge the approprate hugetlb_cgroup, and
if successful, we pass on the hugetlb_cgroup info to a follow up
region_add call. When a file_region entry is added to the resv_map via
region_add, we put the pointer to that cgroup in
file_region->reservation_counter. If charging doesn't succeed, we report
the error to the caller, so that the kernel fails the reservation.
On region_del, which is when the hugetlb memory is unreserved, we also
uncharge the file_region->reservation_counter.
[akpm@linux-foundation.org: forward declare struct file_region]
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A follow up patch in this series adds hugetlb cgroup uncharge info the
file_region entries in resv->regions. The cgroup uncharge info may differ
for different regions, so they can no longer be coalesced at region_add
time. So, disable region coalescing in region_add in this patch.
Behavior change:
Say a resv_map exists like this [0->1], [2->3], and [5->6].
Then a region_chg/add call comes in region_chg/add(f=0, t=5).
Old code would generate resv->regions: [0->5], [5->6].
New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
[5->6].
Special care needs to be taken to handle the resv->adds_in_progress
variable correctly. In the past, only 1 region would be added for every
region_chg and region_add call. But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in
region_chg, or decrement adds_in_progress by 1 after region_add or
region_abort. Instead, region_chg calls add_reservation_in_range() to
count the number of regions needed and allocates those, and that info is
passed to region_add and region_abort to decrement adds_in_progress
correctly.
We've also modified the assumption that region_add after region_chg never
fails. region_chg now pre-allocates at least 1 region for region_add. If
region_add needs more regions than region_chg has allocated for it, then
it may fail.
[almasrymina@google.com: fix file_region entry allocations]
Link: http://lkml.kernel.org/r/20200219012736.20363-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Link: http://lkml.kernel.org/r/20200211213128.73302-4-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Normally the pointer to the cgroup to uncharge hangs off the struct page,
and gets queried when it's time to free the page. With hugetlb_cgroup
reservations, this is not possible. Because it's possible for a page to
be reserved by one task and actually faulted in by another task.
The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code patch to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.
For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in private mapping, the tasks
hugetlb_cgroup is charged and the hugetlb_cgroup is placed is
resv_map->reservation_counter.
On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
[akpm@linux-foundation.org: forward declare struct resv_map]
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge
hugetlb reservations") mistakingly doesn't handle the migration of *both*
the reservation hugetlb_cgroup and the fault hugetlb_cgroup correctly.
What should happen is that both cgroups shuold be queried from the old
page, then both set to NULL on the old page, then both inserted into the
new page.
The mistake also creates the following warning:
mm/hugetlb_cgroup.c: In function 'hugetlb_cgroup_migrate':
mm/hugetlb_cgroup.c:777:25: warning: variable 'h_cg' set but not used
[-Wunused-but-set-variable]
struct hugetlb_cgroup *h_cg;
^~~~
Solution is to add the missing steps, namly setting the reservation
hugetlb_cgroup to NULL on the old page, and setting the fault
hugetlb_cgroup on the new page.
Fixes: c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200218194727.46995-1-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage
or hugetlb reservation counter.
Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.
Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These counters will track hugetlb reservations rather than hugetlb memory
faulted in. This patch only adds the counter, following patches add the
charging and uncharging of the counter.
This is patch 1 of an 9 patch series.
Problem:
Currently tasks attempting to reserve more hugetlb memory than is
available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
Reservations [1]. However, if a task attempts to reserve more hugetlb
memory than its hugetlb_cgroup limit allows, the kernel will allow the
mmap/shmget call, but will SIGBUS the task when it attempts to fault in
the excess memory.
We have users hitting their hugetlb_cgroup limits and thus we've been
looking at this failure mode. We'd like to improve this behavior such
that users violating the hugetlb_cgroup limits get an error on mmap/shmget
time, rather than getting SIGBUS'd when they try to fault the excess
memory in. This gives the user an opportunity to fallback more gracefully
to non-hugetlbfs memory for example.
The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
enforcing the hugetlb_cgroup limit only happens at fault time, and the
offending task gets SIGBUS'd.
Proposed Solution:
A new page counter named
'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
slightly different semantics than
'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
- While usage_in_bytes tracks all *faulted* hugetlb memory,
rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than rsvd.limit_in_bytes, the kernel will fail this
reservation.
This proposal is implemented in this patch series, with tests to verify
functionality and show the usage.
Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to the
existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
duplication with hugetlb_cgroup. Keeping hugetlb related page counters
under hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that
modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
accounting at reservation time rather than fault time. Adding a new
page_counter seems better as userspace could, if it wants, choose to
enforce different cgroups differently: one via limit_in_bytes, and
another via rsvd.limit_in_bytes. This could be very useful if you're
transitioning how hugetlb memory is partitioned on your system one
cgroup at a time, for example. Also, someone may find usage for both
limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
gives them the option to do so.
Testing:
- Added tests passing.
- Used libhugetlbfs for regression testing.
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem in
write mode when modifying i_size. In this way, truncation can not
proceed when page faults are being processed. In addition, i_size
will not change during fault processing so a single check can be made
to ensure faults are not beyond (proposed) end of file. Faults can
still race with hole punch, but that race is handled by existing code
and the use of hugetlb_fault_mutex.
With this modification, checks for races with truncation in the page
fault path can be simplified and removed. remove_inode_hugepages no
longer needs to take hugetlb_fault_mutex in the case of truncation.
Comments are expanded to explain reasoning behind locking.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:
1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.
A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.
To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.
To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.
The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.
[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
This patch (of 2):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.
mapping->i_mmap_rwsem
hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
page->flags PG_locked (lock_page)
To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.
In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable max_addr is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200228235003.112718-1-colin.king@canonical.com
Addresses-Coverity: ("Unused value")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using an empty (malformed) nodelist that is not caught during mount option
parsing leads to a stack-out-of-bounds access.
The option string that was used was: "mpol=prefer:,". However,
MPOL_PREFERRED requires a single node number, which is not being provided
here.
Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's
nodeid.
Fixes: 095f1fc4eb ("mempolicy: rework shmem mpol parsing and display")
Reported-by: Entropy Moe <3ntr0py1337@gmail.com>
Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VM_BUG_ON() is already used by queue_pages_test_walk(), it sounds
better to dump more debug information by using VM_BUG_ON_VMA() to help
debugging.
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Li Xinhai" <lixinhai.lxh@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1579068565-110432-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vma_migratable() is called to check if pages in vma can be migrated before
go ahead to further actions. Currently it is used in below code path:
- task_numa_work
- mbind
- move_pages
For hugetlb mapping, whether vma is migratable or not is determined by:
- CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
- arch_hugetlb_migration_supported
Issue: current code only checks for CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
alone, and no code should use it directly. (note that current code in
vma_migratable don't cause failure or bug because
unmap_and_move_huge_page() will catch unsupported hugepage and handle it
properly)
This patch checks the two factors by hugepage_migration_supported for
impoving code logic and robustness. It will enable early bail out of
hugepage migration procedure, but because currently all architecture
supporting hugepage migration is able to support all page size, we would
not see performance gain with this patch applied.
vma_migratable() is moved to mm/mempolicy.c, because of the circular
reference of mempolicy.h and hugetlb.h cause defining it as inline not
feasible.
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: http://lkml.kernel.org/r/1579786179-30633-1-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MPOL_MF_STRICT is used in mbind() for purposes:
(1) MPOL_MF_STRICT is set alone without MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL, to check if there is misplaced page and return -EIO;
(2) MPOL_MF_STRICT is set with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL, to
check if there is misplaced page which is failed to isolate, or page
is success on isolate but failed to move, and return -EIO.
For non hugepage mapping, (1) and (2) are implemented as expectation. For
hugepage mapping, (1) is not implemented. And in (2), the part about
failed to isolate and report -EIO is not implemented.
This patch implements the missed parts for hugepage mapping. Benefits
with it applied:
- User space can apply same code logic to handle mbind() on hugepage and
non hugepage mapping;
- Reliably using MPOL_MF_STRICT alone to check whether there is
misplaced page or not when bind policy on address range, especially for
address range which contains both hugepage and non hugepage mapping.
Analysis of potential impact to existing users:
- If MPOL_MF_STRICT alone was previously used, hugetlb pages not
following the memory policy would not cause an EIO error. After this
change, hugetlb pages are treated like all other pages. If
MPOL_MF_STRICT alone is used and hugetlb pages do not follow memory
policy an EIO error will be returned.
- For users who using MPOL_MF_STRICT with MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL, the semantic about some pages could not be moved will
not be changed by this patch, because failed to isolate and failed to
move have same effects to users, so their existing code will not be
impacted.
In mbind man page, the note about 'MPOL_MF_STRICT is ignored on huge page
mappings' can be removed after this patch is applied.
Mike:
: The current behavior with MPOL_MF_STRICT and hugetlb pages is inconsistent
: and does not match documentation (as described above). The special
: behavior for hugetlb pages ideally should have been removed when hugetlb
: page migration was introduced. It is unlikely that anyone relies on
: today's inconsistent behavior, and removing one more case of special
: handling for hugetlb pages is a good thing.
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-man <linux-man@vger.kernel.org>
Link: http://lkml.kernel.org/r/1581559627-6206-1-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously 0 was assigned to variable 'last_migrated_pfn'. But the
variable is not read after that, so the assignment can be removed.
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/20200318174509.15021-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 5bbe3547aa ("mm: allow compaction of unevictable pages")
it is allowed to examine mlocked pages and compact them by default. On
-RT even minor pagefaults are problematic because it may take a few 100us
to resolve them and until then the task is blocked.
Make compact_unevictable_allowed = 0 default and issue a warning on RT if
it is changed.
[bigeasy@linutronix.de: v5]
Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan reports:
The patch 5e1f0f098b: "mm, compaction: capture a page under direct
compaction" from Mar 5, 2019, leads to the following Smatch complaint:
mm/compaction.c:2321 compact_zone_order()
error: we previously assumed 'capture' could be null (see line 2313)
mm/compaction.c
2288 static enum compact_result compact_zone_order(struct zone *zone, int order,
2289 gfp_t gfp_mask, enum compact_priority prio,
2290 unsigned int alloc_flags, int classzone_idx,
2291 struct page **capture)
^^^^^^^
2313 if (capture)
^^^^^^^
Check for NULL
2314 current->capture_control = &capc;
2315
2316 ret = compact_zone(&cc, &capc);
2317
2318 VM_BUG_ON(!list_empty(&cc.freepages));
2319 VM_BUG_ON(!list_empty(&cc.migratepages));
2320
2321 *capture = capc.page;
^^^^^^^^
Unchecked dereference.
2322 current->capture_control = NULL;
2323
In practice this is not an issue, as the only caller path passes non-NULL
capture:
__alloc_pages_direct_compact()
struct page *page = NULL;
try_to_compact_pages(capture = &page);
compact_zone_order(capture = capture);
So let's remove the unnecessary check, which should also make Smatch happy.
Fixes: 5e1f0f098b ("mm, compaction: capture a page under direct compaction")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/18b0df3c-0589-d96c-23fa-040798fee187@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code to implement THP migrations already exists, and the code for CMA
to clear out a region of memory already exists.
Only a few small tweaks are needed to allow CMA to move THP memory when
attempting an allocation from alloc_contig_range.
With these changes, migrating THPs from a CMA area works when allocating a
1GB hugepage from CMA memory.
[riel@surriel.com: fix hugetlbfs pages per Mike, cleanup per Vlastimil]
Link: http://lkml.kernel.org/r/20200228104700.0af2f18d@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200227213238.1298752-2-riel@surriel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fix THP migration for CMA allocations", v2.
Transparent huge pages are allocated with __GFP_MOVABLE, and can end up in
CMA memory blocks. Transparent huge pages also have most of the
infrastructure in place to allow migration.
However, a few pieces were missing, causing THP migration to fail when
attempting to use CMA to allocate 1GB hugepages.
With these patches in place, THP migration from CMA blocks seems to work,
both for anonymous THPs and for tmpfs/shmem THPs.
This patch (of 2):
Add information to struct compact_control to indicate that the allocator
would really like to clear out this specific part of memory, used by for
example CMA.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200227213238.1298752-1-riel@surriel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sc->memcg_low_skipped resets skipped_deactivate to 0 but this is not
needed as this code path is never reachable with skipped_deactivate != 0
due to previous sc->skipped_deactivate branch.
[mhocko@kernel.org: rewrite changelog]
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200319165938.23354-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This gives some size improvement:
$size mm/vmscan.o (before)
text data bss dec hex filename
53670 24123 12 77805 12fed mm/vmscan.o
$size mm/vmscan.o (after)
text data bss dec hex filename
53648 24123 12 77783 12fd7 mm/vmscan.o
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/Message-ID:
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously 0 was assigned to variable 'lruvec_size', but the variable was
never read later. So the assignment can be removed.
Fixes: f87bccde6a ("mm/vmscan: remove unused lru_pages argument")
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200229214022.11853-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pgdat->kswapd_classzone_idx could be accessed concurrently in
wakeup_kswapd(). Plain writes and reads without any lock protection
result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as
well as saving a branch (compilers might well optimize the original code
in an unintentional way anyway). While at it, also take care of
pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim(). The
data races were reported by KCSAN,
BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd
write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13:
wakeup_kswapd+0xf1/0x400
wakeup_kswapd at mm/vmscan.c:3967
wake_all_kswapds+0x59/0xc0
wake_all_kswapds at mm/page_alloc.c:4241
__alloc_pages_slowpath+0xdcc/0x1290
__alloc_pages_slowpath at mm/page_alloc.c:4512
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x16e/0x6f0
__handle_mm_fault+0xcd5/0xd40
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
1 lock held by mtest01/7454:
#0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at:
do_page_fault+0x143/0x6f9
do_user_addr_fault at arch/x86/mm/fault.c:1405
(inlined by) do_page_fault at arch/x86/mm/fault.c:1539
irq event stamp: 6944085
count_memcg_event_mm+0x1a6/0x270
count_memcg_event_mm+0x119/0x270
__do_softirq+0x34c/0x57c
irq_exit+0xa2/0xc0
read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38:
wakeup_kswapd+0xc8/0x400
wake_all_kswapds+0x59/0xc0
__alloc_pages_slowpath+0xdcc/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x16e/0x6f0
__handle_mm_fault+0xcd5/0xd40
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
1 lock held by mtest01/7472:
#0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at:
do_page_fault+0x143/0x6f9
irq event stamp: 6793561
count_memcg_event_mm+0x1a6/0x270
count_memcg_event_mm+0x119/0x270
__do_softirq+0x34c/0x57c
irq_exit+0xa2/0xc0
BUG: KCSAN: data-race in kswapd / wakeup_kswapd
write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6:
kswapd+0x27c/0x8d0
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0:
wakeup_kswapd+0xf3/0x450
wake_all_kswapds+0x59/0xc0
__alloc_pages_slowpath+0xdcc/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kswapd kernel thread starts either with a CPU affinity set to the full cpu
mask of its target node or without any affinity at all if the node is
CPUless. There is a cpu hotplug callback (kswapd_cpu_online) that
implements an elaborate way to update this mask when a cpu is onlined.
It is not really clear whether there is any actual benefit from this
scheme. Completely CPU-less NUMA nodes rarely gain a new CPU during
runtime. Drop the code for that reason. If there is a real usecase then
we can resurrect and simplify the code.
[mhocko@suse.com rewrite changelog]
Suggested-by: Michal Hocko <mhocko@suse.org>
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20200218224422.3407-1-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit 98fa15f34c ("mm: replace all open encodings for
NUMA_NO_NODE") did the replacement across the kernel tree, but we got
some more in vmscan.c since then.
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1581568298-45317-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use mem_cgroup_is_root() API to check if memcg is root memcg instead of
open coding.
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1581398649-125989-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When kstrndup fails, no memory was allocated and we can exit directly.
[david@redhat.com: reword changelog]
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1581398649-125989-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously if branch condition was false, the assignment was not executed.
The assignment can be safely executed even when the condition is false
and it is not incorrect as it assigns the value of 'nodemask' to
'ac.nodemask' which already has the same value.
So as the assignment can be executed unconditionally, the branch can be
removed.
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200307225335.31300-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use free_area_empty() API to replace list_empty() for better code
readability.
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: http://lkml.kernel.org/r/1583674354-7713-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes ALLOC_KSWAPD equal to __GFP_KSWAPD_RECLAIM (cast to int).
Thanks to that code like:
if (gfp_mask & __GFP_KSWAPD_RECLAIM)
alloc_flags |= ALLOC_KSWAPD;
can be changed to:
alloc_flags |= (__force int) (gfp_mask &__GFP_KSWAPD_RECLAIM);
Thanks to this one branch less is generated in the assembly.
In case of ALLOC_KSWAPD flag two branches are saved, first one in code
that always executes in the beginning of page allocation and the second
one in loop in page allocator slowpath.
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/20200304162118.14784-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the vm.min_free_kbytes sysctl value is capped at a hardcoded
64M in init_per_zone_wmark_min (unless it is overridden by khugepaged
initialization).
This value has not been modified since 2005, and enterprise-grade systems
now frequently have hundreds of GB of RAM and multiple 10, 40, or even 100
GB NICs. We have seen page allocation failures on heavily loaded systems
related to NIC drivers. These issues were resolved by an increase to
vm.min_free_kbytes.
This patch increases the hardcoded value by a factor of 4 as a temporary
solution.
Further work to make the calculation of vm.min_free_kbytes more consistent
throughout the kernel would be desirable.
As an example of the inconsistency of the current method, this value is
recalculated by init_per_zone_wmark_min() in the case of memory hotplug
which will override the value set by set_recommended_min_free_kbytes()
called during khugepaged initialization even if khugepaged remains
enabled, however an on/off toggle of khugepaged will then recalculate and
set the value via set_recommended_min_free_kbytes().
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Rafael Aquini <aquini@redhat.com>
Link: http://lkml.kernel.org/r/20200220150103.5183-1-jsavitz@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fix the missing underflow in memory operation function", v4.
The patchset helps to produce a KASAN report when size is negative in
memory operation functions. It is helpful for programmer to solve an
undefined behavior issue. Patch 1 based on Dmitry's review and
suggestion, patch 2 is a test in order to verify the patch 1.
[1]https://bugzilla.kernel.org/show_bug.cgi?id=199341
[2]https://lore.kernel.org/linux-arm-kernel/20190927034338.15813-1-walter-zh.wu@mediatek.com/
This patch (of 2):
KASAN missed detecting size is a negative number in memset(), memcpy(),
and memmove(), it will cause out-of-bounds bug. So needs to be detected
by KASAN.
If size is a negative number, then it has a reason to be defined as
out-of-bounds bug type. Casting negative numbers to size_t would indeed
turn up as a large size_t and its value will be larger than ULONG_MAX/2,
so that this can qualify as out-of-bounds.
KASAN report is shown below:
BUG: KASAN: out-of-bounds in kmalloc_memmove_invalid_size+0x70/0xa0
Read of size 18446744073709551608 at addr ffffff8069660904 by task cat/72
CPU: 2 PID: 72 Comm: cat Not tainted 5.4.0-rc1-next-20191004ajb-00001-gdb8af2f372b2-dirty #1
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x288
show_stack+0x14/0x20
dump_stack+0x10c/0x164
print_address_description.isra.9+0x68/0x378
__kasan_report+0x164/0x1a0
kasan_report+0xc/0x18
check_memory_region+0x174/0x1d0
memmove+0x34/0x88
kmalloc_memmove_invalid_size+0x70/0xa0
[1] https://bugzilla.kernel.org/show_bug.cgi?id=199341
[cai@lca.pw: fix -Wdeclaration-after-statement warn]
Link: http://lkml.kernel.org/r/1583509030-27939-1-git-send-email-cai@lca.pw
[peterz@infradead.org: fix objtool warning]
Link: http://lkml.kernel.org/r/20200305095436.GV2596@hirez.programming.kicks-ass.net
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Link: http://lkml.kernel.org/r/20191112065302.7015-1-walter-zh.wu@mediatek.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When allocating memmap for hot added memory with the classic sparse, the
specified 'nid' is ignored in populate_section_memmap().
While in allocating memmap for the classic sparse during boot, the node
given by 'nid' is preferred. And VMEMMAP prefers the node of 'nid' in
both boot stage and memory hot adding. So seems no reason to not respect
the node of 'nid' for the classic sparse when hot adding memory.
Use kvmalloc_node instead to use the passed in 'nid'.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200316125625.GH3486@MiWiFi-R3L-srv
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change makes populate_section_memmap()/depopulate_section_memmap
much simpler.
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200316125450.GG3486@MiWiFi-R3L-srv
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After introducing mem sub section concept, pfn_present() loses its literal
meaning, and will not be necessary a truth on partial populated mem
section.
Since all of the callers use it to judge an absent section, it is better
to rename pfn_present() as pfn_in_present_section().
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Leonardo Bras <leonardo@linux.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Link: http://lkml.kernel.org/r/1581919110-29575-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memmap should be the address to page struct instead of address to pfn.
As mentioned by David, if system memory and devmem sit within a section,
the mismatch address would lead kdump to dump unexpected memory.
Since sub-section only works for SPARSEMEM_VMEMMAP, pfn_to_page() is valid
to get the page struct address at this point.
Fixes: ba72b4c8cf ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Baoquan He <bhe@redhat.com>
Link: http://lkml.kernel.org/r/20200210005048.10437-1-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set,
the source mapping will not be removed. The remap operation will be
performed as it would have been normally by moving over the page tables to
the new mapping. The old vma will have any locked flags cleared, have no
pagetables, and any userfaultfds that were watching that range will
continue watching it.
For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
the mremap() call to fail. Because MREMAP_DONTUNMAP always results in
moving a VMA you MUST use the MREMAP_MAYMOVE flag, it's not possible to
resize a VMA while also moving with MREMAP_DONTUNMAP so old_len must
always be equal to the new_len otherwise it will return -EINVAL.
We hope to use this in Chrome OS where with userfaultfd we could write an
anonymous mapping to disk without having to STOP the process or worry
about VMA permission changes.
This feature also has a use case in Android, Lokesh Gidra has said that
"As part of using userfaultfd for GC, We'll have to move the physical
pages of the java heap to a separate location. For this purpose mremap
will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
heap, its virtual mapping will be removed as well. Therefore, we'll
require performing mmap immediately after. This is not only time
consuming but also opens a time window where a native thread may call mmap
and reserve the java heap's address range for its own usage. This flag
solves the problem."
[bgeffon@google.com: v6]
Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com
[bgeffon@google.com: v7]
Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Even on 64 bit kernel, the mmap failure can happen for a 32 bit task.
Virtual memory space shortage of a task on mmap is reported to userspace
as -ENOMEM. It can be confused as physical memory shortage of overall
system.
The vm_unmapped_area can be called to by some drivers or other kernel core
system like filesystem. In my platform, GPU driver calls to
vm_unmapped_area and the driver returns -ENOMEM even in GPU side shortage.
It can be hard to distinguish which code layer returns the -ENOMEM.
Create mmap trace file and add trace point of vm_unmapped_area.
i.e.)
277.156599: vm_unmapped_area: addr=77e0d03000 err=0 total_vm=0x17014b flags=0x1 len=0x400000 lo=0x8000 hi=0x7878c27000 mask=0x0 ofs=0x1
342.838740: vm_unmapped_area: addr=0 err=-12 total_vm=0xffb08 flags=0x0 len=0x100000 lo=0x40000000 hi=0xfffff000 mask=0x0 ofs=0x22
[akpm@linux-foundation.org: prefix address printk with 0x, per Matthew]
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200320055823.27089-3-jaewon31.kim@samsung.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: mmap: add mmap trace point", v3.
Create mmap trace file and add trace point of vm_unmapped_area().
This patch (of 2):
In preparation for next patch remove inline of vm_unmapped_area and move
code to mmap.c. There is no logical change.
Also remove unmapped_area[_topdown] out of mm.h, there is no code
calling to them.
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20200320055823.27089-2-jaewon31.kim@samsung.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The param "start" actually referes to the physical memory start, which is
to be mapped into virtual area vma. And it is the field vma->vm_start
which stands for the start of the area.
Most of the time, we do not read through whole implementation of a
function but only the definition and essential comments. Accurate
comments are definitely the base stone.
Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200318052206.105104-1-wenhu.wang@vivo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It really made me scratch my head. Replace the comment with an accurate
and consistent description.
The parameter pfn actually refers to the page frame number which is
right-shifted by PAGE_SHIFT from the physical address.
Signed-off-by: WANG Wenhu <wenhu.wang@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200310073955.43415-1-wenhu.wang@vivo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing gup code does not react to the fatal signals in many code
paths. For example, in one retry path of gup we're still using
down_read() rather than down_read_killable(). Also, when doing page
faults we don't pass in FAULT_FLAG_KILLABLE as well, which means that
within the faulting process we'll wait in non-killable way as well. These
were spotted by Linus during the code review of some other patches.
Let's allow the gup code to react to fatal signals to improve the
responsiveness of threads when during gup and being killed.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160256.9887-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the gup counterpart of the change that allows the VM_FAULT_RETRY
to happen for more than once. One thing to mention is that we must check
the fatal signal here before retry because the GUP can be interrupted by
that, otherwise we can loop forever.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to retry once. We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time. This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle the
page fault on a single page. However that should hardly happen, and after
all for each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
now can retry the page fault for multiple times if necessary without the
need to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault allows to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In existing code we have multiple places that has taken special care of
the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
even the 2nd try will have the ALLOW_RETRY set, then use that helper in
all existing special paths. One example is in __lock_page_or_retry(), now
we'll drop the mmap_sem only in the first attempt of page fault and we'll
keep it in follow up retries, so old locking behavior will be retained.
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When follow_hugetlb_page() returns with *locked==0, it means we've got a
VM_FAULT_RETRY within the fauling process and we've released the mmap_sem.
When that happens, we should stop and bail out.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220155353.8676-3-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: Page fault enhancements", v6.
This series contains cleanups and enhancements to current page fault
logic. The whole idea comes from the discussion between Andrea and Linus
on the bug reported by syzbot here:
https://lkml.org/lkml/2017/11/2/833
Basically it does two things:
(a) Allows the page fault logic to be more interactive on not only
SIGKILL, but also the rest of userspace signals, and,
(b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
than once.
For (a): with the changes we should be able to react faster when page
faults are working in parallel with userspace signals like SIGSTOP and
SIGCONT (and more), and with that we can remove the buggy part in
userfaultfd and benefit the whole page fault mechanism on faster signal
processing to reach the userspace.
For (b), we should be able to allow the page fault handler to loop for
even more than twice. Some context: for now since we have
FAULT_FLAG_ALLOW_RETRY we can allow to retry the page fault once with the
same interrupt context, however never more than twice. This can be not
only a potential cleanup to remove this assumption since AFAIU the code
itself doesn't really have this twice-only limitation (though that should
be a protective approach in the past), at the same time it'll greatly
simplify future works like userfaultfd write-protect where it's possible
to retry for more than twice (please have a look at [1] below for a
possible user that might require the page fault to be handled for a third
time; if we can remove the retry limitation we can simply drop that patch
and those complexity).
This patch (of 16):
There's plenty of places around __get_user_pages() that has a parameter
"nonblocking" which does not really mean that "it won't block" (because it
can really block) but instead it shows whether the mmap_sem is released by
up_read() during the page fault handling mostly when VM_FAULT_RETRY is
returned.
We have the correct naming in e.g. get_user_pages_locked() or
get_user_pages_remote() as "locked", however there're still many places
that are using the "nonblocking" as name.
Renaming the places to "locked" where proper to better suite the
functionality of the variable. While at it, fixing up some of the
comments accordingly.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the declaration and definition for is_vma_temporary_stack() are
scattered. Lets make is_vma_temporary_stack() helper available for
general use and also drop the declaration from (include/linux/huge_mm.h)
which is no longer required. While at this, rename this as
vma_is_temporary_stack() in line with existing helpers. This should not
cause any functional change.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/vma: some more minor changes", v2.
The motivation here is to consolidate VMA flags and helpers in generic
memory header and reduce code duplication when ever applicable. If there
are other possible similar instances which might be missing here, please
do let me me know. I will be happy to incorporate them.
This patch (of 3):
Move VM_NO_KHUGEPAGED into generic header (include/linux/mm.h). This just
makes sure that no VMA flag is scattered in individual function files any
longer. While at this, fix an old comment which is no longer valid. This
should not cause any functional change.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1582782965-3274-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following the update of pagewalk code commit a07984d48146 ("mm: pagewalk:
add p4d_entry() and pgd_entry()") we can modify the mapping_dirty_helpers'
huge page-table entry callbacks to avoid splitting when a huge pud or -pmd
is encountered.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Steven Price <steven.price@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200203154305.15045-1-thomas_os@shipmail.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a task is getting moved out of the OOMing cgroup, it might result in
unexpected OOM killings if memory.oom.group is used anywhere in the cgroup
tree.
Imagine the following example:
A (oom.group = 1)
/ \
(OOM) B C
Let's say B's memory.max is exceeded and it's OOMing. The OOM killer
selects a task in B as a victim, but someone asynchronously moves the task
into C. mem_cgroup_get_oom_group() will iterate over all ancestors of C
up to the root cgroup. In theory it had to stop at the oom_domain level -
the memory cgroup which is OOMing. But because B is not an ancestor of C,
it's not happening. Instead it chooses A (because it's oom.group is set),
and kills all tasks in A. This behavior is wrong because the OOM happened
in B, so there is no reason to kill anything outside.
Fix this by checking it the memory cgroup to which the task belongs is a
descendant of the oom_domain. If not, memory.oom.group should be ignored,
and the OOM killer should kill only the victim task.
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/20200316223510.3176148-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The read side of this is all protected, but we can still tear if multiple
iterations of mem_cgroup_protected are going at the same time.
There's some intentional racing in mem_cgroup_protected which is ok, but
load/store tearing should be avoided.
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/d1e9fbc0379fe8db475d82c8b6fbe048876e12ae.1584034301.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The write side of this is xchg()/smp_mb(), so that's all good. Just a few
sites missing a READ_ONCE.
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/bbec2c3d822217334855c8877a9d28b2a6d395fb.1584034301.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This can be set concurrently with reads, which may cause the wrong value
to be propagated.
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/e809b4e6b0c1626dac6945970de06409a180ee65.1584034301.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This can be set concurrently with reads, which may cause the wrong value
to be propagated.
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/448206f44b0fa7be9dad2ca2601d2bcb2c0b7844.1584034301.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This one is a bit more nuanced because we have memcg_max_mutex, which is
mostly just used for enforcing invariants, but we still need to READ_ONCE
since (despite its name) it doesn't really protect memory.max access.
On write (page_counter_set_max() and memory_max_write()) we use xchg(),
which uses smp_mb(), so that's already fine.
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/50a31e5f39f8ae6c8fb73966ba1455f0924e8f44.1584034301.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A mem_cgroup's high attribute can be concurrently set at the same time as
we are trying to read it -- for example, if we are in memory_high_write at
the same time as we are trying to do high reclaim.
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/2f66f7038ed1d4688e59de72b627ae0ea52efa83.1584034301.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_id_get_many() is currently used only when MMU or MEMCG_SWAP
configuration options are enabled. Having them disabled triggers the
following warning at compile time:
linux/mm/memcontrol.c:4797:13: warning: `mem_cgroup_id_get_many' defined but not used [-Wunused-function]
static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
Make mem_cgroup_id_get_many() __maybe_unused to address the issue.
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200305164354.48147-1-vincenzo.frascino@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently multiple locations in memcg code, css_tryget_online() is being
used. However it doesn't matter whether the cgroup is online for the
callers. Online used to matter when we had reparenting on offlining and
we needed a way to prevent new ones from showing up.
The failure case for couple of these css_tryget_online usage is to
fallback to root_mem_cgroup which kind of make bypassing the memcg
limits possible for some workloads. For example creating an inotify
group in a subcontainer and then deleting that container after moving the
process to a different container will make all the event objects
allocated for that group to the root_mem_cgroup. So, using
css_tryget_online() is dangerous for such cases.
Two locations still use the online version. The swapin of offlined
memcg's pages and the memcg kmem cache creation. The kmem cache indeed
needs the online version as the kernel does the reparenting of memcg
kmem caches. For the swapin case, it has been left for later as the
fallback is not really that concerning.
With swap accounting enabled, if the memcg of the swapped out page is
not online then the memcg extracted from the given 'mm' will be charged
and if 'mm' is NULL then root memcg will be charged. However I could
not find a code path where the given 'mm' will be NULL for swap-in
case.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/20200302203109.179417-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The effective protection of any given cgroup is a somewhat complicated
construct that depends on the ancestor's configuration, siblings'
configurations, as well as current memory utilization in all these groups.
It's done this way to satisfy hierarchical delegation requirements while
also making the configuration semantics flexible and expressive in complex
real life scenarios.
Unfortunately, all the rules and requirements are sparsely documented, and
the code is a little too clever in merging different scenarios into a
single min() expression. This makes it hard to reason about the
implementation and avoid breaking semantics when making changes to it.
This patch documents each semantic rule individually and splits out the
handling of the overcommit case from the regular case.
Michal Koutný also points out that the points of equilibrium as described
in the existing example scenarios aren't actually accurate. Delete these
examples for now to avoid confusion.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-3-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: memcontrol: recursive memory.low protection", v3.
The current memory.low (and memory.min) semantics require protection to be
assigned to a cgroup in an untinterrupted chain from the top-level cgroup
all the way to the leaf.
In practice, we want to protect entire cgroup subtrees from each other
(system management software vs. workload), but we would like the VM to
balance memory optimally *within* each subtree, without having to make
explicit weight allocations among individual components. The current
semantics make that impossible.
They also introduce unmanageable complexity into more advanced resource
trees. For example:
host root
`- system.slice
`- rpm upgrades
`- logging
`- workload.slice
`- a container
`- system.slice
`- workload.slice
`- job A
`- component 1
`- component 2
`- job B
At a host-level perspective, we would like to protect the outer
workload.slice subtree as a whole from rpm upgrades, logging etc. But for
that to be effective, right now we'd have to propagate it down through the
container, the inner workload.slice, into the job cgroup and ultimately
the component cgroups where memory is actually, physically allocated.
This may cross several tree delegation points and namespace boundaries,
which make such a setup near impossible.
CPU and IO on the other hand are already distributed recursively. The
user would simply configure allowances at the host level, and they would
apply to the entire subtree without any downward propagation.
To enable the above-mentioned usecases and bring memory in line with other
resource controllers, this patch series extends memory.low/min such that
settings apply recursively to the entire subtree. Users can still assign
explicit shares in subgroups, but if they don't, any ancestral protection
will be distributed such that children compete freely amongst each other -
as if no memory control were enabled inside the subtree - but enjoy
protection from neighboring trees.
In the above example, the user would then be able to configure shares of
CPU, IO and memory at the host level to comprehensively protect and
isolate the workload.slice as a whole from system.slice activity.
Patch #1 fixes an existing bug that can give a cgroup tree more protection
than it should receive as per ancestor configuration.
Patch #2 simplifies and documents the existing code to make it easier to
reason about the changes in the next patch.
Patch #3 finally implements recursive memory protection semantics.
Because of a risk of regressing legacy setups, the new semantics are
hidden behind a cgroup2 mount option, 'memory_recursiveprot'.
More details in patch #3.
This patch (of 3):
When memory.low is overcommitted - i.e. the children claim more
protection than their shared ancestor grants them - the allowance is
distributed in proportion to how much each sibling uses their own declared
protection:
low_usage = min(memory.low, memory.current)
elow = parent_elow * (low_usage / siblings_low_usage)
However, siblings_low_usage is not the sum of all low_usages. It sums
up the usages of *only those cgroups that are within their memory.low*
That means that low_usage can be *bigger* than siblings_low_usage, and
consequently the total protection afforded to the children can be
bigger than what the ancestor grants the subtree.
Consider three groups where two are in excess of their protection:
A/memory.low = 10G
A/A1/memory.low = 10G, memory.current = 20G
A/A2/memory.low = 10G, memory.current = 20G
A/A3/memory.low = 10G, memory.current = 8G
siblings_low_usage = 8G (only A3 contributes)
A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G
(the 12.5G are capped to the explicit memory.low setting of 10G)
With that, the sum of all awarded protection below A is 30G, when A
only grants 10G for the entire subtree.
What does this mean in practice? A1 and A2 would still be in excess of
their 10G allowance and would be reclaimed, whereas A3 would not. As
they eventually drop below their protection setting, they would be
counted in siblings_low_usage again and the error would right itself.
When reclaim was applied in a binary fashion (cgroup is reclaimed when
it's above its protection, otherwise it's skipped) this would actually
work out just fine. However, since 1bc63fb127 ("mm, memcg: make scan
aggression always exclude protection"), reclaim pressure is scaled to
how much a cgroup is above its protection. As a result this
calculation error unduly skews pressure away from A1 and A2 toward the
rest of the system.
But why did we do it like this in the first place?
The reasoning behind exempting groups in excess from
siblings_low_usage was to go after them first during reclaim in an
overcommitted subtree:
A/memory.low = 2G, memory.current = 4G
A/A1/memory.low = 3G, memory.current = 2G
A/A2/memory.low = 1G, memory.current = 2G
siblings_low_usage = 2G (only A1 contributes)
A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G
While the children combined are overcomitting A and are technically
both at fault, A2 is actively declaring unprotected memory and we
would like to reclaim that first.
However, while this sounds like a noble goal on the face of it, it
doesn't make much difference in actual memory distribution: Because A
is overcommitted, reclaim will not stop once A2 gets pushed back to
within its allowance; we'll have to reclaim A1 either way. The end
result is still that protection is distributed proportionally, with A1
getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.
[ If A weren't overcommitted, it wouldn't make a difference since each
cgroup would just get the protection it declares:
A/memory.low = 2G, memory.current = 3G
A/A1/memory.low = 1G, memory.current = 1G
A/A2/memory.low = 1G, memory.current = 2G
With the current calculation:
siblings_low_usage = 1G (only A1 contributes)
A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
Including excess groups in siblings_low_usage:
siblings_low_usage = 2G
A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]
Simplify the calculation and fix the proportional reclaim bug by
including excess cgroups in siblings_low_usage.
After this patch, the effective memory.low distribution from the
example above would be as follows:
A/memory.low = 10G
A/A1/memory.low = 10G, memory.current = 20G
A/A2/memory.low = 10G, memory.current = 20G
A/A3/memory.low = 10G, memory.current = 8G
siblings_low_usage = 28G
A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G
Fixes: 1bc63fb127 ("mm, memcg: make scan aggression always exclude protection")
Fixes: 230671533d ("mm: memory.low hierarchical behavior")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-2-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drop the _memcg suffix from (__)memcg_kmem_(un)charge functions. It's
shorter and more obvious.
These are the most basic functions which are just (un)charging the given
cgroup with the given amount of pages.
Also fix up the corresponding comments.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-7-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many places in memcg_charge_slab() and memcg_uncharge_slab()
which are calculating the number of pages to charge, css references to
grab etc depending on the order of the slab page.
Let's simplify the code by calculating it once and caching in the local
variable.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-6-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These functions are charging the given number of kernel pages to the given
memory cgroup. The number doesn't have to be a power of two. Let's make
them to take the unsigned int nr_pages as an argument instead of the page
order.
It makes them look consistent with the corresponding uncharge functions
and functions like: mem_cgroup_charge_skmem(memcg, nr_pages).
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-5-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:
1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
the current memcg
2) set or clear the PageKmemcg flag
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drop the unused page argument and put the memcg pointer at the first
place. This make the function consistent with its peers:
__memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: memcg: kmem API cleanup", v2.
This patchset aims to clean up the kernel memory charging API. It doesn't
bring any functional changes, just removes unused arguments, renames some
functions and fixes some comments.
Currently it's not obvious which functions are most basic
(memcg_kmem_(un)charge_memcg()) and which are based on them
(memcg_kmem_(un)charge()). The patchset renames these functions and
removes unused arguments:
TL;DR:
was:
memcg_kmem_charge_memcg(page, gfp, order, memcg)
memcg_kmem_uncharge_memcg(memcg, nr_pages)
memcg_kmem_charge(page, gfp, order)
memcg_kmem_uncharge(page, order)
now:
memcg_kmem_charge(memcg, gfp, nr_pages)
memcg_kmem_uncharge(memcg, nr_pages)
memcg_kmem_charge_page(page, gfp, order)
memcg_kmem_uncharge_page(page, order)
This patch (of 6):
The first argument of memcg_kmem_charge_memcg() and
__memcg_kmem_charge_memcg() is the page pointer and it's not used. Let's
drop it.
Memcg pointer is passed as the last argument. Move it to the first place
for consistency with other memcg functions, e.g.
__memcg_kmem_uncharge_memcg() or try_charge().
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes we need to get a memcg pointer from a charged kernel object.
The right way to get it depends on whether it's a proper slab object or
it's backed by raw pages (e.g. it's a vmalloc alloction). In the first
case the kmem_cache->memcg_params.memcg indirection should be used; in
other cases it's just page->mem_cgroup.
To simplify this task and hide the implementation details let's use the
mem_cgroup_from_obj() helper, which takes a pointer to any kernel object
and returns a valid memcg pointer or NULL.
Passing a kernel address rather than a pointer to a page will allow to use
this helper for per-object (rather than per-page) tracked objects in the
future.
The caller is still responsible to ensure that the returned memcg isn't
going away underneath: take the rcu read lock, cgroup mutex etc; depending
on the context.
mem_cgroup_from_kmem() defined in mm/list_lru.c is now obsolete and can be
removed.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200117203609.3146239-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The shrinker_map may be touched from any cpu (e.g., a bit there may be set
by a task running everywhere) but kswapd is always bound to specific node.
So allocate shrinker_map from the related NUMA node to respect its NUMA
locality. Also, this follows generic way we use for allocation of memcg's
per-node data.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I manually set default n to MEMCG_KMEM in init/Kconfig, bellow error
occurs,
mm/slab_common.c: In function 'memcg_slab_start':
mm/slab_common.c:1530:30: error: 'struct mem_cgroup' has no member named
'kmem_caches'
return seq_list_start(&memcg->kmem_caches, *pos);
^
mm/slab_common.c: In function 'memcg_slab_next':
mm/slab_common.c:1537:32: error: 'struct mem_cgroup' has no member named
'kmem_caches'
return seq_list_next(p, &memcg->kmem_caches, pos);
^
mm/slab_common.c: In function 'memcg_slab_show':
mm/slab_common.c:1551:16: error: 'struct mem_cgroup' has no member named
'kmem_caches'
if (p == memcg->kmem_caches.next)
^
CC arch/x86/xen/smp.o
mm/slab_common.c: In function 'memcg_slab_start':
mm/slab_common.c:1531:1: warning: control reaches end of non-void function
[-Wreturn-type]
}
^
mm/slab_common.c: In function 'memcg_slab_next':
mm/slab_common.c:1538:1: warning: control reaches end of non-void function
[-Wreturn-type]
}
^
That's because kmem_caches is defined only when CONFIG_MEMCG_KMEM is set,
while memcg_slab_start() will use it no matter CONFIG_MEMCG_KMEM is defined
or not.
By the way, the reason I mannuly undefined CONFIG_MEMCG_KMEM is to verify
whether my some other code change is still stable when CONFIG_MEMCG_KMEM is
not set. Unfortunately, the existing code has been already unstable since
v4.11.
Fixes: bc2791f857 ("slab: link memcg kmem_caches on their associated memory cgroup")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/1580970260-2045-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
add_to_swap_cache() and delete_from_swap_cache() are counterparts, while
currently they use different ways to count pages.
It doesn't break anything because we only have two sizes for PageAnon, but
this is confusing and not good practice.
This patch corrects it by making both functions use hpage_nr_pages().
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200315012920.2687-1-richard.weiyang@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory barrier is needed after setting LRU bit, but smp_mb() is too
strong. Some architectures, i.e. x86, imply memory barrier with atomic
operations, so replacing it with smp_mb__after_atomic() sounds better,
which is nop on strong ordered machines, and full memory barriers on
others. With this change the vm-scalability cases would perform better on
x86, I saw total 6% improvement with this patch and previous inline fix.
The test data (lru-file-readtwice throughput) against v5.6-rc4:
mainline w/ inline fix w/ both (adding this)
150MB 154MB 159MB
Fixes: 9c4e6b1a70 ("mm, mlock, vmscan: no more skipping pagevecs")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: http://lkml.kernel.org/r/1584500541-46817-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When backporting commit 9c4e6b1a70 ("mm, mlock, vmscan: no more skipping
pagevecs") to our 4.9 kernel, our test bench noticed around 10% down with
a couple of vm-scalability's test cases (lru-file-readonce,
lru-file-readtwice and lru-file-mmap-read). I didn't see that much down
on my VM (32c-64g-2nodes). It might be caused by the test configuration,
which is 32c-256g with NUMA disabled and the tests were run in root memcg,
so the tests actually stress only one inactive and active lru. It sounds
not very usual in mordern production environment.
That commit did two major changes:
1. Call page_evictable()
2. Use smp_mb to force the PG_lru set visible
It looks they contribute the most overhead. The page_evictable() is a
function which does function prologue and epilogue, and that was used by
page reclaim path only. However, lru add is a very hot path, so it sounds
better to make it inline. However, it calls page_mapping() which is not
inlined either, but the disassemble shows it doesn't do push and pop
operations and it sounds not very straightforward to inline it.
Other than this, it sounds smp_mb() is not necessary for x86 since
SetPageLRU is atomic which enforces memory barrier already, replace it
with smp_mb__after_atomic() in the following patch.
With the two fixes applied, the tests can get back around 5% on that test
bench and get back normal on my VM. Since the test bench configuration is
not that usual and I also saw around 6% up on the latest upstream, so it
sounds good enough IMHO.
The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
mainline w/ inline fix
150MB 154MB
With this patch the throughput gets 2.67% up. The data with using
smp_mb__after_atomic() is showed in the following patch.
Shakeel Butt did the below test:
On a real machine with limiting the 'dd' on a single node and reading 100
GiB sparse file (less than a single node). Just ran a single instance to
not cause the lru lock contention. The cmdline used is "dd if=file-100GiB
of=/dev/null bs=4k". Ran the cmd 10 times with drop_caches in between and
measured the time it took.
Without patch: 56.64143 +- 0.672 sec
With patches: 56.10 +- 0.21 sec
[akpm@linux-foundation.org: move page_evictable() to internal.h]
Fixes: 9c4e6b1a70 ("mm, mlock, vmscan: no more skipping pagevecs")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we use a tmp pointer, pentry, to transfer and reset swap cache
slot, which is a little redundant. Swap cache slot stores the entry value
directly, assign and reset it by value would be straight forward.
Also this patch merges the else and if, since this is the only case we
refill and repeat swap cache.
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/20200311055352.50574-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
si->inuse_pages could be accessed concurrently as noticed by KCSAN,
write to 0xffff98b00ebd04dc of 4 bytes by task 82262 on cpu 92:
swap_range_free+0xbe/0x230
swap_range_free at mm/swapfile.c:719
swapcache_free_entries+0x1be/0x250
free_swap_slot+0x1c8/0x220
__swap_entry_free.constprop.19+0xa3/0xb0
free_swap_and_cache+0x53/0xa0
unmap_page_range+0x7e0/0x1ce0
unmap_single_vma+0xcd/0x170
unmap_vmas+0x18b/0x220
exit_mmap+0xee/0x220
mmput+0xe7/0x240
do_exit+0x598/0xfd0
do_group_exit+0x8b/0x180
get_signal+0x293/0x13d0
do_signal+0x37/0x5d0
prepare_exit_to_usermode+0x1b7/0x2c0
ret_from_intr+0x32/0x42
read to 0xffff98b00ebd04dc of 4 bytes by task 82499 on cpu 46:
try_to_unuse+0x86b/0xc80
try_to_unuse at mm/swapfile.c:2185
__x64_sys_swapoff+0x372/0xd40
do_syscall_64+0x91/0xb05
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The plain reads in try_to_unuse() are outside si->lock critical section
which result in data races that could be dangerous to be used in a loop.
Fix them by adding READ_ONCE().
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/1582578903-29294-1-git-send-email-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__pagevec_lru_add() is only used in mm directory now.
Remove the export symbol.
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200126011436.22979-1-richardw.yang@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The -EEXIST returned by __swap_duplicate means there is a swap cache
instead -EBUSY
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200212145754.27123-1-chenwandun@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FOLL_LONGTERM is a special case of FOLL_PIN. It suggests a pin which is
going to be given to hardware and can't move. It would truncate CMA
permanently and should be excluded.
In gup slow path, where
__gup_longterm_locked->check_and_migrate_cma_pages() handles
FOLL_LONGTERM, but in fast path, there lacks such a check, which means a
possible leak of CMA page to longterm pinned.
Place a check in try_grab_compound_head() in the fast path to fix the
leak, and if FOLL_LONGTERM happens on CMA, it will fall back to slow path
to migrate the page.
Some note about the check: Huge page's subpages have the same migrate type
due to either allocation from a free_list[] or alloc_contig_range() with
param MIGRATE_MOVABLE. So it is enough to check on a single subpage by
is_migrate_cma_page(subpage)
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Link: http://lkml.kernel.org/r/1584876733-17405-3-git-send-email-kernelfans@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To better reflect the held state of pages and make code self-explaining,
rename nr as nr_pinned.
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Link: http://lkml.kernel.org/r/1584876733-17405-2-git-send-email-kernelfans@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages. These pages need to be made accessible
before the host can access them.
While cpu accesses will trigger a fault that can be resolved, I/O accesses
will just fail. We need to add a callback into architecture code for
places that will do I/O, namely when writeback is started or when a page
reference is taken.
This is not only to enable paging, file backing etc, it is also necessary
to protect the host against a malicious user space. For example a bad
QEMU could simply start direct I/O on such protected memory. We do not
want userspace to be able to trigger I/O errors and thus the logic is
"whenever somebody accesses that page (gup) or does I/O, make sure that
this page can be accessed". When the guest tries to access that page we
will wait in the page fault handler for writeback to have finished and for
the page_ref to be the expected value.
On s390x the function is not supposed to fail, so it is ok to use a
WARN_ON on failure. If we ever need some more finegrained handling we can
tackle this when we know the details.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As part of pin_user_pages() and related API calls, pages are "dma-pinned".
For the case of compound pages of order > 1, the per-page accounting of
dma pins is accomplished via the 3rd struct page in the compound page. In
order to support debugging of any pin_user_pages()- related problems,
enhance dump_page() so as to report the pin count in that case.
Documentation/core-api/pin_user_pages.rst is also updated accordingly.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-13-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There was no protection against a corrupted struct page having an
implausible compound_head(). Sanity check that a compound page has a head
within reach of the maximum allocatable page (this will need to be
adjusted if one of the plans to allocate 1GB pages comes to fruition). In
addition,
- Print the mapping pointer using %p insted of %px. The actual value of
the pointer can be read out of the raw page dump and using %p gives a
chance to correlate it with an earlier printk of the mapping pointer
- Print the mapping pointer from the head page, not the tail page
(the tail ->mapping pointer may be in use for other purposes, eg part
of a list_head)
- Print the order of the page for compound pages
- Dump the raw head page as well as the raw page
- Print the refcount from the head page, not the tail page
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-12-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Up until now, gup_benchmark supported testing of the following kernel
functions:
* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)
Add test coverage for the new corresponding pin_*() functions:
* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages(): via the '-b' command line option
Also, add an option for clarity: '-u' for what is now (still) the default
choice: get_user_pages_fast().
Also, for the commands that set FOLL_PIN, verify that the pages really are
dma-pinned, via the new is_dma_pinned() routine. Those commands are:
PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_BENCHMARK : calls pin_user_pages()
In between the calls to pin_*() and unpin_user_pages(), check each page:
if page_maybe_dma_pinned() returns false, then WARN and return.
Do this outside of the benchmark timestamps, so that it doesn't affect
reported times.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-10-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.
Add two new fields to /proc/vmstat:
nr_foll_pin_acquired
nr_foll_pin_released
These are documented in Documentation/core-api/pin_user_pages.rst. They
represent the number of pages (since boot time) that have been pinned
("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
pin_user_pages*() and unpin_user_pages*().
In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.
Also: update Documentation/core-api/pin_user_pages.rst, to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat.
Also: update Documentation/core-api/pin_user_pages.rst to rename the new
/proc/vmstat entries, to the names listed here.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily, each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number
of huge pages that can be pinned.
This patch removes that limitation, by using an exact form of pin counting
for compound pages of order > 1. The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order 1
compound pages only have two pages, so that won't work there.
A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).
This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_dma_pinned() comment block. That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.
Documentation/core-api/pin_user_pages.rst is updated accordingly.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add tracking of pages that were pinned via FOLL_PIN. This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK).
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.
As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().
Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section. (That limitation will be
removed in a following patch.)
The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:
bool page_maybe_dma_pinned(struct page *page);
What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].
This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
[jhubbard@nvidia.com: add kerneldoc]
Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
[imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
[akpm@linux-foundation.org: fix put_compound_head defined but not used]
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Internal to mm/gup.c, require that get_user_pages_fast() and
__get_user_pages_fast() identify themselves, by setting FOLL_GET. This is
required in order to be able to make decisions based on "FOLL_PIN, or
FOLL_GET, or both or neither are set", in upcoming patches.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-6-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for an upcoming patch, send gup flags args to two more
routines: put_compound_head(), and undo_dev_pagemap().
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-5-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A subsequent patch requires access to gup flags, so pass the flags
argument through to the __gup_device_* functions.
Also placate checkpatch.pl by shortening a nearby line.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-3-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/gup: track FOLL_PIN pages", v6.
This activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].
FOLL_PIN support is now in the main linux tree. However, the patch to use
FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test
suite failure that involved (I think) page refcount overflows when huge
pages were used.
This patch definitively solves that kind of overflow problem, by adding an
exact pincount, for compound pages (of order > 1), in the 3rd struct page
of a compound page. If available, that form of pincounting is used,
instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara
for that idea.
Other interesting changes:
* dump_page(): added one, or two new things to report for compound
pages: head refcount (for all compound pages), and map_pincount (for
compound pages of order > 1).
* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
huge page refcount upper limit problems, and added notes about how it
works now. Also added a note about the dump_page() enhancements.
* Added some comments in gup.c and mm.h, to explain that there are two
ways to count pinned pages: exact (for compound pages of order > 1) and
fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
============================================================
General notes about the tracking patch:
This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)
In contrast to earlier approaches, the page tracking can be incrementally
applied to the kernel call sites that, until now, have been simply calling
get_user_pages() ("gup"). In other words, opt-in by changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
============================================================
Future steps:
* Convert more subsystems from get_user_pages() to pin_user_pages().
The first probably needs to be bio/biovecs, because any filesystem
testing is too difficult without those in place.
* Change VFS and filesystems to respond appropriately when encountering
dma-pinned pages.
* Work with Ira and others to connect this all up with file system
leases.
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages()
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
This patch (of 12):
An upcoming patch requires reusing the implementation of
get_user_pages_remote(). Split up get_user_pages_remote() into an outer
routine that checks flags, and an implementation routine that will be
reused. This makes subsequent changes much easier to understand.
There should be no change in behavior due to this patch.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-2-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- These were never called PCG flags; they've been called FGP flags since
their introduction in 2014.
- The FGP_FOR_MMAP flag was misleadingly documented as if it was an
alternative to FGP_CREAT instead of an option to it.
- Rename the 'offset' parameter to 'index'.
- Capitalisation, formatting, rewording.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-9-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No in-tree users (proc, madvise, memcg, mincore) can be built as a module.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-8-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dumping the page information in this circumstance helps for debugging.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-7-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first argument of shrink_readahead_size_eio() is not used. Hence
remove it from the function definition and from all the callers.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1583868093-24342-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mount failure issue happens under the scenario: Application forked dozens
of threads to mount the same number of cramfs images separately in docker,
but several mounts failed with high probability. Mount failed due to the
checking result of the page(read from the superblock of loop dev) is not
uptodate after wait_on_page_locked(page) returned in function cramfs_read:
wait_on_page_locked(page);
if (!PageUptodate(page)) {
...
}
The reason of the checking result of the page not uptodate: systemd-udevd
read the loopX dev before mount, because the status of loopX is Lo_unbound
at this time, so loop_make_request directly trigger the calling of io_end
handler end_buffer_async_read, which called SetPageError(page). So It
caused the page can't be set to uptodate in function
end_buffer_async_read:
if(page_uptodate && !PageError(page)) {
SetPageUptodate(page);
}
Then mount operation is performed, it used the same page which is just
accessed by systemd-udevd above, Because this page is not uptodate, it
will launch a actual read via submit_bh, then wait on this page by calling
wait_on_page_locked(page). When the I/O of the page done, io_end handler
end_buffer_async_read is called, because no one cleared the page
error(during the whole read path of mount), which is caused by
systemd-udevd reading, so this page is still in "PageError" status, which
can't be set to uptodate in function end_buffer_async_read, then caused
mount failure.
But sometimes mount succeed even through systemd-udeved read loopX dev
just before, The reason is systemd-udevd launched other loopX read just
between step 3.1 and 3.2, the steps as below:
1, loopX dev default status is Lo_unbound;
2, systemd-udved read loopX dev (page is set to PageError);
3, mount operation
1) set loopX status to Lo_bound;
==>systemd-udevd read loopX dev<==
2) read loopX dev(page has no error)
3) mount succeed
As the loopX dev status is set to Lo_bound after step 3.1, so the other
loopX dev read by systemd-udevd will go through the whole I/O stack, part
of the call trace as below:
SYS_read
vfs_read
do_sync_read
blkdev_aio_read
generic_file_aio_read
do_generic_file_read:
ClearPageError(page);
mapping->a_ops->readpage(filp, page);
here, mapping->a_ops->readpage() is blkdev_readpage. In latest kernel,
some function name changed, the call trace as below:
blkdev_read_iter
generic_file_read_iter
generic_file_buffered_read:
/*
* A previous I/O error may have been due to temporary
* failures, eg. mutipath errors.
* Pg_error will be set again if readpage fails.
*/
ClearPageError(page);
/* Start the actual read. The read will unlock the page*/
error=mapping->a_ops->readpage(flip, page);
We can see ClearPageError(page) is called before the actual read,
then the read in step 3.2 succeed.
This patch is to add the calling of ClearPageError just before the actual
read of read path of cramfs mount. Without the patch, the call trace as
below when performing cramfs mount:
do_mount
cramfs_read
cramfs_blkdev_read
read_cache_page
do_read_cache_page:
filler(data, page);
or
mapping->a_ops->readpage(data, page);
With the patch, the call trace as below when performing mount:
do_mount
cramfs_read
cramfs_blkdev_read
read_cache_page:
do_read_cache_page:
ClearPageError(page); <== new add
filler(data, page);
or
mapping->a_ops->readpage(data, page);
With the patch, mount operation trigger the calling of
ClearPageError(page) before the actual read, the page has no error if no
additional page error happen when I/O done.
Signed-off-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: <yubin@h3c.com>
Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There used to be a 'retry' label in between the two (identical) checks
when first introduced in commit f446daaea9 ("mm: implement writeback
livelock avoidance using page tagging"), and later modified/updated in
commit 6e6938b6d3 ("writeback: introduce .tagged_writepages for the
WB_SYNC_NONE sync stage").
The label has been removed in commit 64081362e8 ("mm/page-writeback.c:
fix range_cyclic writeback vs writepages deadlock"), and the (identical)
checks are now present / performed immediately one after another.
So, remove/deduplicate the latter check, moving tag_pages_for_writeback()
into the former check before the 'tag' variable assignment, so it's clear
that it's not used in this (similarly-named) function call but only later
in pagevec_lookup_range_tag().
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Link: http://lkml.kernel.org/r/20200218221716.1648-1-mfo@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When handling a page fault, we drop mmap_sem to start async readahead so
that we don't block on IO submission with mmap_sem held. However there's
no point to drop mmap_sem in case readahead is disabled. Handle that case
to avoid pointless dropping of mmap_sem and retrying the fault. This was
actually reported to block mlockall(MCL_CURRENT) indefinitely.
Fixes: 6b4c9f4469 ("filemap: drop the mmap_sem for all blocking operations")
Reported-by: Minchan Kim <minchan@kernel.org>
Reported-by: Robert Stupp <snazy@gmx.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmemleak could scan task stacks while plain writes happens to those stack
variables which could results in data races. For example, in
sys_rt_sigaction and do_sigaction(), it could have plain writes in a
32-byte size. Since the kmemleak does not care about the actual values of
a non-pointer and all do_sigaction() call sites only copy to stack
variables, just disable KCSAN for kmemleak to avoid annotating anything
outside Kmemleak just because Kmemleak scans everything.
Suggested-by: Marco Elver <elver@google.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: http://lkml.kernel.org/r/1583263716-25150-1-git-send-email-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clang warns:
mm/kmemleak.c:1955:28: warning: array comparison always evaluates to a constant [-Wtautological-compare]
if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
^
mm/kmemleak.c:1955:60: warning: array comparison always evaluates to a constant [-Wtautological-compare]
if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata)
These are not true arrays, they are linker defined symbols, which are just
addresses. Using the address of operator silences the warning and does
not change the resulting assembly with either clang/ld.lld or gcc/ld
(tested with diff + objdump -Dr).
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/895
Link: http://lkml.kernel.org/r/20200220051551.44000-1-natechancellor@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit ad2c814441.
The function node_to_mem_node() was introduced by that commit for use in SLUB
on systems with memoryless nodes, but it turned out to be unreliable on some
architectures/configurations and a simpler solution exists than fixing it up.
Thus commit 0715e6c516 ("mm, slub: prevent kmalloc_node crashes and
memory leaks") removed the only user of node_to_mem_node() and we can
revert the commit that introduced the function.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20200320115533.9604-2-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In a recent discussion[1] with Vitaly Nikolenko and Silvio Cesare, it
became clear that moving the freelist pointer away from the edge of
allocations would likely improve the overall defensive posture of the
inline freelist pointer. My benchmarks show no meaningful change to
performance (they seem to show it being faster), so this looks like a
reasonable change to make.
Instead of having the freelist pointer at the very beginning of an
allocation (offset 0) or at the very end of an allocation (effectively
offset -sizeof(void *) from the next allocation), move it away from the
edges of the allocation and into the middle. This provides some
protection against small-sized neighboring overflows (or underflows), for
which the freelist pointer is commonly the target. (Large or well
controlled overwrites are much more likely to attack live object contents,
instead of attempting freelist corruption.)
The vaunted kernel build benchmark, across 5 runs. Before:
Mean: 250.05
Std Dev: 1.85
and after, which appears mysteriously faster:
Mean: 247.13
Std Dev: 0.76
Attempts at running "sysbench --test=memory" show the change to be well in
the noise (sysbench seems to be pretty unstable here -- it's not really
measuring allocation).
Hackbench is more allocation-heavy, and while the std dev is above the
difference, it looks like may manifest as an improvement as well:
20 runs of "hackbench -g 20 -l 1000", before:
Mean: 36.322
Std Dev: 0.577
and after:
Mean: 36.056
Std Dev: 0.598
[1] https://twitter.com/vnik5287/status/1235113523098685440
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Vitaly Nikolenko <vnik@duasynt.com>
Cc: Silvio Cesare <silvio.cesare@gmail.com>
Cc: Christoph Lameter <cl@linux.com>Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/202003051624.AAAC9AECC@keescook
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Under CONFIG_SLAB_FREELIST_HARDENED=y, the obfuscation was relatively weak
in that the ptr and ptr address were usually so close that the first XOR
would result in an almost entirely 0-byte value[1], leaving most of the
"secret" number ultimately being stored after the third XOR. A single
blind memory content exposure of the freelist was generally sufficient to
learn the secret.
Add a swab() call to mix bits a little more. This is a cheap way (1
cycle) to make attacks need more than a single exposure to learn the
secret (or to know _where_ the exposure is in memory).
kmalloc-32 freelist walk, before:
ptr ptr_addr stored value secret
ffff90c22e019020@ffff90c22e019000 is 86528eb656b3b5bd (86528eb656b3b59d)
ffff90c22e019040@ffff90c22e019020 is 86528eb656b3b5fd (86528eb656b3b59d)
ffff90c22e019060@ffff90c22e019040 is 86528eb656b3b5bd (86528eb656b3b59d)
ffff90c22e019080@ffff90c22e019060 is 86528eb656b3b57d (86528eb656b3b59d)
ffff90c22e0190a0@ffff90c22e019080 is 86528eb656b3b5bd (86528eb656b3b59d)
...
after:
ptr ptr_addr stored value secret
ffff9eed6e019020@ffff9eed6e019000 is 793d1135d52cda42 (86528eb656b3b59d)
ffff9eed6e019040@ffff9eed6e019020 is 593d1135d52cda22 (86528eb656b3b59d)
ffff9eed6e019060@ffff9eed6e019040 is 393d1135d52cda02 (86528eb656b3b59d)
ffff9eed6e019080@ffff9eed6e019060 is 193d1135d52cdae2 (86528eb656b3b59d)
ffff9eed6e0190a0@ffff9eed6e019080 is f93d1135d52cdac2 (86528eb656b3b59d)
[1] https://blog.infosectcbr.com.au/2020/03/weaknesses-in-linux-kernel-heap.html
Fixes: 2482ddec67 ("mm: add SLUB free list pointer obfuscation")
Reported-by: Silvio Cesare <silvio.cesare@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/202003051623.AF4F8CB@keescook
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are slub_cpu_partial() and slub_set_cpu_partial() APIs to wrap
kmem_cache->cpu_partial. This patch will use the two APIs to replace
kmem_cache->cpu_partial in slub code.
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/1582079562-17980-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are slub_percpu_partial() and slub_set_percpu_partial() APIs to wrap
kmem_cache->cpu_partial. This patch will use the two to replace
cpu_slab->partial in slub code.
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/1581951895-3038-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series focuses on corner case bug fixes and general clarity
improvements to hmm_range_fault().
- 9 bug fixes
- Allow pgmap to track the 'owner' of a DEVICE_PRIVATE - in this case the
owner tells the driver if it can understand the DEVICE_PRIVATE page or
not. Use this to resolve a bug in nouveau where it could touch
DEVICE_PRIVATE pages from other drivers.
- Remove a bunch of dead, redundant or unused code and flags
- Clarity improvements to hmm_range_fault()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl6CUDEACgkQOG33FX4g
mxqXgQ/+I2o4QIq86xTddRCesINab73sA4bdvIYgixojTBvo5a9Tu9JO3nmzJOEi
mmUXsdN+MnyTdnrkcfbEvo0b5ktCGlwGYQmcRv9MrB1QvSChg6Zy3F3cWR+dsv5c
Mo3z0sWRx+//dVSI0p1BMhd5rym0nZCh/rUO94C/XgZ3YoC+MjX4lKeCMvjJpZFc
DDEN3Z+zjOvjeiT9TMNMmkPnZSY+N4RQwTRKek5l95QwcHyFy5SBI+uzOHyE0lVw
KG5+yk+TFRMoiHjZh4BFqkibQbVKG0qMKiOymIncTr3kL4DK0Hwf/4Zk4DMKucEb
Rs2B7C+UShiyIOzdjbnBsqPiHevAhR3nOpozJy3x/Z6fLapVoIlVx3UJJ/djuxZa
7+H2VzxKz1xGRDH4Js/WD0smZ8jisA04vW5THJVFr0Vd2sqKBo3G/5ay2Kw9NX7T
MRII1KkA/luZbmM6WLEQkliVuNkMCsVUU3hiY96tLI7uN73M9fQUe6s3NubeoFxi
UKg0zuMsAlsJxkHGeY+IMHtUR/law+k9/aZoB4idMGwV/i7KiiolbRcB0H5yn4L5
OV+4uxVBFS/n37sCpMwpx5WJtgbPzww3b6Cd0hfIUqnQzSOjP40Pl5ZX/c/2M7ps
bUdZjve5j653YeC/7NBPDprvzbfmyyxJdZZZzzXLQl5kyL2o9LE=
=UEpV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This series focuses on corner case bug fixes and general clarity
improvements to hmm_range_fault(). It arose from a review of
hmm_range_fault() by Christoph, Ralph and myself.
hmm_range_fault() is being used by these 'SVM' style drivers to
non-destructively read the page tables. It is very similar to
get_user_pages() except that the output is an array of PFNs and
per-pfn flags, and it has various modes of reading.
This is necessary before RDMA ODP can be converted, as we don't want
to have weird corner case regressions, which is still a looking
forward item. Ralph has a nice tester for this routine, but it is
waiting for feedback from the selftests maintainers.
Summary:
- 9 bug fixes
- Allow pgmap to track the 'owner' of a DEVICE_PRIVATE - in this case
the owner tells the driver if it can understand the DEVICE_PRIVATE
page or not. Use this to resolve a bug in nouveau where it could
touch DEVICE_PRIVATE pages from other drivers.
- Remove a bunch of dead, redundant or unused code and flags
- Clarity improvements to hmm_range_fault()"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (25 commits)
mm/hmm: return error for non-vma snapshots
mm/hmm: do not set pfns when returning an error code
mm/hmm: do not unconditionally set pfns when returning EBUSY
mm/hmm: use device_private_entry_to_pfn()
mm/hmm: remove HMM_FAULT_SNAPSHOT
mm/hmm: remove unused code and tidy comments
mm/hmm: return the fault type from hmm_pte_need_fault()
mm/hmm: remove pgmap checking for devmap pages
mm/hmm: check the device private page owner in hmm_range_fault()
mm: simplify device private page handling in hmm_range_fault
mm: handle multiple owners of device private pages in migrate_vma
memremap: add an owner field to struct dev_pagemap
mm: merge hmm_vma_do_fault into into hmm_vma_walk_hole_
mm/hmm: don't handle the non-fault case in hmm_vma_walk_hole_()
mm/hmm: simplify hmm_vma_walk_hugetlb_entry()
mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flag
mm/hmm: don't provide a stub for hmm_range_fault()
mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()
mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling
mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages
...
blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
don't get offlined while there are active cgwbs on them. However, it
ends up making offlining unordered sometimes causing parents to be
offlined before children.
To fix it, we want child blkcgs to pin the parents' online states
turning the refcnt into a more generic online pinning mechanism.
In prepartion,
* blkcg->cgwb_refcnt -> blkcg->online_pin
* blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
* Take them out of CONFIG_CGROUP_WRITEBACK
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently there are 3 emails tied to me in the kernel tree, I'd rather
dennis@kernel.org be the only one.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
- In-kernel Pointer Authentication support (previously only offered to
user space).
- ARM Activity Monitors (AMU) extension support allowing better CPU
utilisation numbers for the scheduler (frequency invariance).
- Memory hot-remove support for arm64.
- Lots of asm annotations (SYM_*) in preparation for the in-kernel
Branch Target Identification (BTI) support.
- arm64 perf updates: ARMv8.5-PMU 64-bit counters, refactoring the PMU
init callbacks, support for new DT compatibles.
- IPv6 header checksum optimisation.
- Fixes: SDEI (software delegated exception interface) double-lock on
hibernate with shared events.
- Minor clean-ups and refactoring: cpu_ops accessor, cpu_do_switch_mm()
converted to C, cpufeature finalisation helper.
- sys_mremap() comment explaining the asymmetric address untagging
behaviour.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl6DVyIACgkQa9axLQDI
XvHkqRAAiZA2EYKiQL4M1DJ1cNTADjT7xKX9+UtYBXj7GMVhgVWdunpHVE6qtfgk
cT6avmKrS/6PDqizJgr+Z1yX8x3Kvs57G4BvmIUKIw97mkdewvFQ9JKv6VA1vb86
7Qrl1WzqsGg5Kj9uUfI4h+ZoT1H4C/9PQeFxJwgZRtF9DxRh8O7VeZI+JCu8Aub2
lIkjI8rh+EpTsGT9h/PMGWUcawnKQloZ1/F+GfMAuYBvIv2RNN2xVreJtTmm4NyJ
VcpL0KCNyAI2lGdaJg5nBLRDyGuXDm5i+PLsCSXMquI4fie00txXeD8sjbeuO0ks
YTJ0EhmUUhbSE17go+SxYiEFE0v09i+lD5ud+B4Vmojp0KTczTta9VSgURlbb2/9
n9biq5G3PPDNIrZqiTT2Tf4AMz1350nkbzL2gzKecM5aIzR/u3y5yII5CgfZtFnj
7bGbyFpFpcqI7UaISPsNCxmknbTt/7ff0WM3+7SbecxI3AD2mnxsOdN9JTLyhDp+
owjyiaWxl5zMWF9DhplLG/9BKpNWSxh3skazdOdELd8GTq2MbJlXrVG2XgXTAOh3
y1s6RQrfw8zXh8TSqdmmzauComXIRWTum/sbVB3U8Z3AUsIeq/NTSbN5X9JyIbOP
HOabhlVhhkI6omN1grqPX4jwUiZLZoNfn7Ez4q71549KVK/uBtA=
=LJVX
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"The bulk is in-kernel pointer authentication, activity monitors and
lots of asm symbol annotations. I also queued the sys_mremap() patch
commenting the asymmetry in the address untagging.
Summary:
- In-kernel Pointer Authentication support (previously only offered
to user space).
- ARM Activity Monitors (AMU) extension support allowing better CPU
utilisation numbers for the scheduler (frequency invariance).
- Memory hot-remove support for arm64.
- Lots of asm annotations (SYM_*) in preparation for the in-kernel
Branch Target Identification (BTI) support.
- arm64 perf updates: ARMv8.5-PMU 64-bit counters, refactoring the
PMU init callbacks, support for new DT compatibles.
- IPv6 header checksum optimisation.
- Fixes: SDEI (software delegated exception interface) double-lock on
hibernate with shared events.
- Minor clean-ups and refactoring: cpu_ops accessor,
cpu_do_switch_mm() converted to C, cpufeature finalisation helper.
- sys_mremap() comment explaining the asymmetric address untagging
behaviour"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (81 commits)
mm/mremap: Add comment explaining the untagging behaviour of mremap()
arm64: head: Convert install_el2_stub to SYM_INNER_LABEL
arm64: Introduce get_cpu_ops() helper function
arm64: Rename cpu_read_ops() to init_cpu_ops()
arm64: Declare ACPI parking protocol CPU operation if needed
arm64: move kimage_vaddr to .rodata
arm64: use mov_q instead of literal ldr
arm64: Kconfig: verify binutils support for ARM64_PTR_AUTH
lkdtm: arm64: test kernel pointer authentication
arm64: compile the kernel with ptrauth return address signing
kconfig: Add support for 'as-option'
arm64: suspend: restore the kernel ptrauth keys
arm64: __show_regs: strip PAC from lr in printk
arm64: unwind: strip PAC from kernel addresses
arm64: mask PAC bits of __builtin_return_address
arm64: initialize ptrauth keys for kernel booting task
arm64: initialize and switch ptrauth kernel keys
arm64: enable ptrauth earlier
arm64: cpufeature: handle conflicts based on capability
arm64: cpufeature: Move cpu capability helpers inside C file
...
The pagewalker does not call most ops with NULL vma, those are all routed
to hmm_vma_walk_hole() via ops->pte_hole instead.
Thus hmm_vma_fault() is only called with a NULL vma from
hmm_vma_walk_hole(), so hoist the NULL vma check to there.
Now it is clear that snapshotting with no vma is a HMM_PFN_ERROR as
without a vma we have no path to call hmm_vma_fault().
Link: https://lore.kernel.org/r/20200327200021.29372-10-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Most places that return an error code, like -EFAULT, do not set
HMM_PFN_ERROR, only two places do this.
Resolve this inconsistency by never setting the pfns on an error
exit. This doesn't seem like a worthwhile thing to do anyhow.
If for some reason it becomes important, it makes more sense to directly
return the address of the failing page rather than have the caller scan
for the HMM_PFN_ERROR.
No caller inspects the pnfs output array if hmm_range_fault() fails.
Link: https://lore.kernel.org/r/20200327200021.29372-9-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In hmm_vma_handle_pte() and hmm_vma_walk_hugetlb_entry() if fault happens
then -EBUSY will be returned and the pfns input flags will have been
destroyed.
For hmm_vma_handle_pte() set HMM_PFN_NONE only on the success returns that
don't otherwise store to pfns.
For hmm_vma_walk_hugetlb_entry() all exit paths already set pfns, so
remove the redundant store.
Fixes: 2aee09d8c1 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
Link: https://lore.kernel.org/r/20200327200021.29372-8-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
swp_offset() should not be called directly, the wrappers are supposed to
abstract away the encoding of the device_private specific information in
the swap entry.
Link: https://lore.kernel.org/r/20200327200021.29372-7-jgg@ziepe.ca
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix the crash like this:
BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc000000000c3447c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
...
NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
LR [c000000000088354] vmemmap_free+0x144/0x320
Call Trace:
section_deactivate+0x220/0x240
__remove_pages+0x118/0x170
arch_remove_memory+0x3c/0x150
memunmap_pages+0x1cc/0x2f0
devm_action_release+0x30/0x50
release_nodes+0x2f8/0x3e0
device_release_driver_internal+0x168/0x270
unbind_store+0x130/0x170
drv_attr_store+0x44/0x60
sysfs_kf_write+0x68/0x80
kernfs_fop_write+0x100/0x290
__vfs_write+0x3c/0x70
vfs_write+0xcc/0x240
ksys_write+0x7c/0x140
system_call+0x5c/0x68
The crash is due to NULL dereference at
test_bit(idx, ms->usage->subsection_map);
due to ms->usage = NULL in pfn_section_valid()
With commit d41e2f3bd5 ("mm/hotplug: fix hot remove failure in
SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
depopulate_section_mem(). This was done so that pfn_page() can work
correctly with kernel config that disables SPARSEMEM_VMEMMAP. With that
config pfn_to_page does
__section_mem_map_addr(__sec) + __pfn;
where
static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
unsigned long map = section->section_mem_map;
map &= SECTION_MAP_MASK;
return (struct page *)map;
}
Now with SPASEMEM_VMEMAP enabled, mem_section->usage->subsection_map is
used to check the pfn validity (pfn_valid()). Since section_deactivate
release mem_section->usage if a section is fully deactivated,
pfn_valid() check after a subsection_deactivate cause a kernel crash.
static inline int pfn_valid(unsigned long pfn)
{
...
return early_section(ms) || pfn_section_valid(ms, pfn);
}
where
static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
int idx = subsection_map_index(pfn);
return test_bit(idx, ms->usage->subsection_map);
}
Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
freed. For architectures like ppc64 where large pages are used for
vmmemap mapping (16MB), a specific vmemmap mapping can cover multiple
sections. Hence before a vmemmap mapping page can be freed, the kernel
needs to make sure there are no valid sections within that mapping.
Clearing the section valid bit before depopulate_section_memap enables
this.
[aneesh.kumar@linux.ibm.com: add comment]
Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.comLink: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
Fixes: d41e2f3bd5 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().
In the first and the second cases page->mem_cgroup pointer is set, but
in the third it's not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg . In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
page->mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.
It can lead to kernel_stack per-memcg counters permanently showing 0 on
some architectures (depending on the configuration).
In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as a first argument, uses
mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
mod_memcg_state(). It allows to handle all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c .
Note: This is a special version of the patch created for stable
backports. It contains code from the following two patches:
- mm: memcg/slab: introduce mem_cgroup_from_obj()
- mm: fork: fix kernel_stack memcg stats for various stack implementations
[guro@fb.com: introduce mem_cgroup_from_obj()]
Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This appears to be a mistake in commit faced7e080 ("mm: hugetlb
controller for cgroups v2").
Essentially that commit does a hugetlb_cgroup_from_counter assuming that
page_counter_try_charge has initialized counter.
But if that has failed then it seems will not initialize counter, so
hugetlb_cgroup_from_counter(counter) ends up pointing to random memory,
causing kasan to complain.
The solution is to simply use 'h_cg', instead of
hugetlb_cgroup_from_counter(counter), since that is a reference to the
hugetlb_cgroup anyway. After this change kasan ceases to complain.
Fixes: faced7e080 ("mm: hugetlb controller for cgroups v2")
Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Giuseppe Scrivano <gscrivan@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
claim_swapfile() currently keeps the inode locked when it is successful,
or the file is already swapfile (with -EBUSY). And, on the other error
cases, it does not lock the inode.
This inconsistency of the lock state and return value is quite confusing
and actually causing a bad unlock balance as below in the "bad_swap"
section of __do_sys_swapon().
This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE
check out of claim_swapfile(). The inode is unlocked in
"bad_swap_unlock_inode" section, so that the inode is ensured to be
unlocked at "bad_swap". Thus, error handling codes after the locking now
jumps to "bad_swap_unlock_inode" instead of "bad_swap".
=====================================
WARNING: bad unlock balance detected!
5.5.0-rc7+ #176 Not tainted
-------------------------------------
swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
but there are no more locks to release!
other info that might help us debug this:
no locks held by swapon/4294.
stack backtrace:
CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
Call Trace:
dump_stack+0xa1/0xea
print_unlock_imbalance_bug.cold+0x114/0x123
lock_release+0x562/0xed0
up_write+0x2d/0x490
__do_sys_swapon+0x94b/0x3550
__x64_sys_swapon+0x54/0x80
do_syscall_64+0xa4/0x4b0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f15da0a0dc7
Fixes: 1638045c36 ("mm: set S_SWAPFILE on blockdev swap devices")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Qais Youef <qais.yousef@arm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that flags are handled on a fine-grained per-page basis this global
flag is redundant and has a confusing overlap with the pfn_flags_mask and
default_flags.
Normalize the HMM_FAULT_SNAPSHOT behavior into one place. Callers needing
the SNAPSHOT behavior should set a pfn_flags_mask and default_flags that
always results in a cleared HMM_PFN_VALID. Then no pages will be faulted,
and HMM_FAULT_SNAPSHOT is not a special flow that overrides the masking
mechanism.
As this is the last flag, also remove the flags argument. If future flags
are needed they can be part of the struct hmm_range function arguments.
Link: https://lore.kernel.org/r/20200327200021.29372-5-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Delete several functions that are never called, fix some desync between
comments and structure content, toss the now out of date top of file
header, and move one function only used by hmm.c into hmm.c
Link: https://lore.kernel.org/r/20200327200021.29372-4-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Using two bools instead of flags return is not necessary and leads to
bugs. Returning a value is easier for the compiler to check and easier to
pass around the code flow.
Convert the two bools into flags and push the change to all callers.
Link: https://lore.kernel.org/r/20200327200021.29372-3-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The checking boils down to some racy check if the pagemap is still
available or not. Instead of checking this, rely entirely on the
notifiers, if a pagemap is destroyed then all pages that belong to it must
be removed from the tables and the notifiers triggered.
Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.ca
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
hmm_range_fault() will succeed for any kind of device private memory, even
if it doesn't belong to the calling entity. While nouveau has some crude
checks for that, they are broken because they assume nouveau is the only
user of device private memory. Fix this by passing in an expected pgmap
owner in the hmm_range_fault structure.
If a device_private page is found and doesn't match the owner then it is
treated as an non-present and non-faultable page.
This prevents a bug in amdgpu, where it doesn't know how to handle
device_private pages, but hmm_range_fault would return them anyhow.
Fixes: 4ef589dc9b ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE")
Link: https://lore.kernel.org/r/20200316193216.920734-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove the HMM_PFN_DEVICE_PRIVATE flag, no driver has ever set this flag
on input, and the only place that uses it on output can be trivially
changed to use is_device_private_page().
This removes the ability to request that device_private pages are faulted
back into system memory.
Link: https://lore.kernel.org/r/20200316193216.920734-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a new src_owner field to struct migrate_vma. If the field is set,
only device private pages with page->pgmap->owner equal to that field are
migrated. If the field is not set only "normal" pages are migrated.
Fixes: df6ad69838 ("mm/device-public-memory: device memory cache coherent with CPU")
Link: https://lore.kernel.org/r/20200316193216.920734-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a new opaque owner field to struct dev_pagemap, which will allow the
hmm and migrate_vma code to identify who owns ZONE_DEVICE memory, and
refuse to work on mappings not owned by the calling entity.
Link: https://lore.kernel.org/r/20200316193216.920734-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no good reason for this split, as it just obsfucates the flow.
Link: https://lore.kernel.org/r/20200316135310.899364-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Setting a pfns entry to NONE before returning -EBUSY is a bug that will
cause corruption of the input flags on the next loop.
There is just a single caller using hmm_vma_walk_hole_() for the non-fault
case. Use hmm_pfns_fill() to fill the whole pfn array with zeroes in the
only caller for the non-fault case and remove the non-fault path from
hmm_vma_walk_hole_(). This avoids setting NONE before returning -EBUSY.
Also rename the function to hmm_vma_fault() to better describe what it
does.
Fixes: 2aee09d8c1 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
Link: https://lore.kernel.org/r/20200316135310.899364-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove the rather confusing goto label and just handle the fault case
directly in the branch checking for it.
Link: https://lore.kernel.org/r/20200316135310.899364-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The HMM_FAULT_ALLOW_RETRY isn't used anywhere in the tree. Remove it and
the weird -EAGAIN handling where handle_mm_fault() drops the mmap_sem.
Link: https://lore.kernel.org/r/20200316135310.899364-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
pmd_to_hmm_pfn_flags() already checks it and makes the cpu flags 0. If no
fault is requested then the pfns should be returned with the not valid
flags.
It should not unconditionally fault if faulting is not requested.
Fixes: 2aee09d8c1 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently if a special PTE is encountered hmm_range_fault() immediately
returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing
uses).
EFAULT should only be returned after testing with hmm_pte_need_fault().
Also pte_devmap() and pte_special() are exclusive, and there is no need to
check IS_ENABLED, pte_special() is stubbed out to return false on
unsupported architectures.
Fixes: 992de9a8b7 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
hmm_range_fault() should never return 0 if the caller requested a valid
page, but the pfns output for that page would be HMM_PFN_ERROR.
hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to
detect if the page is in faulting mode or not.
Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated
code.
Fixes: d08faca018 ("mm/hmm: properly handle migration pmd")
Fixes: da4c3c735e ("mm/hmm/mirror: helper to snapshot CPU page table")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The intention with this code is to determine if the caller required the
pages to be valid, and if so, then take some action to make them valid.
The action varies depending on the page type.
In all cases, if the caller doesn't ask for the page, then
hmm_range_fault() should not return an error.
Revise the implementation to be clearer, and fix some bugs:
- hmm_pte_need_fault() must always be called before testing fault or
write_fault otherwise the defaults of false apply and the if()'s don't
work. This was missed on the is_migration_entry() branch
- -EFAULT should not be returned unless hmm_pte_need_fault() indicates
fault is required - ie snapshotting should not fail.
- For !pte_present() the cpu_flags are always 0, except in the special
case of is_device_private_entry(), calling pte_to_hmm_pfn_flags() is
confusing.
Reorganize the flow so that it always follows the pattern of calling
hmm_pte_need_fault() and then checking fault || write_fault.
Fixes: 2aee09d8c1 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All return paths that do EFAULT must call hmm_range_need_fault() to
determine if the user requires this page to be valid.
If the page cannot be made valid if the user later requires it, due to vma
flags in this case, then the return should be HMM_PFN_ERROR.
Fixes: a3e0d41c2b ("mm/hmm: improve driver API to work and wait over a range")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All success exit paths from the walker functions must set the pfns array.
A migration entry with no required fault is a HMM_PFN_NONE return, just
like the pte case.
Fixes: d08faca018 ("mm/hmm: properly handle migration pmd")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This eventually calls into handle_mm_fault() which is a sleeping function.
Release the lock first.
hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not
need the lock.
Fixes: 3afc423632 ("mm: pagewalk: add p4d_entry() and pgd_entry()")
Cc: Steven Price <steven.price@arm.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Many of the direct returns of error skipped doing the pte_unmap(). All non
zero exit paths must unmap the pte.
The pte_unmap() is split unnaturally like this because some of the error
exit paths trigger a sleep and must release the lock before sleeping.
Fixes: 992de9a8b7 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
Fixes: 53f5c3f489 ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()")
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
slab does this already, and I want to use this in a memory allocation
tracker in drm for stuff that's tied to the lifetime of a drm_device,
not the underlying struct device. Kinda like devres, but for drm.
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Link: https://patchwork.freedesktop.org/patch/msgid/20200323144950.3018436-2-daniel.vetter@ffwll.ch
Commit dcde237319 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()") changed mremap() so that only the 'old' address
is untagged, leaving the 'new' address in the form it was passed from
userspace. This prevents the unexpected creation of aliasing virtual
mappings in userspace, but looks a bit odd when you read the code.
Add a comment justifying the untagging behaviour in mremap().
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
This removes a duplicate "a" in the comment in process_vm_rw_core.
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
For graphics drivers needing to modify the page-protection, add
huge page-table entries counterparts to vmf_insert_pfn_prot().
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
The functions wp_huge_pmd() and wp_huge_pud() currently relies on the
huge_fault() callback to split huge page table entries if needed.
However for module users that requires export of the split_huge_xxx()
functionality which may be undesired. Instead split pre-existing huge
page-table entries on VM_FAULT_FALLBACK return.
We currently only do COW and write-notify on the PTE level, so if the
huge_fault() handler returns VM_FAULT_FALLBACK on wp faults,
split the huge pages and page-table entries. Also do this for huge PUDs
if there is no huge_fault() handler and the vma is not anonymous, similar
to how it's done for PMDs.
Note that fs/dax.c still does the splitting in the huge_fault() handler,
but as huge_fault() A follow-up patch can remove the dax.c split_huge_pmd()
if needed.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
For VM_PFNMAP and VM_MIXEDMAP vmas that want to support transhuge pages
and -page table entries, introduce vma_is_special_huge() that takes the
same codepaths as vma_is_dax().
The use of "special" follows the definition in memory.c, vm_normal_page():
"Special" mappings do not wish to be associated with a "struct page"
(either it doesn't exist, or it exists but they don't want to touch it)
For PAGE_SIZE pages, "special" is determined per page table entry to be
able to deal with COW pages. But since we don't have huge COW pages,
we can classify a vma as either "special huge" or "normal huge".
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Commit 3f8fd02b1b ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path. While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.
Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.
To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:
* vmalloc_sync_mappings(), and
* vmalloc_sync_unmappings()
Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized. The only exception is the new call-site added in the
above mentioned commit.
Shile Zhang directed us to a report of an 80% regression in reaim
throughput.
Fixes: 3f8fd02b1b ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sachin reports [1] a crash in SLUB __slab_alloc():
BUG: Kernel NULL pointer dereference on read at 0x000073b0
Faulting instruction address: 0xc0000000003d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000
CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
NIP ___slab_alloc+0x1f4/0x760
LR __slab_alloc+0x34/0x60
Call Trace:
___slab_alloc+0x334/0x760 (unreliable)
__slab_alloc+0x34/0x60
__kmalloc_node+0x110/0x490
kvmalloc_node+0x58/0x110
mem_cgroup_css_online+0x108/0x270
online_css+0x48/0xd0
cgroup_apply_control_enable+0x2ec/0x4d0
cgroup_mkdir+0x228/0x5f0
kernfs_iop_mkdir+0x90/0xf0
vfs_mkdir+0x110/0x230
do_mkdirat+0xb0/0x1a0
system_call+0x5c/0x68
This is a PowerPC platform with following NUMA topology:
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node 0 1
0: 10 40
1: 40 10
possible numa nodes: 0-31
This only happens with a mmotm patch "mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node" [2] which effectively calls
kmalloc_node for each possible node. SLUB however only allocates
kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
node_to_mem_node to return such valid node for other nodes since commit
a561ce00b0 ("slub: fall back to node_to_mem_node() node if allocating
on memoryless node"). This is however not true in this configuration
where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
thus it contains zeroes and get_partial() ends up accessing
non-allocated kmem_cache_node.
A related issue was reported by Bharata (originally by Ramachandran) [3]
where a similar PowerPC configuration, but with mainline kernel without
patch [2] ends up allocating large amounts of pages by kmalloc-1k
kmalloc-512. This seems to have the same underlying issue with
node_to_mem_node() not behaving as expected, and might probably also
lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].
This patch should fix both issues by not relying on node_to_mem_node()
anymore and instead simply falling back to NUMA_NO_NODE, when
kmalloc_node(node) is attempted for a node that's not online, or has no
usable memory. The "usable memory" condition is also changed from
node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
the condition that SLUB uses to allocate kmem_cache_node structures.
The check in get_partial() is removed completely, as the checks in
___slab_alloc() are now sufficient to prevent get_partial() being
reached with an invalid node.
[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
[2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/
Fixes: a561ce00b0 ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is safe to traverse mm->notifier_subscriptions->list either under
SRCU read lock or mm->notifier_subscriptions->lock using
hlist_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false positives,
for example,
WARNING: suspicious RCU usage
-----------------------------
mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
3 locks held by libvirtd/802:
#0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0
#1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800
#2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460
stack backtrace:
CPU: 7 PID: 802 Comm: libvirtd Tainted: G I 5.6.0-rc6-next-20200317+ #2
Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014
Call Trace:
dump_stack+0xa4/0xfe
lockdep_rcu_suspicious+0xeb/0xf5
__mmu_notifier_invalidate_range_start+0x3ff/0x460
change_p4d_range+0x746/0x800
change_protection+0x1df/0x300
mprotect_fixup+0x245/0x3e0
do_mprotect_pkey+0x23b/0x3e0
__x64_sys_mprotect+0x51/0x70
do_syscall_64+0x91/0xae8
entry_SYSCALL_64_after_hwframe+0x49/0xb3
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann has brought up a very interesting point [1]. While shared pages
are excluded from MADV_PAGEOUT normally, CoW pages can be easily
reclaimed that way. This can lead to all sorts of hard to debug
problems. E.g. performance problems outlined by Daniel [2].
There are runtime environments where there is a substantial memory
shared among security domains via CoW memory and a easy to reclaim way
of that memory, which MADV_{COLD,PAGEOUT} offers, can lead to either
performance degradation in for the parent process which might be more
privileged or even open side channel attacks.
The feasibility of the latter is not really clear to me TBH but there is
no real reason for exposure at this stage. It seems there is no real
use case to depend on reclaiming CoW memory via madvise at this stage so
it is much easier to simply disallow it and this is what this patch
does. Put it simply MADV_{PAGEOUT,COLD} can operate only on the
exclusively owned memory which is a straightforward semantic.
[1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
[2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
Fixes: 9c276cc65a ("mm: introduce MADV_COLD")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prior to this commit, we only directly check the affected cgroup's
memory.high against its usage. However, it's possible that we are being
reclaimed as a result of hitting an ancestor memory.high and should be
penalised based on that, instead.
This patch changes memory.high overage throttling to use the largest
overage in its ancestors when considering how many penalty jiffies to
charge. This makes sure that we penalise poorly behaving cgroups in the
same way regardless of at what level of the hierarchy memory.high was
breached.
Fixes: 0e4b01df86 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org> [5.4.x+]
Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0e4b01df86 had a bunch of fixups to use the right division
method. However, it seems that after all that it still wasn't right --
div_u64 takes a 32-bit divisor.
The headroom is still large (2^32 pages), so on mundane systems you
won't hit this, but this should definitely be fixed.
Fixes: 0e4b01df86 ("mm, memcg: throttle allocators when failing reclaim over memory.high")
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: <stable@vger.kernel.org> [5.4.x+]
Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In section_deactivate(), pfn_to_page() doesn't work any more after
ms->section_mem_map is resetting to NULL in SPARSEMEM|!VMEMMAP case. It
causes a hot remove failure:
kernel BUG at mm/page_alloc.c:4806!
invalid opcode: 0000 [#1] SMP PTI
CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G W 5.5.0-next-20200205+ #340
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:free_pages+0x85/0xa0
Call Trace:
__remove_pages+0x99/0xc0
arch_remove_memory+0x23/0x4d
try_remove_memory+0xc8/0x130
__remove_memory+0xa/0x11
acpi_memory_device_remove+0x72/0x100
acpi_bus_trim+0x55/0x90
acpi_device_hotplug+0x2eb/0x3d0
acpi_hotplug_work_fn+0x1a/0x30
process_one_work+0x1a7/0x370
worker_thread+0x30/0x380
kthread+0x112/0x130
ret_from_fork+0x35/0x40
Let's move the ->section_mem_map resetting after
depopulate_section_memmap() to fix it.
[akpm@linux-foundation.org: remove unneeded initialization, per David]
Fixes: ba72b4c8cf ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An eventfd monitors multiple memory thresholds of the cgroup, closes them,
the kernel deletes all events related to this eventfd. Before all events
are deleted, another eventfd monitors the memory threshold of this cgroup,
leading to a crash:
BUG: kernel NULL pointer dereference, address: 0000000000000004
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0
Oops: 0002 [#1] SMP PTI
CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3
Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014
Workqueue: events memcg_event_remove
RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190
RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001
RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010
R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880
R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0
Call Trace:
memcg_event_remove+0x32/0x90
process_one_work+0x172/0x380
worker_thread+0x49/0x3f0
kthread+0xf8/0x130
ret_from_fork+0x35/0x40
CR2: 0000000000000004
We can reproduce this problem in the following ways:
1. We create a new cgroup subdirectory and a new eventfd, and then we
monitor multiple memory thresholds of the cgroup through this eventfd.
2. closing this eventfd, and __mem_cgroup_usage_unregister_event ()
will be called multiple times to delete all events related to this
eventfd.
The first time __mem_cgroup_usage_unregister_event() is called, the
kernel will clear all items related to this eventfd in thresholds->
primary.
Since there is currently only one eventfd, thresholds-> primary becomes
empty, so the kernel will set thresholds-> primary and hresholds-> spare
to NULL. If at this time, the user creates a new eventfd and monitor
the memory threshold of this cgroup, kernel will re-initialize
thresholds-> primary.
Then when __mem_cgroup_usage_unregister_event () is called for the
second time, because thresholds-> primary is not empty, the system will
access thresholds-> spare, but thresholds-> spare is NULL, which will
trigger a crash.
In general, the longer it takes to delete all events related to this
eventfd, the easier it is to trigger this problem.
The solution is to check whether the thresholds associated with the
eventfd has been cleared when deleting the event. If so, we do nothing.
[akpm@linux-foundation.org: fix comment, per Kirill]
Fixes: 907860ed38 ("cgroups: make cftype.unregister_event() void-returning")
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is just a cleanup addition to Jann's fix to properly update the
transaction ID for the slub slowpath in commit fd4d9c7d0c ("mm: slub:
add missing TID bump..").
The transaction ID is what protects us against any concurrent accesses,
but we should really also make sure to make the 'freelist' comparison
itself always use the same freelist value that we then used as the new
next free pointer.
Jann points out that if we do all of this carefully, we could skip the
transaction ID update for all the paths that only remove entries from
the lists, and only update the TID when adding entries (to avoid the ABA
issue with cmpxchg and list handling re-adding a previously seen value).
But this patch just does the "make sure to cmpxchg the same value we
used" rather than then try to be clever.
Acked-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When kmem_cache_alloc_bulk() attempts to allocate N objects from a percpu
freelist of length M, and N > M > 0, it will first remove the M elements
from the percpu freelist, then call ___slab_alloc() to allocate the next
element and repopulate the percpu freelist. ___slab_alloc() can re-enable
IRQs via allocate_slab(), so the TID must be bumped before ___slab_alloc()
to properly commit the freelist head change.
Fix it by unconditionally bumping c->tid when entering the slowpath.
Cc: stable@vger.kernel.org
Fixes: ebe909e0fd ("slub: improve bulk alloc strategy")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This helps set up size accounting in the next commit. Without this out
param, it's difficult to find out the removed xattr size without taking
a lock for longer and walking the xattr linked list twice.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull networking fixes from David Miller:
"It looks like a decent sized set of fixes, but a lot of these are one
liner off-by-one and similar type changes:
1) Fix netlink header pointer to calcular bad attribute offset
reported to user. From Pablo Neira Ayuso.
2) Don't double clear PHY interrupts when ->did_interrupt is set,
from Heiner Kallweit.
3) Add missing validation of various (devlink, nl802154, fib, etc.)
attributes, from Jakub Kicinski.
4) Missing *pos increments in various netfilter seq_next ops, from
Vasily Averin.
5) Missing break in of_mdiobus_register() loop, from Dajun Jin.
6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.
7) Work around FMAN erratum A050385, from Madalin Bucur.
8) Make sure ARP header is pulled early enough in bonding driver,
from Eric Dumazet.
9) Do a cond_resched() during multicast processing of ipvlan and
macvlan, from Mahesh Bandewar.
10) Don't attach cgroups to unrelated sockets when in interrupt
context, from Shakeel Butt.
11) Fix tpacket ring state management when encountering unknown GSO
types. From Willem de Bruijn.
12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
only in the suspend context. From Heiner Kallweit"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
net: systemport: fix index check to avoid an array out of bounds access
tc-testing: add ETS scheduler to tdc build configuration
net: phy: fix MDIO bus PM PHY resuming
net: hns3: clear port base VLAN when unload PF
net: hns3: fix RMW issue for VLAN filter switch
net: hns3: fix VF VLAN table entries inconsistent issue
net: hns3: fix "tc qdisc del" failed issue
taprio: Fix sending packets without dequeueing them
net: mvmdio: avoid error message for optional IRQ
net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
net: memcg: fix lockdep splat in inet_csk_accept()
s390/qeth: implement smarter resizing of the RX buffer pool
s390/qeth: refactor buffer pool code
s390/qeth: use page pointers to manage RX buffer pool
seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
net/packet: tpacket_rcv: do not increment ring index on drop
sxgbe: Fix off by one in samsung driver strncpy size arg
net: caif: Add lockdep expression to RCU traversal primitive
MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
...
If a TCP socket is allocated in IRQ context or cloned from unassociated
(i.e. not associated to a memcg) in IRQ context then it will remain
unassociated for its whole life. Almost half of the TCPs created on the
system are created in IRQ context, so, memory used by such sockets will
not be accounted by the memcg.
This issue is more widespread in cgroup v1 where network memory
accounting is opt-in but it can happen in cgroup v2 if the source socket
for the cloning was created in root memcg.
To fix the issue, just do the association of the sockets at the accept()
time in the process context and then force charge the memory buffer
already used and reserved by the socket.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are testing network memory accounting in our setup and noticed
inconsistent network memory usage and often unrelated cgroups network
usage correlates with testing workload. On further inspection, it
seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
irq context specially for cgroup v1.
mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
and kind of assumes that this can only happen from sk_clone_lock()
and the source sock object has already associated cgroup. However in
cgroup v1, where network memory accounting is opt-in, the source sock
can be unassociated with any cgroup and the new cloned sock can get
associated with unrelated interrupted cgroup.
Cgroup v2 can also suffer if the source sock object was created by
process in the root cgroup or if sk_alloc() is called in irq context.
The fix is to just do nothing in interrupt.
WARNING: Please note that about half of the TCP sockets are allocated
from the IRQ context, so, memory used by such sockets will not be
accouted by the memcg.
The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
Hardware name: ...
Call Trace:
<IRQ>
dump_stack+0x57/0x75
mem_cgroup_sk_alloc+0xe9/0xf0
sk_clone_lock+0x2a7/0x420
inet_csk_clone_lock+0x1b/0x110
tcp_create_openreq_child+0x23/0x3b0
tcp_v6_syn_recv_sock+0x88/0x730
tcp_check_req+0x429/0x560
tcp_v6_rcv+0x72d/0xa40
ip6_protocol_deliver_rcu+0xc9/0x400
ip6_input+0x44/0xd0
? ip6_protocol_deliver_rcu+0x400/0x400
ip6_rcv_finish+0x71/0x80
ipv6_rcv+0x5b/0xe0
? ip6_sublist_rcv+0x2e0/0x2e0
process_backlog+0x108/0x1e0
net_rx_action+0x26b/0x460
__do_softirq+0x104/0x2a6
do_softirq_own_stack+0x2a/0x40
</IRQ>
do_softirq.part.19+0x40/0x50
__local_bh_enable_ip+0x51/0x60
ip6_finish_output2+0x23d/0x520
? ip6table_mangle_hook+0x55/0x160
__ip6_finish_output+0xa1/0x100
ip6_finish_output+0x30/0xd0
ip6_output+0x73/0x120
? __ip6_finish_output+0x100/0x100
ip6_xmit+0x2e3/0x600
? ipv6_anycast_cleanup+0x50/0x50
? inet6_csk_route_socket+0x136/0x1e0
? skb_free_head+0x1e/0x30
inet6_csk_xmit+0x95/0xf0
__tcp_transmit_skb+0x5b4/0xb20
__tcp_send_ack.part.60+0xa3/0x110
tcp_send_ack+0x1d/0x20
tcp_rcv_state_process+0xe64/0xe80
? tcp_v6_connect+0x5d1/0x5f0
tcp_v6_do_rcv+0x1b1/0x3f0
? tcp_v6_do_rcv+0x1b1/0x3f0
__release_sock+0x7f/0xd0
release_sock+0x30/0xa0
__inet_stream_connect+0x1c3/0x3b0
? prepare_to_wait+0xb0/0xb0
inet_stream_connect+0x3b/0x60
__sys_connect+0x101/0x120
? __sys_getsockopt+0x11b/0x140
__x64_sys_connect+0x1a/0x20
do_syscall_64+0x51/0x200
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
Fixes: 2d75807383 ("mm: memcontrol: consolidate cgroup socket tracking")
Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit cd02cf1ace ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
fixed memory hotplug with debug_pagealloc enabled, where onlining a page
goes through page freeing, which removes the direct mapping. Some arches
don't like when the page is not mapped in the first place, so
generic_online_page() maps it first. This is somewhat wasteful, but
better than special casing page freeing fast paths.
The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
it's actually enabled. One has to test debug_pagealloc_enabled() since
031bc5743f ("mm/debug-pagealloc: make debug-pagealloc boottime
configurable"), or alternatively debug_pagealloc_enabled_static() since
8e57f8acbb ("mm, debug_pagealloc: don't rely on static keys too early"),
but this is not done.
As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
will crash:
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
Oops: 0004 ilc:2 [#1] SMP
CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
0000000000000001 0000000000000000 0000000000000002 0000000000000100
0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
>0000001ecd281b9e: 94fb5006 ni 6(%r5),251
0000001ecd281ba2: 41505008 la %r5,8(%r5)
0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
0000001ecd281bac: 1a07 ar %r0,%r7
0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
Call Trace:
[<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188
[<0000001ecd4d9516>] online_pages_range+0xf6/0x128
[<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8
[<0000001ecda28aae>] online_pages+0x2fe/0x3f0
[<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0
[<0000001ecd7add42>] device_online+0x5a/0xc8
[<0000001ecd7d0430>] state_store+0x88/0x118
[<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200
[<0000001ecd5064b6>] vfs_write+0x176/0x1e0
[<0000001ecd50676a>] ksys_write+0xa2/0x100
[<0000001ecda315d4>] system_call+0xd8/0x2c8
Fix this by checking debug_pagealloc_enabled_static() before calling
kernel_map_pages(). Backports for kernel before 5.5 should use
debug_pagealloc_enabled() instead. Also add comments.
Fixes: cd02cf1ace ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rwlock.h should not be included directly. Instead linux/splinlock.h
should be included. One thing it does is to break the RT build.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200224133631.1510569-1-bigeasy@linutronix.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Moyer has reported that one of xfstests triggers a warning when run
on DAX-enabled filesystem:
WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
...
wp_page_copy+0x98c/0xd50 (unreliable)
do_wp_page+0xd8/0xad0
__handle_mm_fault+0x748/0x1b90
handle_mm_fault+0x120/0x1f0
__do_page_fault+0x240/0xd70
do_page_fault+0x38/0xd0
handle_page_fault+0x10/0x30
The warning happens on failed __copy_from_user_inatomic() which tries to
copy data into a CoW page.
This happens because of race between MADV_DONTNEED and CoW page fault:
CPU0 CPU1
handle_mm_fault()
do_wp_page()
wp_page_copy()
do_wp_page()
madvise(MADV_DONTNEED)
zap_page_range()
zap_pte_range()
ptep_get_and_clear_full()
<TLB flush>
__copy_from_user_inatomic()
sees empty PTE and fails
WARN_ON_ONCE(1)
clear_page()
The solution is to re-try __copy_from_user_inatomic() under PTL after
checking that PTE is matches the orig_pte.
The second copy attempt can still fail, like due to non-readable PTE, but
there's nothing reasonable we can do about, except clearing the CoW page.
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Justin He <Justin.He@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In set_pmd_migration_entry(), pmdp_invalidate() is used to change PMD
atomically. But the PMD is read before that with an ordinary memory
reading. If the THP (transparent huge page) is written between the PMD
reading and pmdp_invalidate(), the PMD dirty bit may be lost, and cause
data corruption. The race window is quite small, but still possible in
theory, so need to be fixed.
The race is fixed via using the return value of pmdp_invalidate() to get
the original content of PMD, which is a read/modify/write atomic
operation. So no THP writing can occur in between.
The race has been introduced when the THP migration support is added in
the commit 616b837153 ("mm: thp: enable thp migration in generic path").
But this fix depends on the commit d52605d7cb ("mm: do not lose dirty
and accessed bits in pmdp_invalidate()"). So it's easy to be backported
after v4.16. But the race window is really small, so it may be fine not
to backport the fix at all.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: http://lkml.kernel.org/r/20200220075220.2327056-1-ying.huang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
: A user reported a bug against a distribution kernel while running a
: proprietary workload described as "memory intensive that is not swapping"
: that is expected to apply to mainline kernels. The workload is
: read/write/modifying ranges of memory and checking the contents. They
: reported that within a few hours that a bad PMD would be reported followed
: by a memory corruption where expected data was all zeros. A partial
: report of the bad PMD looked like
:
: [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
: [ 5195.341184] ------------[ cut here ]------------
: [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
: ....
: [ 5195.410033] Call Trace:
: [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
: [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
: [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
: [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90
: [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
: [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
: [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10
:
: Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
: was a false detection. The bug does not trigger if automatic NUMA
: balancing or transparent huge pages is disabled.
:
: The bug is due a race in change_pmd_range between a pmd_trans_huge and
: pmd_nond_or_clear_bad check without any locks held. During the
: pmd_trans_huge check, a parallel protection update under lock can have
: cleared the PMD and filled it with a prot_numa entry between the transhuge
: check and the pmd_none_or_clear_bad check.
:
: While this could be fixed with heavy locking, it's only necessary to make
: a copy of the PMD on the stack during change_pmd_range and avoid races. A
: new helper is created for this as the check if quite subtle and the
: existing similar helpful is not suitable. This passed 154 hours of
: testing (usually triggers between 20 minutes and 24 hours) without
: detecting bad PMDs or corruption. A basic test of an autonuma-intensive
: workload showed no significant change in behaviour.
Although Mel withdrew the patch on the face of LKML comment
https://lkml.org/lkml/2017/4/10/922 the race window aforementioned is
still open, and we have reports of Linpack test reporting bad residuals
after the bad PMD warning is observed. In addition to that, bad
rss-counter and non-zero pgtables assertions are triggered on mm teardown
for the task hitting the bad PMD.
host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
....
host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
The issue is observed on a v4.18-based distribution kernel, but the race
window is expected to be applicable to mainline kernels, as well.
[akpm@linux-foundation.org: fix comment typo, per Rafael]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull tmpfs fix from Al Viro:
"Regression from fs_parse series this cycle..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
tmpfs: deny and force are not huge mount options
- Fix regression in malloc() caused by ignored address tags in brk()
- Add missing brackets around argument to untagged_addr() macro
- Fix clang build when using binutils assembler
- Fix silly typo in virtual memory map documentation
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl5P2VIQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNNn9CACN/C0aTsRT+22ABPahHcnnyQgsETMOS3Up
M/edlPMUI5qK8IcIBt/PKswzBKlwMpI/pCWxfn/kwdq9u0ho3IASnqtaBVcm7yjt
d/4DX5GhwJBdv6q6N+vjacrqs3e/xCiDiWqLvhEVZXTFuDxNMziCfloP6sDBxmYu
E0+zuZnMbVemgV7USo+7QXMeNb7kFwP6fmJN0cr/FG7N4orms2zygs6mhg/ogpkH
zdl7Ze6DdC5+ChLpLhGXEuA2+Gyv+tWoE7A1EXnTGSEEQXmH7FkaZOJxAuSbWgw6
8Gcul+0JSHRBHN876oqS9aSr88ZiZDZkccC2gLW2Off6vvv8Rgog
=ehao
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"It's all straightforward apart from the changes to mmap()/mremap() in
relation to their handling of address arguments from userspace with
non-zero tag bits in the upper byte.
The change to brk() is necessary to fix a nasty user-visible
regression in malloc(), but we tightened up mmap() and mremap() at the
same time because they also allow the user to create virtual aliases
by accident. It's much less likely than brk() to matter in practice,
but enforcing the principle of "don't permit the creation of mappings
using tagged addresses" leads to a straightforward ABI without having
to worry about the "but what if a crazy program did foo?" aspect of
things.
Summary:
- Fix regression in malloc() caused by ignored address tags in brk()
- Add missing brackets around argument to untagged_addr() macro
- Fix clang build when using binutils assembler
- Fix silly typo in virtual memory map documentation"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()
docs: arm64: fix trivial spelling enought to enough in memory.rst
arm64: memory: Add missing brackets to untagged_addr() macro
arm64: lse: Fix LSE atomics with LLVM
Commit 68600f623d ("mm: don't miss the last page because of round-off
error") makes the scan size round up to @denominator regardless of the
memory cgroup's state, online or offline. This affects the overall
reclaiming behavior: the corresponding LRU list is eligible for
reclaiming only when its size logically right shifted by @sc->priority
is bigger than zero in the former formula.
For example, the inactive anonymous LRU list should have at least 0x4000
pages to be eligible for reclaiming when we have 60/12 for
swappiness/priority and without taking scan/rotation ratio into account.
After the roundup is applied, the inactive anonymous LRU list becomes
eligible for reclaiming when its size is bigger than or equal to 0x1000
in the same condition.
(0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
aarch64 has 512MB huge page size when the base page size is 64KB. The
memory cgroup that has a huge page is always eligible for reclaiming in
that case.
The reclaiming is likely to stop after the huge page is reclaimed,
meaing the further iteration on @sc->priority and the silbing and child
memory cgroups will be skipped. The overall behaviour has been changed.
This fixes the issue by applying the roundup to offlined memory cgroups
only, to give more preference to reclaim memory from offlined memory
cgroup. It sounds reasonable as those memory is unlikedly to be used by
anyone.
The issue was found by starting up 8 VMs on a Ampere Mustang machine,
which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and
2GB memory. It took 264 seconds for all VMs to be completely up and
784MB swap is consumed after that. With this patch applied, it took 236
seconds and 60MB swap to do same thing. So there is 10% performance
improvement for my case. Note that KSM is disable while THP is enabled
in the testing.
total used free shared buff/cache available
Mem: 16196 10065 2049 16 4081 3749
Swap: 8175 784 7391
total used free shared buff/cache available
Mem: 16196 11324 3656 24 1215 2936
Swap: 8175 60 8115
Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
Fixes: 68600f623d ("mm: don't miss the last page because of round-off error")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org> [4.20+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
for_each_mem_cgroup() increases css reference counter for memory cgroup
and requires to use mem_cgroup_iter_break() if the walk is cancelled.
Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
Fixes: 0a4465d340 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "sub-section memory hotplug" facility allows memremap_pages() users
like libnvdimm to compensate for hardware platforms like x86 that have a
section size larger than their hardware memory mapping granularity. The
compensation that sub-section support affords is being tolerant of
physical memory resources shifting by units smaller (64MiB on x86) than
the memory-hotplug section size (128 MiB). Where the platform
physical-memory mapping granularity is limited by the number and
capability of address-decode-registers in the memory controller.
While the sub-section support allows memremap_pages() to operate on
sub-section (2MiB) granularity, the Power architecture may still
require 16MiB alignment on "!radix_enabled()" platforms.
In order for libnvdimm to be able to detect and manage this per-arch
limitation, introduce memremap_compat_align() as a common minimum
alignment across all driver-facing memory-mapping interfaces, and let
Power override it to 16MiB in the "!radix_enabled()" case.
The assumption / requirement for 16MiB to be a viable
memremap_compat_align() value is that Power does not have platforms
where its equivalent of address-decode-registers never hardware remaps a
persistent memory resource on smaller than 16MiB boundaries. Note that I
tried my best to not add a new Kconfig symbol, but header include
entanglements defeated the #ifndef memremap_compat_align design pattern
and the need to export it defeats the __weak design pattern for arch
overrides.
Based on an initial patch by Aneesh.
Link: http://lore.kernel.org/r/CAPcyv4gBGNP95APYaBcsocEa50tQj9b5h__83vgngjq3ouGX_Q@mail.gmail.com
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Currently the arm64 kernel ignores the top address byte passed to brk(),
mmap() and mremap(). When the user is not aware of the 56-bit address
limit or relies on the kernel to return an error, untagging such
pointers has the potential to create address aliases in user-space.
Passing a tagged address to munmap(), madvise() is permitted since the
tagged pointer is expected to be inside an existing mapping.
The current behaviour breaks the existing glibc malloc() implementation
which relies on brk() with an address beyond 56-bit to be rejected by
the kernel.
Remove untagging in the above functions by partially reverting commit
ce18d171cb ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
addition, update the arm64 tagged-address-abi.rst document accordingly.
Link: https://bugzilla.redhat.com/1797052
Fixes: ce18d171cb ("mm: untag user pointers in mmap/munmap/mremap/brk")
Cc: <stable@vger.kernel.org> # 5.4.x-
Cc: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Victor Stinner <vstinner@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
5.6-rc1 commit 2710c957a8 ("fs_parse: get rid of ->enums") regressed
the huge tmpfs mount options to an earlier state: "deny" and "force"
are not valid there, and can crash the kernel. Delete those lines.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently x86 numa_meminfo is marked __initdata in the
CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow
drivers to map reserved memory to a 'target_node'
(phys_to_target_node()), add support for removing the __initdata
designation for those users. Both memory hotplug and
phys_to_target_node() users select CONFIG_NUMA_KEEP_MEMINFO to tell the
arch to maintain its physical address to NUMA mapping infrastructure
post init.
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/158188326422.894464.15742054998046628934.stgit@dwillia2-desk3.amr.corp.intel.com
The acpi_map_pxm_to_online_node() helper is used to find the closest
online node to a given proximity domain. This is used to map devices in
a proximity domain with no online memory or cpus to the closest online
node and populate a device's 'numa_node' property. The numa_node
property allows applications to be migrated "close" to a resource.
In preparation for providing a generic facility to optionally map an
address range to its closest online node, or the node the range would
represent were it to be onlined (target_node), up-level the core of
acpi_map_pxm_to_online_node() to a generic mm/numa helper.
Cc: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/158188324802.894464.13128795207831894206.stgit@dwillia2-desk3.amr.corp.intel.com
Pull vfs file system parameter updates from Al Viro:
"Saner fs_parser.c guts and data structures. The system-wide registry
of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
the horror switch() in fs_parse() that would have to grow another case
every time something got added to that system-wide registry.
New syntax types can be added by filesystems easily now, and their
namespace is that of functions - not of system-wide enum members. IOW,
they can be shared or kept private and if some turn out to be widely
useful, we can make them common library helpers, etc., without having
to do anything whatsoever to fs_parse() itself.
And we already get that kind of requests - the thing that finally
pushed me into doing that was "oh, and let's add one for timeouts -
things like 15s or 2h". If some filesystem really wants that, let them
do it. Without somebody having to play gatekeeper for the variants
blessed by direct support in fs_parse(), TYVM.
Quite a bit of boilerplate is gone. And IMO the data structures make a
lot more sense now. -200LoC, while we are at it"
* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
tmpfs: switch to use of invalfc()
cgroup1: switch to use of errorfc() et.al.
procfs: switch to use of invalfc()
hugetlbfs: switch to use of invalfc()
cramfs: switch to use of errofc() et.al.
gfs2: switch to use of errorfc() et.al.
fuse: switch to use errorfc() et.al.
ceph: use errorfc() and friends instead of spelling the prefix out
prefix-handling analogues of errorf() and friends
turn fs_param_is_... into functions
fs_parse: handle optional arguments sanely
fs_parse: fold fs_parameter_desc/fs_parameter_spec
fs_parser: remove fs_parameter_description name field
add prefix to fs_context->log
ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
new primitive: __fs_parse()
switch rbd and libceph to p_log-based primitives
struct p_log, variants of warnf() et.al. taking that one instead
teach logfc() to handle prefices, give it saner calling conventions
get rid of cg_invalf()
...
Pull misc vfs updates from Al Viro:
- bmap series from cmaiolino
- getting rid of convolutions in copy_mount_options() (use a couple of
copy_from_user() instead of the __get_user() crap)
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
saner copy_mount_options()
fibmap: Reject negative block numbers
fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
ecryptfs: drop direct calls to ->bmap
cachefiles: drop direct usage of ->bmap method.
fs: Enable bmap() function to properly return errors
Unused now.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't do a single array; attach them to fsparam_enum() entry
instead. And don't bother trying to embed the names into those -
it actually loses memory, with no real speedup worth mentioning.
Simplifies validation as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge more updates from Andrew Morton:
"The rest of MM and the rest of everything else: hotfixes, ipc, misc,
procfs, lib, cleanups, arm"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (67 commits)
ARM: dma-api: fix max_pfn off-by-one error in __dma_supported()
treewide: remove redundant IS_ERR() before error code check
include/linux/cpumask.h: don't calculate length of the input string
lib: new testcases for bitmap_parse{_user}
lib: rework bitmap_parse()
lib: make bitmap_parse_user a wrapper on bitmap_parse
lib: add test for bitmap_parse()
bitops: more BITS_TO_* macros
lib/string: add strnchrnul()
proc: convert everything to "struct proc_ops"
proc: decouple proc from VFS with "struct proc_ops"
asm-generic/tlb: provide MMU_GATHER_TABLE_FREE
asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER
asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE
asm-generic/tlb: rename HAVE_RCU_TABLE_FREE
asm-generic/tlb: add missing CONFIG symbol
asm-gemeric/tlb: remove stray function declarations
asm-generic/tlb: avoid potential double flush
mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJeOKqlAAoJEAx081l5xIa+NKoP/1hhIF6iV6QwNxCvNukhiuNc
VhS9LPllEQ3G86QE+CfPWmFmh1IBnbJi3CwigC1UX4nA2XoN2Eccm8f5BmitG+Dm
YZNZBk8e1z9E68gqpdjA2qr3hddB8ZWcWW8SNqL0PID1UMdXlVr5QM+2Tw7flTQp
YsajCwssMVZWIgWya+n+A//Qdu5z95KC0ycLjIrT1g+Hl1dxUUixqcbEWppE2wrF
mJqTBJsfrPo2Njb6PFUKvUmzAZPJGWxLp4cGYJhHgqNrEtwJmmwhWpUDd+BpfxHn
jq+ir0ALAph+quWdtouArrf1ibS+uipQl+7uaIW5J963czxxNRr7eNl5KP9gaHod
zmtplXlw0ruNgCrDhRIJ4B4SB5M2nAZk7Y/xeHK+GwaCOGVt4rCaBWGBz/lorm7u
9phhygqYKdoIJc3JmXcKcXZHgLWuGgobWmYFlA/vNloXMeY5C5LPgDmAKIW40nut
pKg9iGH6fJoN0HUcECPIbv1bterZy7YKK6OWl9TRHfMnabLJXNCej8hcEnf6KQ2l
spDO//L8tICBftrcZdJFunzoFFlTavF18XBqm9ZdGNfNo9BIipwFJQEQDahVKHSM
6Du0kmuVoB02QLPXEUpA6+W5rFw7M5Qzi47pUnzM/VFJFkFx/eSsfs119aXJRXtc
jIgI1vHIPFCX8Zg4SN/O
=uiav
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm
Pull drm ttm/mm updates from Dave Airlie:
"Thomas Hellstrom has some more changes to the TTM layer that needed a
patch to the mm subsystem.
This adds a new mm API vmf_insert_mixed_prot to avoid an ugly hack
that has limitations in the TTM layer"
* tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm:
mm, drm/ttm: Fix vm page protection handling
mm: Add a vmf_insert_mixed_prot() function
As described in the comment, the correct order for freeing pages is:
1) unhook page
2) TLB invalidate page
3) free page
This order equally applies to page directories.
Currently there are two correct options:
- use tlb_remove_page(), when all page directores are full pages and
there are no futher contraints placed by things like software
walkers (HAVE_FAST_GUP).
- use MMU_GATHER_RCU_TABLE_FREE and tlb_remove_table() when the
architecture does not do IPI based TLB invalidate and has
HAVE_FAST_GUP (or software TLB fill).
This however leaves architectures that don't have page based directories
but don't need RCU in a bind. For those, provide MMU_GATHER_TABLE_FREE,
which provides the independent batching for directories without the
additional RCU freeing.
Link: http://lkml.kernel.org/r/20200116064531.483522-10-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Towards a more consistent naming scheme.
Link: http://lkml.kernel.org/r/20200116064531.483522-9-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Towards a more consistent naming scheme.
Link: http://lkml.kernel.org/r/20200116064531.483522-8-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Architectures for which we have hardware walkers of Linux page table
should flush TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER supports multiple translation modes (hash
and radix) and in the case of POWER only radix translation mode needs the
above TLBI. This is because for hash translation mode kernel wants to
avoid this extra flush since there are no hardware walkers of linux page
table. With radix translation, the hardware also walks linux page table
and with that, kernel needs to make sure to TLB invalidate page walk cache
before page table pages are freed.
More details in commit d86564a2f0 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")
The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE. The default value for
tlb_needs_table_invalidate is to always force an invalidate and sparc can
avoid the table invalidate. Hence we define tlb_needs_table_invalidate to
false for sparc architecture.
Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
Fixes: a46cc7a90f ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct mm_struct is quite large (~1664 bytes) and so allocating on the
stack may cause problems as the kernel stack size is small.
Since ptdump_walk_pgd_level_core() was only allocating the structure so
that it could modify the pgd argument we can instead introduce a pgd
override in struct mm_walk and pass this down the call stack to where it
is needed.
Since the correct mm_struct is now being passed down, it is now also
unnecessary to take the mmap_sem semaphore because ptdump_walk_pgd() will
now take the semaphore on the real mm.
[steven.price@arm.com: restore missed arm64 changes]
Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than having to increment the 'depth' number by 1 in ptdump_hole(),
let's change the meaning of 'level' in note_page() since that makes the
code simplier.
Note that for x86, the level numbers were previously increased by 1 in
commit 45dcd20913 ("x86/mm/dump_pagetables: Fix printout of p4d level")
and the comment "Bit 7 has a different meaning" was not updated, so this
change also makes the code match the comment again.
Link: http://lkml.kernel.org/r/20191218162402.45610-24-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a generic version of page table dumping that architectures can opt-in
to.
Link: http://lkml.kernel.org/r/20191218162402.45610-20-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know what at what depth the
missing entry is. Add this is an extra parameter to pte_hole(). When the
depth isn't know (e.g. processing a vma) then -1 is passed.
The depth that is reported is the actual level where the entry is missing
(ignoring any folding that is in place), i.e. any levels where
PTRS_PER_P?D is set to 1 are ignored.
Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.
Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Tested-by: Zong Li <zong.li@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If walk_pte_range() is called with a 'end' argument that is beyond the
last page of memory (e.g. ~0UL) then the comparison between 'addr' and
'end' will always fail and the loop will be infinite. Instead change the
comparison to >= while accounting for overflow.
Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
walk_page_range_novma() can be used to walk page tables or the kernel or
for firmware. These page tables may contain entries that are not backed
by a struct page and so it isn't (in general) possible to take the PTE
lock for the pte_entry() callback. So update walk_pte_range() to only
take the lock when no_vma==false by splitting out the inner loop to a
separate function and add a comment explaining the difference to
walk_page_range_novma().
Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 48684a65b4: "mm: pagewalk: fix misbehavior of walk_page_range for
vma(VM_PFNMAP)", page_table_walk() will report any kernel area as a hole,
because it lacks a vma.
This means each arch has re-implemented page table walking when needed,
for example in the per-arch ptdump walker.
Remove the requirement to have a vma in the generic code and add a new
function walk_page_range_novma() which ignores the VMAs and simply walks
the page tables.
Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pgd_entry() and pud_entry() were removed by commit 0b1fbfe500
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
users. We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.
Note that commit a00cc7d9dd ("mm, x86: add support for PUD-sized
transparent hugepages") already re-added pud_entry() but with different
semantics to the other callbacks. This commit reverts the semantics back
to match the other callbacks.
To support hmm.c which now uses the new semantics of pud_entry() a new
member ('action') of struct mm_walk is added which allows the callbacks to
either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
repeat the callback (ACTION_AGAIN). hmm.c is then updated to call
pud_trans_huge_lock() itself and make use of the splitting/retry logic of
the core code.
After this change pud_entry() is called for all entries, not just
transparent huge pages.
[arnd@arndb.de: fix unused variable warning]
Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 5.5-rc1 the last user of this function is gone, so remove the
functionality.
See commit
2ad9d7747c ("netfilter: conntrack: free extension area immediately")
for details.
Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.de
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callers are only interested in the actual zone, they don't care about
boundaries. Return the zone instead to simplify.
Link: http://lkml.kernel.org/r/20200110183308.11849-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's drop the basically unused section stuff and simplify.
Also, let's use a shorter variant to calculate the number of pages to
the next section boundary.
Link: http://lkml.kernel.org/r/20191006085646.5768-11-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Get rid of the unnecessary local variables.
Link: http://lkml.kernel.org/r/20191006085646.5768-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we have holes, the holes will automatically get detected and removed
once we remove the next bigger/smaller section. The extra checks can go.
Link: http://lkml.kernel.org/r/20191006085646.5768-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With shrink_pgdat_span() out of the way, we now always have a valid zone.
Link: http://lkml.kernel.org/r/20191006085646.5768-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's poison the pages similar to when adding new memory in
sparse_add_section(). Also call remove_pfn_range_from_zone() from
memunmap_pages(), so we can poison the memmap from there as well.
Link: http://lkml.kernel.org/r/20191006085646.5768-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.
This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory. Also, it contains all fixes for
crashes that can be triggered when removing certain namespace using
memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect the
ZONE_DEVICE as contiguous).
We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
code to a minimum. Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of DIMMs
at zone boundaries.
--------------------------------------------------------------------------
Zones are now properly shrunk when offlining memory blocks or when
onlining failed. This allows to properly shrink zones on memory unplug
even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.
Example:
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 98304
present 65536
managed 65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 32768
present 32768
managed 32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
This patch (of 6):
The third argument is actually number of pages. Change the variable name
from size to nr_pages to indicate this better.
No functional change in this patch.
Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's move it to the header and use the shorter variant from
mm/page_alloc.c (the original one will also check
"__highest_present_section_nr + 1", which is not necessary). While at
it, make the section_nr in next_pfn() const.
In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
exceed __highest_present_section_nr, which doesn't make a difference in
the caller as it is big enough (>= all sane end_pfn).
Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's update the pfn manually whenever we continue the loop. This makes
the code easier to read but also less error prone (and we can directly fix
one issue).
When overlap_memmap_init() returns true, pfn is updated to
"memblock_region_memory_end_pfn(r)". So it already points at the *next*
pfn to process. Incrementing the pfn another time is wrong, we might
leave one uninitialized. I spotted this by inspecting the code, so I have
no idea if this is relevant in practise (with kernelcore=mirror).
Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
Fixes: a9a9e77fbf ("mm: move mirrored memory specific code outside of memmap_init_zone")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's make sure that all memory holes are actually marked PageReserved(),
that page_to_pfn() produces reliable results, and that these pages are not
detected as "mmap" pages due to the mapcount.
E.g., booting a x86-64 QEMU guest with 4160 MB:
[ 0.010585] Early memory node ranges
[ 0.010586] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.010588] node 0: [mem 0x0000000000100000-0x00000000bffdefff]
[ 0.010589] node 0: [mem 0x0000000100000000-0x0000000143ffffff]
max_pfn is 0x144000.
Before this change:
[root@localhost ~]# ./page-types -r -a 0x144000,
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000800 16384 64 ___________M_______________________________ mmap
total 16384 64
After this change:
[root@localhost ~]# ./page-types -r -a 0x144000,
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000100000000 16384 64 ___________________________r_______________ reserved
total 16384 64
IOW, especially the unavailable physical memory ("memory hole") in the
last section would not get properly marked PageReserved() and is indicated
to be "mmap" memory.
Drop the trace of that function from include/linux/mm.h - nobody else
needs it, and rename it accordingly.
Note: The fake zone/node might not be covered by the zone/node span. This
is not an urgent issue (for now, we had the same node/zone due to the
zeroing). We'll need a clean way to mark memory holes (e.g., using a page
type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
marking these memory holes PageReserved().
Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: fix max_pfn not falling on section boundary", v2.
Playing with different memory sizes for a x86-64 guest, I discovered that
some memmaps (highest section if max_mem does not fall on the section
boundary) are marked as being valid and online, but contain garbage. We
have to properly initialize these memmaps.
Looking at /proc/kpageflags and friends, I found some more issues,
partially related to this.
This patch (of 3):
If max_pfn is not aligned to a section boundary, we can easily run into
BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
memory size that is not a multiple of 128MB (e.g., 4097MB, but also
4160MB). I was told that on real HW, we can easily have this scenario
(esp., one of the main reasons sub-section hotadd of devmem was added).
The issue is, that we have a valid memmap (pfn_valid()) for the whole
section, and the whole section will be marked "online".
pfn_to_online_page() will succeed, but the memmap contains garbage.
E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
4160M" - (see tools/vm/page-types.c):
[ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[ 200.477500] #PF: supervisor read access in kernel mode
[ 200.478334] #PF: error_code(0x0000) - not-present page
[ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
[ 200.479557] Oops: 0000 [#4] SMP NOPTI
[ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
[ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[ 200.488897] Call Trace:
[ 200.489115] kpageflags_read+0xe9/0x140
[ 200.489447] proc_reg_read+0x3c/0x60
[ 200.489755] vfs_read+0xc2/0x170
[ 200.490037] ksys_pread64+0x65/0xa0
[ 200.490352] do_syscall_64+0x5c/0xa0
[ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe
But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
after cold/hot plugging a DIMM to such a system:
[root@localhost ~]# cat /proc/kpageflags > /dev/null
[ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[ 111.517907] #PF: supervisor read access in kernel mode
[ 111.518333] #PF: error_code(0x0000) - not-present page
[ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
This patch fixes that by at least zero-ing out that memmap (so e.g.,
page_to_pfn() will not crash). Commit 907ec5fca3 ("mm: zero remaining
unavailable struct pages") tried to fix a similar issue, but forgot to
consider this special case.
After this patch, there are still problems to solve. E.g., not all of
these pages falling into a memory hole will actually get initialized later
and set PageReserved - they are only zeroed out - but at least the
immediate crashes are gone. A follow-up patch will take care of this.
Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
Fixes: f7f99100d8 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org> [4.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By now, bmap() will either return the physical block number related to
the requested file offset or 0 in case of error or the requested offset
maps into a hole.
This patch makes the needed changes to enable bmap() to proper return
errors, using the return value as an error return, and now, a pointer
must be passed to bmap() to be filled with the mapped physical block.
It will change the behavior of bmap() on return:
- negative value in case of error
- zero on success or map fell into a hole
In case of a hole, the *block will be zero too
Since this is a prep patch, by now, the only error return is -EINVAL if
->bmap doesn't exist.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull updates from Andrew Morton:
"Most of -mm and quite a number of other subsystems: hotfixes, scripts,
ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.
MM is fairly quiet this time. Holidays, I assume"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
kcov: ignore fault-inject and stacktrace
include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
execve: warn if process starts with executable stack
reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
init/main.c: fix misleading "This architecture does not have kernel memory protection" message
init/main.c: fix quoted value handling in unknown_bootoption
init/main.c: remove unnecessary repair_env_string in do_initcall_level
init/main.c: log arguments and environment passed to init
fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
fs/binfmt_elf.c: coredump: delete duplicated overflow check
fs/binfmt_elf.c: coredump: allocate core ELF header on stack
fs/binfmt_elf.c: make BAD_ADDR() unlikely
fs/binfmt_elf.c: better codegen around current->mm
fs/binfmt_elf.c: don't copy ELF header around
fs/binfmt_elf.c: fix ->start_code calculation
fs/binfmt_elf.c: smaller code generation around auxv vector fill
lib/find_bit.c: uninline helper _find_next_bit()
lib/find_bit.c: join _find_next_bit{_le}
uapi: rename ext2_swab() to swab() and share globally in swab.h
lib/scatterlist.c: adjust indentation in __sg_alloc_table
...
This tag contains a handful of patches that I'd like to target for this merge
window:
* Support for kasan.
* 32-bit physical addresses on rv32i-based systems.
* Support for CONFIG_DEBUG_VIRTUAL
* DT entry for the FU540 GPIO controller, which has recently had a device
driver merged.
These boot a buildroot-based system on QEMU's virt board for me.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAl4xncwTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYibh5D/oCzcibEN/73fHCd5kgoAasX9o3ZmW/
i/GVcVAW7B2E0kpRheGbAisc2/E9o7qFTbji2cVFfApGYgjHV5tUTYbC9zIhxmQq
FW4fKwx/u6QnM8eW7gvOr7do6QFPC86dWqN5LN7g4ZfgamIPFBMUJgX4Ev/0zeJ8
zBZ3CHIGFID7uG8cyVmzO2PwFzedi7CuKVNRXggDcZgYM3+LXToevY05/9Nu1asA
i2T4IrLzB40pBmv5PxRDC1aNpHt5+fOuLb1kNYa8kWG/zj95kmeSirVIyvlbCQgX
VN6Za9z3EH/xu6zD2dlaSDrvbAQdY7fRPuvXyg/ZbJ3Z0daxt0JVj8iXVVWW3juP
9/DeQ/KiNOJ5LwnRjr6/uuxlzqlDNrzp2DSrpGzXvKhfgvDnKObX+HnND78aj/XJ
UPQ7Ef/wgOuM+IYFLYb2rdb65mFZ0Y+F7efAdXQTGlKMkKiGw9ci/oC9j/EG3TG1
cFNUY0mNJKJ8RNMUyvujzp/38si5Q7CN4/v/P80P9DOhOuZvSTW1YSAUUh6VZJEt
XoDrUIrKQPA+vuXcfUCk6ooBQorqzarwKYOilF7Pw7KHy4yZhz4aFbzi/VxJDNGI
p0UGfTB5te5u+l78dFm+Uigq4Q87xZ1byo4VXj9i/Jb2IvbxEbnDWjCHKmeugfUF
9PEtubl9UkKd2Q==
=W7f7
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-5.6-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
"This contains a handful of patches for this merge window:
- Support for kasan
- 32-bit physical addresses on rv32i-based systems
- Support for CONFIG_DEBUG_VIRTUAL
- DT entry for the FU540 GPIO controller, which has recently had a
device driver merged
These boot a buildroot-based system on QEMU's virt board for me"
* tag 'riscv-for-linus-5.6-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: dts: Add DT support for SiFive FU540 GPIO driver
riscv: mm: add support for CONFIG_DEBUG_VIRTUAL
riscv: keep 32-bit kernel to 32-bit phys_addr_t
kasan: Add riscv to KASAN documentation.
riscv: Add KASAN support
kasan: No KASAN's memmove check if archs don't have it.
Don't instrument 3 more files that contain debugging facilities and
produce large amounts of uninteresting coverage for every syscall.
The following snippets are sprinkled all over the place in kcov traces
in a debugging kernel. We already try to disable instrumentation of
stack unwinding code and of most debug facilities. I guess we did not
use fault-inject.c at the time, and stacktrace.c was somehow missed (or
something has changed in kernel/configs). This change both speeds up
kcov (kernel doesn't need to store these PCs, user-space doesn't need to
process them) and frees trace buffer capacity for more useful coverage.
should_fail
lib/fault-inject.c:149
fail_dump
lib/fault-inject.c:45
stack_trace_save
kernel/stacktrace.c:124
stack_trace_consume_entry
kernel/stacktrace.c:86
stack_trace_consume_entry
kernel/stacktrace.c:89
... a hundred frames skipped ...
stack_trace_consume_entry
kernel/stacktrace.c:93
stack_trace_consume_entry
kernel/stacktrace.c:86
Link: http://lkml.kernel.org/r/20200116111449.217744-1-dvyukov@gmail.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "pool" pointer can be NULL at the end of the init_zswap(). (We
would allocate a new pool later in that situation)
So in the error handling then we need to make sure pool is a valid
pointer before calling "zswap_pool_destroy(pool);" because that function
dereferences the argument.
Link: http://lkml.kernel.org/r/20200114050902.og32fkllkod5ycf5@kili.mountain
Fixes: 93d4dfa9fbd0 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zswap will always try to shrink pool when zswap is full. If there is a
high pressure on zswap it will result in flipping pages in and out zswap
pool without any real benefit, and the overall system performance will
drop. The previous discussion on this subject [1] ended up with a
suggestion to implement a sort of hysteresis to refuse taking pages into
zswap pool until it has sufficient space if the limit has been hit.
This is my take on this.
Hysteresis is controlled with a sysfs-configurable parameter (namely,
/sys/kernel/debug/zswap/accept_threhsold_percent). It specifies the
threshold at which zswap would start accepting pages again after it
became full. Setting this parameter to 100 disables the hysteresis and
sets the zswap behavior to pre-hysteresis state.
[1] https://lkml.org/lkml/2019/11/8/949
Link: http://lkml.kernel.org/r/20200108200118.15563-1-vitaly.wool@konsulko.com
Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE)
from start_isolate_page_range(), but should avoid triggering it from
userspace, i.e, from is_mem_section_removable() because it could crash
the system by a non-root user if warn_on_panic is set.
While at it, simplify the code a bit by removing an unnecessary jump
label.
Link: http://lkml.kernel.org/r/20200120163915.1469-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not that hard to trigger lockdep splats by calling printk from
under zone->lock. Most of them are false positives caused by lock
chains introduced early in the boot process and they do not cause any
real problems (although most of the early boot lock dependencies could
happen after boot as well). There are some console drivers which do
allocate from the printk context as well and those should be fixed. In
any case, false positives are not that trivial to workaround and it is
far from optimal to lose lockdep functionality for something that is a
non-issue.
So change has_unmovable_pages() so that it no longer calls dump_page()
itself - instead it returns a "struct page *" of the unmovable page back
to the caller so that in the case of a has_unmovable_pages() failure,
the caller can call dump_page() after releasing zone->lock. Also, make
dump_page() is able to report a CMA page as well, so the reason string
from has_unmovable_pages() can be removed.
Even though has_unmovable_pages doesn't hold any reference to the
returned page this should be reasonably safe for the purpose of
reporting the page (dump_page) because it cannot be hotremoved in the
context of memory unplug. The state of the page might change but that
is the case even with the existing code as zone->lock only plays role
for free pages.
While at it, remove a similar but unnecessary debug-only printk() as
well. A sample of one of those lockdep splats is,
WARNING: possible circular locking dependency detected
------------------------------------------------------
test.sh/8653 is trying to acquire lock:
ffffffff865a4460 (console_owner){-.-.}, at:
console_unlock+0x207/0x750
but task is already holding lock:
ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&(&zone->lock)->rlock){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock+0x2f/0x40
rmqueue_bulk.constprop.21+0xb6/0x1160
get_page_from_freelist+0x898/0x22c0
__alloc_pages_nodemask+0x2f3/0x1cd0
alloc_pages_current+0x9c/0x110
allocate_slab+0x4c6/0x19c0
new_slab+0x46/0x70
___slab_alloc+0x58b/0x960
__slab_alloc+0x43/0x70
__kmalloc+0x3ad/0x4b0
__tty_buffer_request_room+0x100/0x250
tty_insert_flip_string_fixed_flag+0x67/0x110
pty_write+0xa2/0xf0
n_tty_write+0x36b/0x7b0
tty_write+0x284/0x4c0
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
redirected_tty_write+0x6a/0xc0
do_iter_write+0x248/0x2a0
vfs_writev+0x106/0x1e0
do_writev+0xd4/0x180
__x64_sys_writev+0x45/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #2 (&(&port->lock)->rlock){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock_irqsave+0x3a/0x50
tty_port_tty_get+0x20/0x60
tty_port_default_wakeup+0xf/0x30
tty_port_tty_wakeup+0x39/0x40
uart_write_wakeup+0x2a/0x40
serial8250_tx_chars+0x22e/0x440
serial8250_handle_irq.part.8+0x14a/0x170
serial8250_default_handle_irq+0x5c/0x90
serial8250_interrupt+0xa6/0x130
__handle_irq_event_percpu+0x78/0x4f0
handle_irq_event_percpu+0x70/0x100
handle_irq_event+0x5a/0x8b
handle_edge_irq+0x117/0x370
do_IRQ+0x9e/0x1e0
ret_from_intr+0x0/0x2a
cpuidle_enter_state+0x156/0x8e0
cpuidle_enter+0x41/0x70
call_cpuidle+0x5e/0x90
do_idle+0x333/0x370
cpu_startup_entry+0x1d/0x1f
start_secondary+0x290/0x330
secondary_startup_64+0xb6/0xc0
-> #1 (&port_lock_key){-.-.}:
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
_raw_spin_lock_irqsave+0x3a/0x50
serial8250_console_write+0x3e4/0x450
univ8250_console_write+0x4b/0x60
console_unlock+0x501/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5
-> #0 (console_owner){-.-.}:
check_prev_add+0x107/0xea0
validate_chain+0x8fc/0x1200
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
console_unlock+0x269/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5
__offline_isolated_pages.cold.52+0x2f/0x30a
offline_isolated_pages_cb+0x17/0x30
walk_system_ram_range+0xda/0x160
__offline_pages+0x79c/0xa10
offline_pages+0x11/0x20
memory_subsys_offline+0x7e/0xc0
device_offline+0xd5/0x110
state_store+0xc6/0xe0
dev_attr_store+0x3f/0x60
sysfs_kf_write+0x89/0xb0
kernfs_fop_write+0x188/0x240
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
ksys_write+0xc6/0x160
__x64_sys_write+0x43/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Chain exists of:
console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&(&zone->lock)->rlock);
lock(&(&port->lock)->rlock);
lock(&(&zone->lock)->rlock);
lock(console_owner);
*** DEADLOCK ***
9 locks held by test.sh/8653:
#0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
vfs_write+0x25f/0x290
#1: ffff888277618880 (&of->mutex){+.+.}, at:
kernfs_fop_write+0x128/0x240
#2: ffff8898131fc218 (kn->count#115){.+.+}, at:
kernfs_fop_write+0x138/0x240
#3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
lock_device_hotplug_sysfs+0x16/0x50
#4: ffff8884374f4990 (&dev->mutex){....}, at:
device_offline+0x70/0x110
#5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
__offline_pages+0xbf/0xa10
#6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
percpu_down_write+0x87/0x2f0
#7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0
#8: ffffffff865a4920 (console_lock){+.+.}, at:
vprintk_emit+0x100/0x340
stack backtrace:
Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
BIOS U34 05/21/2019
Call Trace:
dump_stack+0x86/0xca
print_circular_bug.cold.31+0x243/0x26e
check_noncircular+0x29e/0x2e0
check_prev_add+0x107/0xea0
validate_chain+0x8fc/0x1200
__lock_acquire+0x5b3/0xb40
lock_acquire+0x126/0x280
console_unlock+0x269/0x750
vprintk_emit+0x10d/0x340
vprintk_default+0x1f/0x30
vprintk_func+0x44/0xd4
printk+0x9f/0xc5
__offline_isolated_pages.cold.52+0x2f/0x30a
offline_isolated_pages_cb+0x17/0x30
walk_system_ram_range+0xda/0x160
__offline_pages+0x79c/0xa10
offline_pages+0x11/0x20
memory_subsys_offline+0x7e/0xc0
device_offline+0xd5/0x110
state_store+0xc6/0xe0
dev_attr_store+0x3f/0x60
sysfs_kf_write+0x89/0xb0
kernfs_fop_write+0x188/0x240
__vfs_write+0x50/0xa0
vfs_write+0x105/0x290
ksys_write+0xc6/0x160
__x64_sys_write+0x43/0x50
do_syscall_64+0xcc/0x76c
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/memory_hotplug: pass in nid to online_pages()".
Simplify onlining code and get rid of find_memory_block(). Pass in the
nid from the memory block we are trying to online directly, instead of
manually looking it up.
This patch (of 2):
No need to lookup the memory block, we can directly pass in the nid.
Link: http://lkml.kernel.org/r/20200113113354.6341-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The jump labels try_prev and none are not really needed in
find_mergeable_anon_vma(), eliminate them to improve readability.
Link: http://lkml.kernel.org/r/1574079844-17493-1-git-send-email-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If thp defrag setting "defer" is used and a newline is *not* used when
writing to the sysfs file, this is interpreted as the "defer+madvise"
option.
This is because we do prefix matching and if five characters are written
without a newline, the current code ends up comparing to the first five
bytes of the "defer+madvise" option and using that instead.
Use the more appropriate sysfs_streq() that handles the trailing newline
for us. Since this doubles as a nice cleanup, do it in enabled_store()
as well.
The current implementation relies on prefix matching: the number of
bytes compared is either the number of bytes written or the length of
the option being compared. With a newline, "defer\n" does not match
"defer+"madvise"; without a newline, however, "defer" is considered to
match "defer+madvise" (prefix matching is only comparing the first five
bytes). End result is that writing "defer" is broken unless it has an
additional trailing character.
This means that writing "madv" in the past would match and set
"madvise". With strict checking, that no longer is the case but it is
unlikely anybody is currently doing this.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001171411020.56385@chino.kir.corp.google.com
Fixes: 21440d7eb9 ("mm, thp: add new defer+madvise defrag option")
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
migrate_vma_insert_page() closely follows the code in:
__handle_mm_fault()
handle_pte_fault()
do_anonymous_page()
Add a call to check_stable_address_space() after locking the page table
entry before inserting a ZONE_DEVICE private zero page mapping similar
to page faulting a new anonymous page.
Link: http://lkml.kernel.org/r/20200107211208.24595-4-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix some comment typos and coding style clean up in preparation for the
next patch. No functional changes.
Link: http://lkml.kernel.org/r/20200107211208.24595-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Addresses passed to walk_page_range() callback functions are already
page aligned and don't need to be masked with PAGE_MASK.
Link: http://lkml.kernel.org/r/20200107211208.24595-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
split_queue_lock protects data in struct deferred_split. We can release
the lock after delete the page from deferred_split_queue.
This patch moves the THP accounting out of the lock protection, which is
introduced in commit 65c453778a ("mm, rmap: account shmem thp pages").
Link: http://lkml.kernel.org/r/20200110025516.23996-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During split huge page, it checks the property of the page. Currently
we do the check on page and head without emphasizing the check is on the
compound page. In case the page passed to split_huge_page_to_list is a
tail page, audience would take some time to think about whether the
check is on compound page or tail page itself.
To make it explicit, use head instead of page for those checks. After
this, audience would be more clear about the checks are on compound page
and the page is used to do the split and dump error message if failed.
Link: http://lkml.kernel.org/r/20200110032610.26499-2-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page could be a tail page, if this is the case, this BUG_ON will
never be triggered.
Link: http://lkml.kernel.org/r/20200110032610.26499-1-richardw.yang@linux.intel.com
Fixes: e9b61f1985 ("thp: reintroduce split_huge_page()")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a process cannot be oom reaped, for whatever reason, currently the
list of locks that are held is currently dumped to the kernel log.
Much more interesting is the stack trace of the victim that cannot be
reaped. If the stack trace is dumped, we have the ability to find
related occurrences in the same kernel code and hopefully solve the
issue that is making it wedged.
Dump the stack trace when a process fails to be oom reaped.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace open function name strings with %s (__func__) in all remaining
memblock_dbg() call sites.
Link: http://lkml.kernel.org/r/1578285510-28261-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On the s390 platform memblock.physmem array is being built by directly
calling into memblock_add_range() which is a low level function not
intended to be used outside of memblock. Hence lets conditionally add
helper functions for physmem array when HAVE_MEMBLOCK_PHYS_MAP is
enabled. Also use MAX_NUMNODES instead of 0 as node ID similar to
memblock_add() and memblock_reserve(). Make memblock_add_range() a
static function as it is no longer getting used outside of memblock.
Link: http://lkml.kernel.org/r/1578283835-21969-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Collin Walling <walling@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Philipp Rudo <prudo@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1b2ffb7896 ("[PATCH] Zone reclaim: Allow modification of zone
reclaim behavior")' defined RECLAIM_OFF/RECLAIM_ZONE, but never use
them, so better to remove them.
[dwagner@suse.de: fix sanity checks enabling]
Link: http://lkml.kernel.org/r/20200116131642.642-1-dwagner@suse.de
[akpm@linux-foundation.org: renumber the bits for neatness]
Link: http://lkml.kernel.org/r/1579005573-58923-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This macro was never used in git history. So better to remove.
Link: http://lkml.kernel.org/r/1579006500-127143-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return value of shrink_node is not used, so remove unnecessary
operations.
Link: http://lkml.kernel.org/r/20191128143524.3223-1-fishland@aliyun.com
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the memory isolate notifier is gone, the parameter is always 0.
Drop it and cleanup has_unmovable_pages().
Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Luckily, we have no users left, so we can get rid of it. Cleanup
set_migratetype_isolate() a little bit.
Link: http://lkml.kernel.org/r/20191114131911.11783-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memmap_init_zone() can be called on the ranges with holes during the
boot. It will skip any non-valid PFNs one-by-one. It works fine as
long as holes are not too big.
But huge holes in the memory map causes a problem. It takes over 20
seconds to walk 32TiB hole. x86-64 with 5-level paging allows for much
larger holes in the memory map which would practically hang the system.
Deferred struct page init doesn't help here. It only works on the
present ranges.
Skipping non-present sections would fix the issue.
Link: http://lkml.kernel.org/r/20191230093828.24613-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
%pa takes into consideration the special types such as resource_size_t.
Use this specifier %instead of explicit casting.
Link: http://lkml.kernel.org/r/20191209165413.56263-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
The current implementation apply comparison as
- normal 4K page: page_pfn <= pfn < page_pfn + 1
- hugetlbfs page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
- THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
in pfn_in_hpage. For hugetlbfs page, it should be page_pfn == pfn
Now, change pfn_in_hpage to pfn_is_match to highlight that comparison is
not only for THP and explicitly compare for these cases.
No impact upon current behavior, just make the code clear. I think it
is important to make the code clear - comparing hugetlbfs page in range
page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusing.
Link: http://lkml.kernel.org/r/1578737885-8890-1-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compound pages handling in mem_cgroup_migrate is more convoluted than
necessary. The state is duplicated in compound variable and the same
could be achieved by PageTransHuge check which is trivial and
hpage_nr_pages is already PageTransHuge aware.
It is much simpler to just use hpage_nr_pages for nr_pages and replace
the local variable by PageTransHuge check directly
Link: http://lkml.kernel.org/r/20191210160450.3395-1-pilgrimtao@gmail.com
Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If seq_file .next fuction does not change position index, read after
some lseek can generate unexpected output.
In Aug 2018 NeilBrown noticed commit 1f4aace60b ("fs/seq_file.c:
simplify seq_file iteration code and interface") "Some ->next functions
do not increment *pos when they return NULL... Note that such ->next
functions are buggy and should be fixed. A simple demonstration is
dd if=/proc/swaps bs=1000 skip=1
Choose any block size larger than the size of /proc/swaps. This will
always show the whole last line of /proc/swaps"
Described problem is still actual. If you make lseek into middle of
last output line following read will output end of last line and whole
last line once again.
$ dd if=/proc/swaps bs=1 # usual output
Filename Type Size Used Priority
/dev/dm-0 partition 4194812 97536 -2
104+0 records in
104+0 records out
104 bytes copied
$ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice
dd: /proc/swaps: cannot skip to specified offset
v/dm-0 partition 4194812 97536 -2
/dev/dm-0 partition 4194812 97536 -2
3+1 records in
3+1 records out
131 bytes copied
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Link: http://lkml.kernel.org/r/bd8cfd7b-ac95-9b91-f9e7-e8438bd5047d@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages. This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.
Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the gup benchmark flags to use the symbolic FOLL_WRITE, instead of a
hard-coded "1" value.
Also, clean up the filtering of gup flags a little, by just doing it
once before issuing any of the get_user_pages*() calls. This makes it
harder to overlook, instead of having little "gup_flags & 1" phrases in
the function calls.
Link: http://lkml.kernel.org/r/20200107224558.2362728-22-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert process_vm_access to use the new pin_user_pages_remote() call,
which sets FOLL_PIN. Setting FOLL_PIN is now required for code that
requires tracking of pinned pages.
Also, release the pages via put_user_page*().
Also, rename "pages" to "pinned_pages", as this makes for easier reading
of process_vm_rw_single_vec().
Link: http://lkml.kernel.org/r/20200107224558.2362728-15-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce pin_user_pages*() variations of get_user_pages*() calls, and
also pin_longterm_pages*() variations.
For now, these are placeholder calls, until the various call sites are
converted to use the correct get_user_pages*() or pin_user_pages*() API.
These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.
pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()
All pages that are pinned via the above calls, must be unpinned via
put_user_page().
The underlying rules are:
* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.
* Call sites that want to indicate that they are going to do DirectIO
("DIO") or something with similar characteristics, should call a
get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
will:
* Start with "pin_user_pages" instead of "get_user_pages". That
makes it easy to find and audit the call sites.
* Set FOLL_PIN
* For pages that are received via FOLL_PIN, those pages must be returned
via put_user_page().
Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in
this documentation. (I've reworded it and expanded upon it.)
Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [Documentation]
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 817be129e6 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().
There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.
Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com
Fixes: 817be129e6 ("mm: validate get_user_pages_fast flags")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
DAX check requirement on vmas.
However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.
Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.
Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals of
get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.
Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
to Dan Williams for helping clarify the DAX refactoring.
Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An upcoming patch changes and complicates the refcounting and especially
the "put page" aspects of it. In order to keep everything clean,
refactor the devmap page release routines:
* Rename put_devmap_managed_page() to page_is_devmap_managed(), and
limit the functionality to "read only": return a bool, with no side
effects.
* Add a new routine, put_devmap_managed_page(), to handle decrementing
the refcount for ZONE_DEVICE pages.
* Change callers (just release_pages() and put_page()) to check
page_is_devmap_managed() before calling the new
put_devmap_managed_page() routine. This is a performance point:
put_page() is a hot path, so we need to avoid non- inline function calls
where possible.
* Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
limit the functionality to unconditionally freeing a devmap page.
This is originally based on a separate patch by Ira Weiny, which applied
to an early version of the put_user_page() experiments. Since then,
Jérôme Glisse suggested the refactoring described above.
Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the removal of the device-public infrastructure there are only 2
->page_free() call backs in the kernel. One of those is a
device-private callback in the nouveau driver, the other is a generic
wakeup needed in the DAX case. In the hopes that all ->page_free()
callbacks can be migrated to common core kernel functionality, move the
device-private specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback. For the other page types just open-code the generic wakeup.
Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
case.
Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An upcoming patch uses try_get_compound_head() more widely, so move it to
the top of gup.c.
Also fix a tiny spelling error and a checkpatch.pl warning.
Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.
Overview:
This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)
A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.
I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
Testing:
* I've done some overall kernel testing (LTP, and a few other goodies),
and some directed testing to exercise some of the changes. And as you
can see, gup_benchmark is enhanced to exercise this. Basically, I've
been able to runtime test the core get_user_pages() and
pin_user_pages() and related routines, but not so much on several of
the call sites--but those are generally just a couple of lines
changed, each.
Not much of the kernel is actually using this, which on one hand
reduces risk quite a lot. But on the other hand, testing coverage
is low. So I'd love it if, in particular, the Infiniband and PowerPC
folks could do a smoke test of this series for me.
Runtime testing for the call sites so far is pretty light:
* io_uring: Some directed tests from liburing exercise this, and
they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and
passes.
* infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
This patch (of 22):
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four times,
and wondering if there are subtle differences.
Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.
Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded by
get_page()/put_page().
Also, further simplify (slightly), by waiting until the the successful
end of each routine, to increment *nr.
Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional change, just leverage the helper function to improve
readability as others.
Link: http://lkml.kernel.org/r/20200113070322.26627-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sorry for not processing for a long time. I met it again.
patch v1 https://lkml.org/lkml/2019/9/20/656
do_machine_check()
do_memory_failure()
memory_failure()
hw_poison_user_mappings()
try_to_unmap()
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
...and now we have a swap entry that indicates that the page entry
refers to a bad (and poisoned) page of memory, but gup_fast() at this
level of the page table was ignoring swap entries, and incorrectly
assuming that "!pxd_none() == valid and present".
And this was not just a poisoned page problem, but a generaly swap entry
problem. So, any swap entry type (device memory migration, numa
migration, or just regular swapping) could lead to the same problem.
Fix this by checking for pxd_present(), instead of pxd_none().
Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At some point filemap_write_and_wait() and
filemap_write_and_wait_range() got the exact same implementation with
the exception of the range being specified in *_range()
Similar to other functions in fs.h which call *_range(..., 0,
LLONG_MAX), change filemap_write_and_wait() to be a static inline which
calls filemap_write_and_wait_range()
Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 76a1850e45 ("mm/debug.c: __dump_page() prints an extra line")
inadvertently removed printing of page flags for pages that are neither
anon nor ksm nor have a mapping. Fix that.
Using pr_cont() again would be a solution, but the commit explicitly
removed its use. Avoiding the danger of mixing up split lines from
multiple CPUs might be beneficial for near-panic dumps like this, so fix
this without reintroducing pr_cont().
Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
Fixes: 76a1850e45 ("mm/debug.c: __dump_page() prints an extra line")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmemleak_lock as a rwlock on RT can possibly be acquired in atomic
context which does work.
Since the kmemleak operation is performed in atomic context make it a
raw_spinlock_t so it can also be acquired on RT. This is used for
debugging and is not enabled by default in a production like environment
(where performance/latency matters) so it makes sense to make it a
raw_spinlock_t instead trying to get rid of the atomic context. Turn
also the kmemleak_object->lock into raw_spinlock_t which is acquired
(nested) while the kmemleak_lock is held.
The time spent in "echo scan > kmemleak" slightly improved on 64core box
with this patch applied after boot.
[bigeasy@linutronix.de: redo the description, update comments. Merge the individual bits: He Zhe did the kmemleak_lock, Liu Haitao the ->lock and Yongxin Liu forwarded Liu's patch.]
Link: http://lkml.kernel.org/r/20191219170834.4tah3prf2gdothz4@linutronix.de
Link: https://lkml.kernel.org/r/20181218150744.GB20197@arrakis.emea.arm.com
Link: https://lkml.kernel.org/r/1542877459-144382-1-git-send-email-zhe.he@windriver.com
Link: https://lkml.kernel.org/r/20190927082230.34152-1-yongxin.liu@windriver.com
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Liu Haitao <haitao.liu@windriver.com>
Signed-off-by: Yongxin Liu <yongxin.liu@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we are already under list_lock, don't call kmalloc(). Otherwise we
will run into a deadlock because kmalloc() also tries to grab the same
lock.
Fix the problem by using a static bitmap instead.
WARNING: possible recursive locking detected
--------------------------------------------
mount-encrypted/4921 is trying to acquire lock:
(&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437
but task is already holding lock:
(&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&n->list_lock)->rlock);
lock(&(&n->list_lock)->rlock);
*** DEADLOCK ***
Link: http://lkml.kernel.org/r/20191108193958.205102-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit a49bd4d716 ("mm, numa: rework do_pages_move"), the
semantic of move_pages() has changed to return the number of
non-migrated pages if they were result of a non-fatal reasons (usually a
busy page).
This was an unintentional change that hasn't been noticed except for LTP
tests which checked for the documented behavior.
There are two ways to go around this change. We can even get back to
the original behavior and return -EAGAIN whenever migrate_pages is not
able to migrate pages due to non-fatal reasons. Another option would be
to simply continue with the changed semantic and extend move_pages
documentation to clarify that -errno is returned on an invalid input or
when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the
number of pages that couldn't have been migrated due to ephemeral
reasons (e.g. page is pinned or locked for other reasons).
This patch implements the second option because this behavior is in
place for some time without anybody complaining and possibly new users
depending on it. Also it allows to have a slightly easier error
handling as the caller knows that it is worth to retry when err > 0.
But since the new semantic would be aborted immediately if migration is
failed due to ephemeral reasons, need include the number of
non-attempted pages in the return value too.
Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: a49bd4d716 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: <stable@vger.kernel.org> [4.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If compound is true, this means it is a PMD mapped THP. Which implies
the page is not linked to any defer list. So the first code chunk will
not be executed.
Also with this reason, it would not be proper to add this page to a
defer list. So the second code chunk is not correct.
Based on this, we should remove the defer list related code.
[yang.shi@linux.alibaba.com: better patch title]
Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
Fixes: 87eaceb3fa ("mm: thp: make deferred split shrinker memcg aware")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org> [5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The daxctl unit test for the dax_kmem driver currently triggers the
(false positive) lockdep splat below. It results from the fact that
remove_memory_block_devices() is invoked under the mem_hotplug_lock()
causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
active state tracking). It is a false positive because the sysfs
attribute path triggering the memory remove is not the same attribute
path associated with memory-block device.
sysfs_break_active_protection() is not applicable since there is no real
deadlock conflict, instead move memory-block device removal outside the
lock. The mem_hotplug_lock() is not needed to synchronize the
memory-block device removal vs the page online state, that is already
handled by lock_device_hotplug(). Specifically, lock_device_hotplug()
is sufficient to allow try_remove_memory() to check the offline state of
the memblocks and be assured that any in progress online attempts are
flushed / blocked by kernfs_drain() / attribute removal.
The add_memory() path safely creates memblock devices under the
mem_hotplug_lock(). There is no kernfs active state synchronization in
the memblock device_register() path, so nothing to fix there.
This change is only possible thanks to the recent change that refactored
memory block device removal out of arch_remove_memory() (commit
4c4b7f9ba9 "mm/memory_hotplug: remove memory block devices before
arch_remove_memory()"), and David's due diligence tracking down the
guarantees afforded by kernfs_drain(). Not flagged for -stable since
this only impacts ongoing development and lockdep validation, not a
runtime issue.
======================================================
WARNING: possible circular locking dependency detected
5.5.0-rc3+ #230 Tainted: G OE
------------------------------------------------------
lt-daxctl/6459 is trying to acquire lock:
ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80
but task is already holding lock:
ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (mem_hotplug_lock.rw_sem){++++}:
__lock_acquire+0x39c/0x790
lock_acquire+0xa2/0x1b0
get_online_mems+0x3e/0xb0
kmem_cache_create_usercopy+0x2e/0x260
kmem_cache_create+0x12/0x20
ptlock_cache_init+0x20/0x28
start_kernel+0x243/0x547
secondary_startup_64+0xb6/0xc0
-> #1 (cpu_hotplug_lock.rw_sem){++++}:
__lock_acquire+0x39c/0x790
lock_acquire+0xa2/0x1b0
cpus_read_lock+0x3e/0xb0
online_pages+0x37/0x300
memory_subsys_online+0x17d/0x1c0
device_online+0x60/0x80
state_store+0x65/0xd0
kernfs_fop_write+0xcf/0x1c0
vfs_write+0xdb/0x1d0
ksys_write+0x65/0xe0
do_syscall_64+0x5c/0xa0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#241){++++}:
check_prev_add+0x98/0xa40
validate_chain+0x576/0x860
__lock_acquire+0x39c/0x790
lock_acquire+0xa2/0x1b0
__kernfs_remove+0x25f/0x2e0
kernfs_remove_by_name_ns+0x41/0x80
remove_files.isra.0+0x30/0x70
sysfs_remove_group+0x3d/0x80
sysfs_remove_groups+0x29/0x40
device_remove_attrs+0x39/0x70
device_del+0x16a/0x3f0
device_unregister+0x16/0x60
remove_memory_block_devices+0x82/0xb0
try_remove_memory+0xb5/0x130
remove_memory+0x26/0x40
dev_dax_kmem_remove+0x44/0x6a [kmem]
device_release_driver_internal+0xe4/0x1c0
unbind_store+0xef/0x120
kernfs_fop_write+0xcf/0x1c0
vfs_write+0xdb/0x1d0
ksys_write+0x65/0xe0
do_syscall_64+0x5c/0xa0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Chain exists of:
kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(mem_hotplug_lock.rw_sem);
lock(cpu_hotplug_lock.rw_sem);
lock(mem_hotplug_lock.rw_sem);
lock(kn->count#241);
*** DEADLOCK ***
No fixes tag as this has been a long standing issue that predated the
addition of kernfs lockdep annotations.
Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we get here after successfully adding page to list, err would be 1 to
indicate the page is queued in the list.
Current code has two problems:
* on success, 0 is not returned
* on error, if add_page_for_migratioin() return 1, and the following err1
from do_move_pages_to_node() is set, the err1 is not returned since err
is 1
And these behaviors break the user interface.
Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
Fixes: e0153fc2c7 ("mm: move_pages: return valid node id in status if the page is already on the target node").
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit ba72b4c8cf ("mm/sparsemem: support sub-section hotplug"),
when a mem section is fully deactivated, section_mem_map still records
the section's start pfn, which is not used any more and will be
reassigned during re-addition.
In analogy with alloc/free pattern, it is better to clear all fields of
section_mem_map.
Beside this, it breaks the user space tool "makedumpfile" [1], which
makes assumption that a hot-removed section has mem_map as NULL, instead
of checking directly against SECTION_MARKED_PRESENT bit. (makedumpfile
will be better to change the assumption, and need a patch)
The bug can be reproduced on IBM POWERVM by "drmgr -c mem -r -q 5" ,
trigger a crash, and save vmcore by makedumpfile
[1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")
Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
What we are trying to do is change the '=' character to a NUL terminator
and then at the end of the function we restore it back to an '='. The
problem is there are two error paths where we jump to the end of the
function before we have replaced the '=' with NUL.
We end up putting the '=' in the wrong place (possibly one element
before the start of the buffer).
Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
Fixes: 095f1fc4eb ("mempolicy: rework shmem mpol parsing and display")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown. So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PPC: Bugfixes
x86:
* Support for mapping DAX areas with large nested page table entries.
* Cleanups and bugfixes here too. A particularly important one is
a fix for FPU load when the thread has TIF_NEED_FPU_LOAD. There is
also a race condition which could be used in guest userspace to exploit
the guest kernel, for which the embargo expired today.
* Fast path for IPI delivery vmexits, shaving about 200 clock cycles
from IPI latency.
* Protect against "Spectre-v1/L1TF" (bring data in the cache via
speculative out of bound accesses, use L1TF on the sibling hyperthread
to read it), which unfortunately is an even bigger whack-a-mole game
than SpectreV1.
Sean continues his mission to rewrite KVM. In addition to a sizable
number of x86 patches, this time he contributed a pretty large refactoring
of vCPU creation that affects all architectures but should not have any
visible effect.
s390 will come next week together with some more x86 patches.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJeMxtCAAoJEL/70l94x66DQxIIAJv9hMmXLQHGFnUMskjGErR6
DCLSC0YRdRMwE50CerblyJtGsMwGsPyHZwvZxoAceKJ9w0Yay9cyaoJ87ItBgHoY
ce0HrqIUYqRSJ/F8WH2lSzkzMBr839rcmqw8p1tt4D5DIsYnxHGWwRaaP+5M/1KQ
YKFu3Hea4L00U339iIuDkuA+xgz92LIbsn38svv5fxHhPAyWza0rDEYHNgzMKuoF
IakLf5+RrBFAh6ZuhYWQQ44uxjb+uQa9pVmcqYzzTd5t1g4PV5uXtlJKesHoAvik
Eba8IEUJn+HgQJjhp3YxQYuLeWOwRF3bwOiZ578MlJ4OPfYXMtbdlqCQANHOcGk=
=H/q1
-----END PGP SIGNATURE-----
Merge tag 'kvm-5.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"This is the first batch of KVM changes.
ARM:
- cleanups and corner case fixes.
PPC:
- Bugfixes
x86:
- Support for mapping DAX areas with large nested page table entries.
- Cleanups and bugfixes here too. A particularly important one is a
fix for FPU load when the thread has TIF_NEED_FPU_LOAD. There is
also a race condition which could be used in guest userspace to
exploit the guest kernel, for which the embargo expired today.
- Fast path for IPI delivery vmexits, shaving about 200 clock cycles
from IPI latency.
- Protect against "Spectre-v1/L1TF" (bring data in the cache via
speculative out of bound accesses, use L1TF on the sibling
hyperthread to read it), which unfortunately is an even bigger
whack-a-mole game than SpectreV1.
Sean continues his mission to rewrite KVM. In addition to a sizable
number of x86 patches, this time he contributed a pretty large
refactoring of vCPU creation that affects all architectures but should
not have any visible effect.
s390 will come next week together with some more x86 patches"
* tag 'kvm-5.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
x86/KVM: Clean up host's steal time structure
x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed
x86/kvm: Cache gfn to pfn translation
x86/kvm: Introduce kvm_(un)map_gfn()
x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit
KVM: PPC: Book3S PR: Fix -Werror=return-type build failure
KVM: PPC: Book3S HV: Release lock on page-out failure path
KVM: arm64: Treat emulated TVAL TimerValue as a signed 32-bit integer
KVM: arm64: pmu: Only handle supported event counters
KVM: arm64: pmu: Fix chained SW_INCR counters
KVM: arm64: pmu: Don't mark a counter as chained if the odd one is disabled
KVM: arm64: pmu: Don't increment SW_INCR if PMCR.E is unset
KVM: x86: Use a typedef for fastop functions
KVM: X86: Add 'else' to unify fastop and execute call path
KVM: x86: inline memslot_valid_for_gpte
KVM: x86/mmu: Use huge pages for DAX-backed files
KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte()
KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
KVM: x86/mmu: Zap any compound page when collapsing sptes
KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)
...
From Boris Ostrovsky:
The KVM hypervisor may provide a guest with ability to defer remote TLB
flush when the remote VCPU is not running. When this feature is used,
the TLB flush will happen only when the remote VPCU is scheduled to run
again. This will avoid unnecessary (and expensive) IPIs.
Under certain circumstances, when a guest initiates such deferred action,
the hypervisor may miss the request. It is also possible that the guest
may mistakenly assume that it has already marked remote VCPU as needing
a flush when in fact that request had already been processed by the
hypervisor. In both cases this will result in an invalid translation
being present in a vCPU, potentially allowing accesses to memory locations
in that guest's address space that should not be accessible.
Note that only intra-guest memory is vulnerable.
The five patches address both of these problems:
1. The first patch makes sure the hypervisor doesn't accidentally clear
a guest's remote flush request
2. The rest of the patches prevent the race between hypervisor
acknowledging a remote flush request and guest issuing a new one.
Conflicts:
arch/x86/kvm/x86.c [move from kvm_arch_vcpu_free to kvm_arch_vcpu_destroy]
This small series revises the names in mmu_notifier to make the code
clearer and more readable.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl4wf2EACgkQOG33FX4g
mxqrdw//XIexbXQqP4dUKFCFeI7Um6ZqYE6iVCQi6JEetpKxCR8BSrJsq6EP60Mg
cVCKolISuudzOccz/liotg9SrwRlcO3mzucd8LJZG0v2FZMzQr0EKjst0RC4/xvK
U2RxGvwLQ+XVR/3/l6hXyWyw7u28+F1RsfQMMX3kqR3qlcQachQ3k7oUINDIq2XH
JkQcBV+XK0doXEp6VCCVKwuwEN7O5xSm8lAIHDNFZEEPre0iKxwatgWxdXFIWQek
tRywwB7bRzFROBlDcoOQ0GDTqScr3bghz6vWU4GGv3avYkystKwy44ha6BzO2xQc
ZNIo8AN9UFFhcmF531wklsXTCbxbxJAJAwdyIuQnKq5glw64EFnrjo2sxuL6s56h
C1GHADtxDccv+nr2sKP/rFFeq9K3VqHDtjEdBOhReuB0Vp1YfVr17A4R8yAn8A+1
vm3IusoOq+g8qMYxRHEb+76/S//joaxAlFQkU5Gjn/0xsykP99YQSQFBjXmkzWlS
IiHLf0HJiCCL8SHe4Wnyhyl1DUIIl38HQULqbFWZ8hK4ELhTd2KEuDxzT8q+v+v7
2M9nBVdRaw1kskGiFv+F7mb6c990CTEZO9B5fHpAjPRxeVkLYc06QfJY+hXbbu4c
6yzIvERRRlAviCmgb7G+3pLyBCKdvlIlCVsVOdxHXSRsl904BnA=
=hhT0
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull mmu_notifier updates from Jason Gunthorpe:
"This small series revises the names in mmu_notifier to make the code
clearer and more readable"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier
mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier
mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4yEegQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpn5ZD/4/WlXs2cUDgg1C65bzZFO4qvevm+VkXmsk
GbyrnFstRekvSH01/ZQxlyDVKS8Wux0XIJ6OArCh1047LvL1bEE5dvOW5iIiwa/r
grjQuwFAzIPsE2fgcAO17BKIUzq2Z96+hwDzH7dw0i32yBuLvNmY/1SxcCHKfPut
uzGyp7t3/2dIHbpWILRndMYe0O9j9ubmOMvKyKTwy723yDEafsUoqu2mlpigzTq4
2i+DbYBIAd8qmLqG/m3e+vOt9xodJ2Q0hlO+v6DcP2SKXU64Hb/N98HadR//aWP9
41DBXqs+dvDBcu3Jxb80PFUTiOQZECJivkns5cNcjuSXmNkOuQhDQR5K372AHmR9
m6e6FSBxwej8HselAZCI6yu9uBKd0i+MM4FnFs/O73QGYx2ayXsEXp/Jad9xiYgW
pC5XJTSqJQhPE0AYYEOzHPPcBLBcpvXHkvmGKdjkNb8OLhhgh2S/YG0DNC+8ABXr
j1uIe/n3kJEEmOanUyiitGyLmDq+mXd7aCVKJL/J0KiGD8Gkc1avAZ1ZrTQgjujY
FqqBFawO8gv3g0L4WMI8JI+HJGMnA488obet6UKm9+l/Z/urEpXzDAKf/W/vnx2B
LD0FSA0bCh1tyO6JU+avFwHlwShtV7/rx/OhrmCK7CCYKtZCA2IEctxyr8U+PBIv
DtwIMTYTsA==
=ZZUI
-----END PGP SIGNATURE-----
Merge tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Support for various new opcodes (fallocate, openat, close, statx,
fadvise, madvise, openat2, non-vectored read/write, send/recv, and
epoll_ctl)
- Faster ring quiesce for fileset updates
- Optimizations for overflow condition checking
- Support for max-sized clamping
- Support for probing what opcodes are supported
- Support for io-wq backend sharing between "sibling" rings
- Support for registering personalities
- Lots of little fixes and improvements
* tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits)
io_uring: add support for epoll_ctl(2)
eventpoll: support non-blocking do_epoll_ctl() calls
eventpoll: abstract out epoll_ctl() handler
io_uring: fix linked command file table usage
io_uring: support using a registered personality for commands
io_uring: allow registering credentials
io_uring: add io-wq workqueue sharing
io-wq: allow grabbing existing io-wq
io_uring/io-wq: don't use static creds/mm assignments
io-wq: make the io_wq ref counted
io_uring: fix refcounting with batched allocations at OOM
io_uring: add comment for drain_next
io_uring: don't attempt to copy iovec for READ/WRITE
io_uring: honor IOSQE_ASYNC for linked reqs
io_uring: prep req when do IOSQE_ASYNC
io_uring: use labeled array init in io_op_defs
io_uring: optimise sqe-to-req flags translation
io_uring: remove REQ_F_IO_DRAINED
io_uring: file switch work needs to get flushed on exit
io_uring: hide uring_fd in ctx
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4vDYkACgkQxWXV+ddt
WDsNJQ//WJEcYoRpN5Y7oOIk/vo5ulF68P3kUh3hl206A13xpaHorvTvZKAD5s2o
C6xACJk839sGEhMdDRWvdeBDCHTedMk7EXjiZ6kJD+7EPpWmDllI5O6DTolT7SR2
b9zId4KCO+m8LiLZccRsxCJbdkJ7nJnz2c5+063TjsS3uq1BFudctRUjW/XnFCCZ
JIE5iOkdXrA+bFqc+l2zKTwgByQyJg+hVKRTZEJBT0QZsyNQvHKzXAmXxGopW8bO
SeuzFkiFTA0raK8xBz6mUwaZbk40Qlzm9v9AitFZx0x2nvQnMu447N3xyaiuyDWd
Li1aMN0uFZNgSz+AemuLfG0Wj70x1HrQisEj958XKzn4cPpUuMcc3lr1PZ2NIX+C
p6pSgaLOEq8Rc0U78/euZX6oyiLJPAmQO1TdkVMHrcMi36esBI6uG11rds+U+xeK
XoP20qXLFVYLLrl3wH9F4yIzydfMYu66Us1AeRPRB14NSSa7tbCOG//aCafOoLM6
518sJCazSWlv1kDewK8dtLiXc8eM6XJN+KI4NygFZrUj2Rq376q5oovUUKKkn3iN
pdHtF/7gAxIx6bZ+jY/gyt/Xe5AdPi7sKggahvrSOL3X+LLINwC4r+vAnnpd6yh4
NfJj5fobvc/mO9PEVMwgJ8PmHw5uNqeMlORGjk7stQs7Oez3tCw=
=4OkE
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features, highlights:
- async discard
- "mount -o discard=async" to enable it
- freed extents are not discarded immediatelly, but grouped
together and trimmed later, with IO rate limiting
- the "sync" mode submits short extents that could have been
ignored completely by the device, for SATA prior to 3.1 the
requests are unqueued and have a big impact on performance
- the actual discard IO requests have been moved out of
transaction commit to a worker thread, improving commit latency
- IO rate and request size can be tuned by sysfs files, for now
enabled only with CONFIG_BTRFS_DEBUG as we might need to
add/delete the files and don't have a stable-ish ABI for
general use, defaults are conservative
- export device state info in sysfs, eg. missing, writeable
- no discard of extents known to be untouched on disk (eg. after
reservation)
- device stats reset is logged with process name and PID that called
the ioctl
Fixes:
- fix missing hole after hole punching and fsync when using NO_HOLES
- writeback: range cyclic mode could miss some dirty pages and lead
to OOM
- two more corner cases for metadata_uuid change after power loss
during the change
- fix infinite loop during fsync after mix of rename operations
Core changes:
- qgroup assign returns ENOTCONN when quotas not enabled, used to
return EINVAL that was confusing
- device closing does not need to allocate memory anymore
- snapshot aware code got removed, disabled for years due to
performance problems, reimplmentation will allow to select wheter
defrag breaks or does not break COW on shared extents
- tree-checker:
- check leaf chunk item size, cross check against number of
stripes
- verify location keys for DIR_ITEM, DIR_INDEX and XATTR items
- new self test for physical -> logical mapping code, used for super
block range exclusion
- assertion helpers/macros updated to avoid objtool "unreachable
code" reports on older compilers or config option combinations"
* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
btrfs: free block groups after free'ing fs trees
btrfs: Fix split-brain handling when changing FSID to metadata uuid
btrfs: Handle another split brain scenario with metadata uuid feature
btrfs: Factor out metadata_uuid code from find_fsid.
btrfs: Call find_fsid from find_fsid_inprogress
Btrfs: fix infinite loop during fsync after rename operations
btrfs: set trans->drity in btrfs_commit_transaction
btrfs: drop log root for dropped roots
btrfs: sysfs, add devid/dev_state kobject and device attributes
btrfs: Refactor btrfs_rmap_block to improve readability
btrfs: Add self-tests for btrfs_rmap_block
btrfs: selftests: Add support for dummy devices
btrfs: Move and unexport btrfs_rmap_block
btrfs: separate definition of assertion failure handlers
btrfs: device stats, log when stats are zeroed
btrfs: fix improper setting of scanned for range cyclic write cache pages
btrfs: safely advance counter when looking up bio csums
btrfs: remove unused member btrfs_device::work
btrfs: remove unnecessary wrapper get_alloc_profile
btrfs: add correction to handle -1 edge case in async discard
...
Pull misc x86 updates from Ingo Molnar:
"Misc changes:
- Enhance #GP fault printouts by distinguishing between canonical and
non-canonical address faults, and also add KASAN fault decoding.
- Fix/enhance the x86 NMI handler by putting the duration check into
a direct function call instead of an irq_work which we know to be
broken in some cases.
- Clean up do_general_protection() a bit"
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Remove irq_work from the long duration NMI handler
x86/traps: Cleanup do_general_protection()
x86/kasan: Print original address on #GP
x86/dumpstack: Introduce die_addr() for die() with #GP fault address
x86/traps: Print address on #GP
x86/insn-eval: Add support for 64-bit kernel mode
Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:
- Cleanup of the GOP [graphics output] handling code in the EFI stub
- Complete refactoring of the mixed mode handling in the x86 EFI stub
- Overhaul of the x86 EFI boot/runtime code
- Increase robustness for mixed mode code
- Add the ability to disable DMA at the root port level in the EFI
stub
- Get rid of RWX mappings in the EFI memory map and page tables,
where possible
- Move the support code for the old EFI memory mapping style into its
only user, the SGI UV1+ support code.
- plus misc fixes, updates, smaller cleanups.
... and due to interactions with the RWX changes, another round of PAT
cleanups make a guest appearance via the EFI tree - with no side
effects intended"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
efi/x86: Disable instrumentation in the EFI runtime handling code
efi/libstub/x86: Fix EFI server boot failure
efi/x86: Disallow efi=old_map in mixed mode
x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
efi/x86: avoid KASAN false positives when accessing the 1: 1 mapping
efi: Fix handling of multiple efi_fake_mem= entries
efi: Fix efi_memmap_alloc() leaks
efi: Add tracking for dynamically allocated memmaps
efi: Add a flags parameter to efi_memory_map
efi: Fix comment for efi_mem_type() wrt absent physical addresses
efi/arm: Defer probe of PCIe backed efifb on DT systems
efi/x86: Limit EFI old memory map to SGI UV machines
efi/x86: Avoid RWX mappings for all of DRAM
efi/x86: Don't map the entire kernel text RW for mixed mode
x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
efi/libstub/x86: Fix unused-variable warning
efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode
efi/libstub/x86: Use const attribute for efi_is_64bit()
efi: Allow disabling PCI busmastering on bridges during boot
efi/x86: Allow translating 64-bit arguments for mixed mode calls
...
- Rework the smp function call core code to avoid the allocation of an
additional cpumask.
- Remove the not longer required GFP argument from on_each_cpu_cond() and
on_each_cpu_cond_mask() and fixup the callers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vcrATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocr1D/4ptWrZKsgBxGKBP34lvJAjd0KRqVoz
J9dLAN+AAs6YZSnOmRBX1b9d9IL2PrccOEF+J/Ja3ZkB+PAoAQ9W3uCHkZ77WUph
xx5eJahZCo+3nZ6amGgS2cPdG8WjxSK3enxPcU4pJhV/QaaP7R9BZt5YQgreYAQO
kRi0qyt10AExLqLd+077GX5DKcEOXwwVG/qckUQK2h8Kkd68vTbjDxggvsHwmpSE
MHaszv85UpE+YQbT6DyG5Hi4kK3AJeODBy/fKr2VODIBLZpKiuQ5kK4lbNHYPpVB
wXw0umXHLQggrKoPKo58ayoCXD0bAG9JT0rvapjUJIz1/9YejQ6lB/t5f0dPbSrU
al4CJq/pfNky4H6uLWFVbAXJabJuBcB/eG1csaM88Yw0pEXkbnHCOkJAdosoDhhl
qNQYg4yaE9tTuy1chXDMntH0R0Qztqry6+DMsczJxT21TgERsHCRJV+mGLV46/ZN
GXJEoJ/cnjNJlqj8GirjbksPRbxuvmQNHRVrTh8qOSxbPKUQZfZocp9HHNmFsBaN
Q07VgWMHXzYj1L4r3cbJ/ONpOCo66lw7F//MNGk0eIWdeL6H7XZvJQPX+YUrLsZc
tVlZh8mZOGbRiM8g1dN0BSJO7QrVYmJWGb0oQQtv5tVSRN/V8Y9VZ8YX8lpYlF1e
ETkrZLGhTJWp4A==
=M4aK
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core SMP updates from Thomas Gleixner:
"A small set of SMP core code changes:
- Rework the smp function call core code to avoid the allocation of
an additional cpumask
- Remove the not longer required GFP argument from on_each_cpu_cond()
and on_each_cpu_cond_mask() and fixup the callers"
* tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Remove allocation mask from on_each_cpu_cond.*()
smp: Add a smp_cond_func_t argument to smp_call_function_many()
smp: Use smp_cond_func_t as type for the conditional function
- Time namespace support:
If a container migrates from one host to another then it expects that
clocks based on MONOTONIC and BOOTTIME are not subject to
disruption. Due to different boot time and non-suspended runtime these
clocks can differ significantly on two hosts, in the worst case time
goes backwards which is a violation of the POSIX requirements.
The time namespace addresses this problem. It allows to set offsets for
clock MONOTONIC and BOOTTIME once after creation and before tasks are
associated with the namespace. These offsets are taken into account by
timers and timekeeping including the VDSO.
Offsets for wall clock based clocks (REALTIME/TAI) are not provided by
this mechanism. While in theory possible, the overhead and code
complexity would be immense and not justified by the esoteric potential
use cases which were discussed at Plumbers '18.
The overhead for tasks in the root namespace (host time offsets = 0) is
in the noise and great effort was made to ensure that especially in the
VDSO. If time namespace is disabled in the kernel configuration the
code is compiled out.
Kudos to Andrei Vagin and Dmitry Sofanov who implemented this feature
and kept on for more than a year addressing review comments, finding
better solutions. A pleasant experience.
- Overhaul of the alarmtimer device dependency handling to ensure that
the init/suspend/resume ordering is correct.
- A new clocksource/event driver for Microchip PIT64
- Suspend/resume support for the Hyper-V clocksource
- The usual pile of fixes, updates and improvements mostly in the
driver code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vbTcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXT2D/96iJ3G9Snn2khEQP3XS2rYmtDGw7NO
m1n96falwWeGe6zreU80R2Jge5nLxQtNhRoMPLLee1GpHwRC6lvqEqgdZ4LMBrD2
JqV7Gzg8Urmdh+hpDsyTCpeEWEzoMKxiFOX8PxwctqUhM4szEe5iQg2YQsg85Jw2
vG6M93N2xwDILh4rhEMbKjo+5ZmYn7c1RQvpGOSmpKOj940W/N7H2HBsFhdaJ1Kw
FW5pFv1211PaU5RV2YNb2dMeeMTT1N3e2VN4Dkadoxp47pb+725gNHEBEjmV9poG
Lp4IhzGAPnj8zVD88icQZSTaK3gUHMClxprJ0Pf84WEtiH7SeGu8BPYyu77+oNDe
yzcctDJNyCWXkzmaP/fe/HLc0TStbvNAJ5Tagp4BC75gzebeb4/n8RtRT0fKeDYL
pxpDPKDAPU7p1JSjxiWAtshqjBycWNY3Z49bA7/VhKBhnv8BDyBPGlYd7/4xrbGr
RK7DQNXJwaJaiNJ7p5PiaFxGzNyB0B9sThD/slSlEInIKb4h9YzWr0TV+NB62VnB
sDcN+tpLbRPz5/5cHGGfxR0+zKWpfyai8pzbmmaXEaKssjRYwyvcac5EZdgbWpbK
k7CqAjoWLA2P+tGeePNJOf5JYK6Vmdyh4clmuwM0zOiRJ9NlWUyMf3z7QYILs4RO
UAI+6opYlZEPAw==
=x3qT
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The timekeeping and timers departement provides:
- Time namespace support:
If a container migrates from one host to another then it expects
that clocks based on MONOTONIC and BOOTTIME are not subject to
disruption. Due to different boot time and non-suspended runtime
these clocks can differ significantly on two hosts, in the worst
case time goes backwards which is a violation of the POSIX
requirements.
The time namespace addresses this problem. It allows to set offsets
for clock MONOTONIC and BOOTTIME once after creation and before
tasks are associated with the namespace. These offsets are taken
into account by timers and timekeeping including the VDSO.
Offsets for wall clock based clocks (REALTIME/TAI) are not provided
by this mechanism. While in theory possible, the overhead and code
complexity would be immense and not justified by the esoteric
potential use cases which were discussed at Plumbers '18.
The overhead for tasks in the root namespace (ie where host time
offsets = 0) is in the noise and great effort was made to ensure
that especially in the VDSO. If time namespace is disabled in the
kernel configuration the code is compiled out.
Kudos to Andrei Vagin and Dmitry Sofanov who implemented this
feature and kept on for more than a year addressing review
comments, finding better solutions. A pleasant experience.
- Overhaul of the alarmtimer device dependency handling to ensure
that the init/suspend/resume ordering is correct.
- A new clocksource/event driver for Microchip PIT64
- Suspend/resume support for the Hyper-V clocksource
- The usual pile of fixes, updates and improvements mostly in the
driver code"
* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
alarmtimer: Use wakeup source from alarmtimer platform device
alarmtimer: Make alarmtimer platform device child of RTC device
alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
hrtimer: Add missing sparse annotation for __run_timer()
lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
clocksource/drivers/exynos_mct: Rename Exynos to lowercase
clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
clocksource/drivers/bcm2835_timer: Fix memory leak of timer
clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
...
Pull cgroup updates from Tejun Heo:
- cgroup2 interface for hugetlb controller. I think this was the last
remaining bit which was missing from cgroup2
- fixes for race and a spurious warning in threaded cgroup handling
- other minor changes
* 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
iocost: Fix iocost_monitor.py due to helper type mismatch
cgroup: Prevent double killing of css when enabling threaded cgroup
cgroup: fix function name in comment
mm: hugetlb controller for cgroups v2
Add a helper, is_transparent_hugepage(), to explicitly check whether a
compound page is a THP and use it when populating KVM's secondary MMU.
The explicit check fixes a bug where a remapped compound page, e.g. for
an XDP Rx socket, is mapped into a KVM guest and is mistaken for a THP,
which results in KVM incorrectly creating a huge page in its secondary
MMU.
Fixes: 936a5fe6e6 ("thp: kvm mmu transparent hugepage support")
Reported-by: syzbot+c9d1fb51ac9d0d10c39d@syzkaller.appspotmail.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The allocation mask is no longer used by on_each_cpu_cond() and
on_each_cpu_cond_mask() and can be removed.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200117090137.1205765-4-bigeasy@linutronix.de
If archs don't have memmove then the C implementation from lib/string.c is used,
and then it's instrumented by compiler. So there is no need to add KASAN's
memmove to manual checks.
Signed-off-by: Nick Hu <nickhu@andestech.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
This is in preparation for enabling this functionality through io_uring.
Add a helper that is just exporting what sys_madvise() does, and have the
system call use it.
No functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bitmaps are fairly popular for their space efficiency, but we don't have
generic iterators available. Make percpu's bitmap region iterators
available to everyone.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
TTM graphics buffer objects may, transparently to user-space, move
between IO and system memory. When that happens, all PTEs pointing to the
old location are zapped before the move and then faulted in again if
needed. When that happens, the page protection caching mode- and
encryption bits may change and be different from those of
struct vm_area_struct::vm_page_prot.
We were using an ugly hack to set the page protection correctly.
Fix that and instead export and use vmf_insert_mixed_prot() or use
vmf_insert_pfn_prot().
Also get the default page protection from
struct vm_area_struct::vm_page_prot rather than using vm_get_page_prot().
This way we catch modifications done by the vm system for drivers that
want write-notification.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
The TTM module today uses a hack to be able to set a different page
protection than struct vm_area_struct::vm_page_prot. To be able to do
this properly, add the needed vm functionality as vmf_insert_mixed_prot().
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
The 'interval_sub' is placed on the 'notifier_subscriptions' interval
tree.
This eliminates the poor name 'mni' for this variable.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The 'subscription' is placed on the 'notifier_subscriptions' list.
This eliminates the poor name 'mn' for this variable.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
and is difficult to abbreviate. The struct is actually holding the
interval tree and hlist containing the notifiers subscribed to a mm.
Use 'subscriptions' as the variable name for this struct instead of the
really terrible and misleading 'mmn_mm'.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page.
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-25-dima@arista.com
When booting with amd_iommu=off, the following WARNING message
appears:
AMD-Vi: AMD IOMMU disabled on kernel command-line
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
RIP: 0010:flush_workqueue+0x42e/0x450
Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
Call Trace:
kmem_cache_destroy+0x69/0x260
iommu_go_to_state+0x40c/0x5ab
amd_iommu_prepare+0x16/0x2a
irq_remapping_prepare+0x36/0x5f
enable_IR_x2apic+0x21/0x172
default_setup_apic_routing+0x12/0x6f
apic_intr_mode_init+0x1a1/0x1f1
x86_late_time_init+0x17/0x1c
start_kernel+0x480/0x53f
secondary_startup_64+0xb6/0xc0
---[ end trace 30894107c3749449 ]---
x2apic: IRQ remapping doesn't support X2APIC mode
x2apic disabled
The warning is caused by the calling of 'kmem_cache_destroy()'
in free_iommu_resources(). Here is the call path:
free_iommu_resources
kmem_cache_destroy
flush_memcg_workqueue
flush_workqueue
The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, which the variable 'wq_online' is still 'false'. This leads
to the statement 'if (WARN_ON(!wq_online))' in flush_workqueue() is
'true'.
Since the variable 'memcg_kmem_cache_wq' is not allocated during the
time, it is unnecessary to call flush_memcg_workqueue(). This prevents
the WARNING message triggered by flush_workqueue().
Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f6d ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Xiaochun Lee <lixc17@lenovo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use div64_ul() instead of do_div() if the divisor is unsigned long, to
avoid truncation to 32-bit on 64-bit platforms.
Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The two variables 'numerator' and 'denominator', though they are
declared as long, they should actually be unsigned long (according to
the implementation of the fprop_fraction_percpu() function)
And do_div() does a 64-by-32 division, while the divisor 'denominator'
is unsigned long, thus 64-bit on 64-bit platforms. Hence the proper
function to call is div64_ul().
Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "use div64_ul() instead of div_u64() if the divisor is
unsigned long".
We were first inspired by commit b0ab99e773 ("sched: Fix possible divide
by zero in avg_atom () calculation"), then refer to the recently analyzed
mm code, we found this suspicious place.
201 if (min) {
202 min *= this_bw;
203 do_div(min, tot_bw);
204 }
And we also disassembled and confirmed it:
/usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
0xffffffff811c37da <__wb_calc_thresh+234>: xor %r10d,%r10d
0xffffffff811c37dd <__wb_calc_thresh+237>: test %rax,%rax
0xffffffff811c37e0 <__wb_calc_thresh+240>: je 0xffffffff811c3800 <__wb_calc_thresh+272>
/usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
0xffffffff811c37e2 <__wb_calc_thresh+242>: imul %r8,%rax
/usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
0xffffffff811c37e6 <__wb_calc_thresh+246>: mov %r9d,%r10d ---> truncates it to 32 bits here
0xffffffff811c37e9 <__wb_calc_thresh+249>: xor %edx,%edx
0xffffffff811c37eb <__wb_calc_thresh+251>: div %r10
0xffffffff811c37ee <__wb_calc_thresh+254>: imul %rbx,%rax
0xffffffff811c37f2 <__wb_calc_thresh+258>: shr $0x2,%rax
0xffffffff811c37f6 <__wb_calc_thresh+262>: mul %rcx
0xffffffff811c37f9 <__wb_calc_thresh+265>: shr $0x2,%rdx
0xffffffff811c37fd <__wb_calc_thresh+269>: mov %rdx,%r10
This series uses div64_ul() instead of div_u64() if the divisor is
unsigned long, to avoid truncation to 32-bit on 64-bit platforms.
This patch (of 3):
The variables 'min' and 'max' are unsigned long and do_div truncates
them to 32 bits, which means it can test non-zero and be truncated to
zero for division. Fix this issue by using div64_ul() instead.
Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
Fixes: 693108a8a6 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 96a2b03f28 ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled. It relied on the
assumption that jump_label_init() is called before parse_early_param()
as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
it is safe to enable the static key.
However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch(). x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g. ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.
To fix this without tricky changes to init code of multiple
architectures, this patch partially reverts the static key conversion
from 96a2b03f28. Init-time and non-fastpath calls (such as in arch
code) of debug_pagealloc_enabled() will again test a simple bool
variable. Fastpath mm code is converted to a new
debug_pagealloc_enabled_static() variant that relies on the static key,
which is enabled in a well-defined point in mm_init() where it's
guaranteed that jump_label_init() has been called, regardless of
architecture.
[sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f28 ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently slab percpu vmstats are flushed twice: during the memcg
offlining and just before freeing the memcg structure. Each time percpu
counters are summed, added to the atomic counterparts and propagated up
by the cgroup tree.
The second flushing is required due to how recursive vmstats are
implemented: counters are batched in percpu variables on a local level,
and once a percpu value is crossing some predefined threshold, it spills
over to atomic values on the local and each ascendant levels. It means
that without flushing some numbers cached in percpu variables will be
dropped on floor each time a cgroup is destroyed. And with uptime the
error on upper levels might become noticeable.
The first flushing aims to make counters on ancestor levels more
precise. Dying cgroups may resume in the dying state for a long time.
After kmem_cache reparenting which is performed during the offlining
slab counters of the dying cgroup don't have any chances to be updated,
because any slab operations will be performed on the parent level. It
means that the inaccuracy caused by percpu batching will not decrease up
to the final destruction of the cgroup. By the original idea flushing
slab counters during the offlining should minimize the visible
inaccuracy of slab counters on the parent level.
The problem is that percpu counters are not zeroed after the first
flushing. So every cached percpu value is summed twice. It creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on parent cgroup level. After creating and destroying of thousands of
child cgroups, slab counter on parent level can be way off the real
value.
For now, let's just stop flushing slab counters on memcg offlining. It
can't be done correctly without scheduling a work on each cpu: reading
and zeroing it during css offlining can race with an asynchronous
update, which doesn't expect values to be changed underneath.
With this change, slab counters on parent level will become eventually
consistent. Once all dying children are gone, values are correct. And
if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
It's not perfect, as slab are reparented, so any updates after the
reparenting will happen on the parent level. It means that if a slab
page was allocated, a counter on child level was bumped, then the page
was reparented and freed, the annihilation of positive and negative
counter values will not happen until the child cgroup is released. It
makes slab counters different from others, and it might want us to
implement flushing in a correct form again. But it's also a question of
performance: scheduling a work on each cpu isn't free, and it's an open
question if the benefit of having more accurate counters is worth it.
We might also consider flushing all counters on offlining, not only slab
counters.
So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely the number of created and destroyed cgroups). And think about
the accuracy of counters separately.
Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
enabled. But it doesn't work well with above-47bit hint address.
Normally, the kernel doesn't create userspace mappings above 47-bit,
even if the machine allows this (such as with 5-level paging on x86-64).
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information.
Userspace can ask for allocation from full address space by specifying
hint address (with or without MAP_FIXED) above 47-bits. If the
application doesn't need a particular address, but wants to allocate
from whole address space it can specify -1 as a hint address.
Unfortunately, this trick breaks THP alignment in shmem/tmp:
shmem_get_unmapped_area() would not try to allocate PMD-aligned area if
*any* hint address specified.
This can be fixed by requesting the aligned area if the we failed to
allocated at user-specified hint address. The request with inflated
length will also take the user-specified hint address. This way we will
not lose an allocation request from the full address space.
[kirill@shutemov.name: fold in a fixup]
Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
Fixes: b569bab78d ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Willhalm, Thomas" <thomas.willhalm@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Fix two above-47bit hint address vs. THP bugs".
The two get_unmapped_area() implementations have to be fixed to provide
THP-friendly mappings if above-47bit hint address is specified.
This patch (of 2):
Filesystems use thp_get_unmapped_area() to provide THP-friendly
mappings. For DAX in particular.
Normally, the kernel doesn't create userspace mappings above 47-bit,
even if the machine allows this (such as with 5-level paging on x86-64).
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information.
Userspace can ask for allocation from full address space by specifying
hint address (with or without MAP_FIXED) above 47-bits. If the
application doesn't need a particular address, but wants to allocate
from whole address space it can specify -1 as a hint address.
Unfortunately, this trick breaks thp_get_unmapped_area(): the function
would not try to allocate PMD-aligned area if *any* hint address
specified.
Modify the routine to handle it correctly:
- Try to allocate the space at the specified hint address with length
padding required for PMD alignment.
- If failed, retry without length padding (but with the same hint
address);
- If the returned address matches the hint address return it.
- Otherwise, align the address as required for THP and return.
The user specified hint address is passed down to get_unmapped_area() so
above-47bit hint address will be taken into account without breaking
alignment requirements.
Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
Fixes: b569bab78d ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Thomas Willhalm <thomas.willhalm@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we remove an early section, we don't free the usage map, as the
usage maps of other sections are placed into the same page. Once the
section is removed, it is no longer an early section (especially, the
memmap is freed). When we re-add that section, the usage map is reused,
however, it is no longer an early section. When removing that section
again, we try to kfree() a usage map that was allocated during early
boot - bad.
Let's check against PageReserved() to see if we are dealing with an
usage map that was allocated during boot. We could also check against
!(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved() is
cleaner.
Can be triggered using memtrace under ppc64/powernv:
$ mount -t debugfs none /sys/kernel/debug/
$ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
$ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
------------[ cut here ]------------
kernel BUG at mm/slub.c:3969!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=3D64K MMU=3DHash SMP NR_CPUS=3D2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
NIP kfree+0x338/0x3b0
LR section_deactivate+0x138/0x200
Call Trace:
section_deactivate+0x138/0x200
__remove_pages+0x114/0x150
arch_remove_memory+0x3c/0x160
try_remove_memory+0x114/0x1a0
__remove_memory+0x20/0x40
memtrace_enable_set+0x254/0x850
simple_attr_write+0x138/0x160
full_proxy_write+0x8c/0x110
__vfs_write+0x38/0x70
vfs_write+0x11c/0x2a0
ksys_write+0x84/0x140
system_call+0x5c/0x68
---[ end trace 4b053cbd84e0db62 ]---
The first invocation will offline+remove memory blocks. The second
invocation will first add+online them again, in order to offline+remove
them again (usually we are lucky and the exact same memory blocks will
get "reallocated").
Tested on powernv with boot memory: The usage map will not get freed.
Tested on x86-64 with DIMMs: The usage map will get freed.
Using Dynamic Memory under a Power DLAPR can trigger it easily.
Triggering removal (I assume after previously removed+re-added) of
memory from the HMC GUI can crash the kernel with the same call trace
and is fixed by this patch.
Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
Fixes: 326e1b8f83 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
THP page faults now attempt a __GFP_THISNODE allocation first, which
should only compact existing free memory, followed by another attempt
that can allocate from any node using reclaim/compaction effort
specified by global defrag setting and madvise.
This patch makes the following changes to the scheme:
- Before the patch, the first allocation relies on a check for
pageblock order and __GFP_IO to prevent excessive reclaim. This
however affects also the second attempt, which is not limited to
single node.
Instead of that, reuse the existing check for costly order
__GFP_NORETRY allocations, and make sure the first THP attempt uses
__GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
allocations will bail out if compaction needs reclaim, while
previously they only bailed out when compaction was deferred due to
previous failures.
This should be still acceptable within the __GFP_NORETRY semantics.
- Before the patch, the second allocation attempt (on all nodes) was
passing __GFP_NORETRY. This is redundant as the check for pageblock
order (discussed above) was stronger. It's also contrary to
madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
requested.
After this patch, the second attempt doesn't pass __GFP_THISNODE nor
__GFP_NORETRY.
To sum up, THP page faults now try the following attempts:
1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP
allocation from any node with effort determined by global defrag setting
and VMA madvise
3. fallback to base pages on any node
Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
Fixes: b39d0ee263 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ARMv8 64-bit architecture supports execute-only user permissions by
clearing the PTE_USER and PTE_UXN bits, practically making it a mostly
privileged mapping but from which user running at EL0 can still execute.
The downside, however, is that the kernel at EL1 inadvertently reading
such mapping would not trip over the PAN (privileged access never)
protection.
Revert the relevant bits from commit cab15ce604 ("arm64: Introduce
execute-only page access permissions") so that PROT_EXEC implies
PROT_READ (and therefore PTE_USER) until the architecture gains proper
support for execute-only user mappings.
Fixes: cab15ce604 ("arm64: Introduce execute-only page access permissions")
Cc: <stable@vger.kernel.org> # 4.9.x-
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following lockdep splat was observed when a certain hugetlbfs test
was run:
================================
WARNING: inconsistent lock state
4.18.0-159.el8.x86_64+debug #1 Tainted: G W --------- - -
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
{SOFTIRQ-ON-W} state was registered at:
lock_acquire+0x14f/0x3b0
_raw_spin_lock+0x30/0x70
__nr_hugepages_store_common+0x11b/0xb30
hugetlb_sysctl_handler_common+0x209/0x2d0
proc_sys_call_handler+0x37f/0x450
vfs_write+0x157/0x460
ksys_write+0xb8/0x170
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf
irq event stamp: 691296
hardirqs last enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
softirqs last enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(hugetlb_lock);
<Interrupt>
lock(hugetlb_lock);
*** DEADLOCK ***
:
Call Trace:
<IRQ>
__lock_acquire+0x146b/0x48c0
lock_acquire+0x14f/0x3b0
_raw_spin_lock+0x30/0x70
free_huge_page+0x36f/0xaa0
bio_check_pages_dirty+0x2fc/0x5c0
clone_endio+0x17f/0x670 [dm_mod]
blk_update_request+0x276/0xe50
scsi_end_request+0x7b/0x6a0
scsi_io_completion+0x1c6/0x1570
blk_done_softirq+0x22e/0x350
__do_softirq+0x23d/0xad8
irq_exit+0x23e/0x2b0
do_IRQ+0x11a/0x200
common_interrupt+0xf/0xf
</IRQ>
Both the hugetbl_lock and the subpool lock can be acquired in
free_huge_page(). One way to solve the problem is to make both locks
irq-safe. However, Mike Kravetz had learned that the hugetlb_lock is
held for a linear scan of ALL hugetlb pages during a cgroup reparentling
operation. So it is just too long to have irq disabled unless we can
break hugetbl_lock down into finer-grained locks with shorter lock hold
times.
Another alternative is to defer the freeing to a workqueue job. This
patch implements the deferred freeing by adding a free_hpage_workfn()
work function to do the actual freeing. The free_huge_page() call in a
non-task context saves the page to be freed in the hpage_freelist linked
list in a lockless manner using the llist APIs.
The generic workqueue is used to process the work, but a dedicated
workqueue can be used instead if it is desirable to have the huge page
freed ASAP.
Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of
llist APIs which simplfy the code.
Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the implementation of __gup_benchmark_ioctl() the allocated pages
should be released before returning in case of an invalid cmd. Release
pages via kvfree().
[akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1]
Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com
Fixes: 714a3a1eba ("mm/gup_benchmark.c: add additional pinning methods")
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
As everything else is printed in kB, I chose to fix the value rather than
the string.
Before:
[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
...
[ 1878] 1000 1878 217253 151144 1269760 0 0 python
...
Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0
After:
[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
...
[ 1436] 1000 1436 217253 151890 1294336 0 0 python
...
Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
Fixes: 70cb6d2677 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Edward Chron <echron@arista.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Felix Abecassis reports move_pages() would return random status if the
pages are already on the target node by the below test program:
int main(void)
{
const long node_id = 1;
const long page_size = sysconf(_SC_PAGESIZE);
const int64_t num_pages = 8;
unsigned long nodemask = 1 << node_id;
long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
if (ret < 0)
return (EXIT_FAILURE);
void **pages = malloc(sizeof(void*) * num_pages);
for (int i = 0; i < num_pages; ++i) {
pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
-1, 0);
if (pages[i] == MAP_FAILED)
return (EXIT_FAILURE);
}
ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
if (ret < 0)
return (EXIT_FAILURE);
int *nodes = malloc(sizeof(int) * num_pages);
int *status = malloc(sizeof(int) * num_pages);
for (int i = 0; i < num_pages; ++i) {
nodes[i] = node_id;
status[i] = 0xd0; /* simulate garbage values */
}
ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
printf("move_pages: %ld\n", ret);
for (int i = 0; i < num_pages; ++i)
printf("status[%d] = %d\n", i, status[i]);
}
Then running the program would return nonsense status values:
$ ./move_pages_bug
move_pages: 0
status[0] = 208
status[1] = 208
status[2] = 208
status[3] = 208
status[4] = 208
status[5] = 208
status[6] = 208
status[7] = 208
This is because the status is not set if the page is already on the
target node, but move_pages() should return valid status as long as it
succeeds. The valid status may be errno or node id.
We can't simply initialize status array to zero since the pages may be
not on node 0. Fix it by updating status with node id which the page is
already on.
Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: a49bd4d716 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Tested-by: Felix Abecassis <fabecassis@nvidia.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [4.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When zspage is migrated to the other zone, the zone page state should be
updated as well, otherwise the NR_ZSPAGE for each zone shows wrong
counts including proc/zoneinfo in practice.
Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com
Fixes: 91537fee00 ("mm: add NR_ZSMALLOC to vmstat")
Signed-off-by: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jinsuk Choi <jjinsuk.choi@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently try to shrink a single zone when removing memory. We use
the zone of the first page of the memory we are removing. If that
memmap was never initialized (e.g., memory was never onlined), we will
read garbage and can trigger kernel BUGs (due to a stale pointer):
BUG: unable to handle page fault for address: 000000000000353d
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:clear_zone_contiguous+0x5/0x10
Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
FS: 0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__remove_pages+0x4b/0x640
arch_remove_memory+0x63/0x8d
try_remove_memory+0xdb/0x130
__remove_memory+0xa/0x11
acpi_memory_device_remove+0x70/0x100
acpi_bus_trim+0x55/0x90
acpi_device_hotplug+0x227/0x3a0
acpi_hotplug_work_fn+0x1a/0x30
process_one_work+0x221/0x550
worker_thread+0x50/0x3b0
kthread+0x105/0x140
ret_from_fork+0x3a/0x50
Modules linked in:
CR2: 000000000000353d
Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone(() for that. We now
properly shrink the zones, even if we have DIMMs whereby
- Some memory blocks fall into no zone (never onlined)
- Some memory blocks fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones
Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().
Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make #GP exceptions caused by out-of-bounds KASAN shadow accesses easier
to understand by computing the address of the original access and
printing that. More details are in the comments in the patch.
This turns an error like this:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault, probably for non-canonical address
0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
into this:
general protection fault, probably for non-canonical address
0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: maybe wild-memory-access in range
[0x00badbeefbadbee8-0x00badbeefbadbeef]
The hook is placed in architecture-independent code, but is currently
only wired up to the X86 exception handler because I'm not sufficiently
familiar with the address space layout and exception handling mechanisms
on other architectures.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191218231150.12139-4-jannh@google.com
Since commit 0a432dcbeb ("mm: shrinker: make shrinker not depend on
memcg kmem"), shrinkers' idr is protected by CONFIG_MEMCG instead of
CONFIG_MEMCG_KMEM, so it makes no sense to protect shrinker idr replace
with CONFIG_MEMCG_KMEM.
And in the CONFIG_MEMCG && CONFIG_SLOB case, shrinker_idr contains only
shrinker, and it is deferred_split_shrinker. But it is never actually
called, since idr_replace() is never compiled due to the wrong #ifdef.
The deferred_split_shrinker all the time is staying in half-registered
state, and it's never called for subordinate mem cgroups.
Link: http://lkml.kernel.org/r/1575486978-45249-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: 0a432dcbeb ("mm: shrinker: make shrinker not depend on memcg kmem")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org> [5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzkaller and the fault injector showed that I was wrong to assume that
we could ignore percpu shadow allocation failures.
Handle failures properly. Merge all the allocated areas back into the
free list and release the shadow, then clean up and return NULL. The
shadow is released unconditionally, which relies upon the fact that the
release function is able to tolerate pages not being present.
Also clean up shadows in the recovery path - currently they are not
released, which leaks a bit of memory.
Link: http://lkml.kernel.org/r/20191205140407.1874-3-dja@axtens.net
Fixes: 3c5c3cfb9e ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Reported-by: syzbot+82e323920b78d54aaed5@syzkaller.appspotmail.com
Reported-by: syzbot+59b7daa4315e07a994f1@syzkaller.appspotmail.com
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
apply_to_page_range() takes an address range, and if any parts of it are
not covered by the existing page table hierarchy, it allocates memory to
fill them in.
In some use cases, this is not what we want - we want to be able to
operate exclusively on PTEs that are already in the tables.
Add apply_to_existing_page_range() for this. Adjust the walker
functions for apply_to_page_range to take 'create', which switches them
between the old and new modes.
This will be used in KASAN vmalloc.
[akpm@linux-foundation.org: reduce code duplication]
[akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/]
[akpm@linux-foundation.org: initialize __apply_to_page_range::err]
Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.net
Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_KASAN_VMALLOC=y any use of memory obtained via vm_map_ram()
will crash because there is no shadow backing that memory.
Instead of sprinkling additional kasan_populate_vmalloc() calls all over
the vmalloc code, move it into alloc_vmap_area(). This will fix
vm_map_ram() and simplify the code a bit.
[aryabinin@virtuozzo.com: v2]
Link: http://lkml.kernel.org/r/20191205095942.1761-1-aryabinin@virtuozzo.comLink: http://lkml.kernel.org/r/20191204204534.32202-1-aryabinin@virtuozzo.com
Fixes: 3c5c3cfb9e ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Alexander Potapenko <glider@google.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the effort of supporting cgroups v2 into Kubernetes, I stumped on
the lack of the hugetlb controller.
When the controller is enabled, it exposes four new files for each
hugetlb size on non-root cgroups:
- hugetlb.<hugepagesize>.current
- hugetlb.<hugepagesize>.max
- hugetlb.<hugepagesize>.events
- hugetlb.<hugepagesize>.events.local
The differences with the legacy hierarchy are in the file names and
using the value "max" instead of "-1" to disable a limit.
The file .limit_in_bytes is renamed to .max.
The file .usage_in_bytes is renamed to .current.
.failcnt is not provided as a single file anymore, but its value can
be read through the new flat-keyed files .events and .events.local,
through the "max" key.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Untangle the somewhat incestous way of how VMALLOC_START is used all across the
kernel, but is, on x86, defined deep inside one of the lowest level page table headers.
It doesn't help that vmalloc.h only includes a single asm header:
#include <asm/page.h> /* pgprot_t */
So there was no existing cross-arch way to decouple address layout
definitions from page.h details. I used this:
#ifndef VMALLOC_START
# include <asm/vmalloc.h>
#endif
This way every architecture that wants to simplify page.h can do so.
- Also on x86 we had a couple of LDT related inline functions that used
the late-stage address space layout positions - but these could be
uninlined without real trouble - the end result is cleaner this way as
well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.
Switch the pte_unmap_same() and SLUB code over to use CONFIG_PREEMPTION.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Chistoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20191015191821.11479-26-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge more updates from Andrew Morton:
"Most of the rest of MM and various other things. Some Kconfig rework
still awaits merges of dependent trees from linux-next.
Subsystems affected by this patch series: mm/hotfixes, mm/memcg,
mm/vmstat, mm/thp, procfs, sysctl, misc, notifiers, core-kernel,
bitops, lib, checkpatch, epoll, binfmt, init, rapidio, uaccess, kcov,
ubsan, ipc, bitmap, mm/pagemap"
* akpm: (86 commits)
mm: remove __ARCH_HAS_4LEVEL_HACK and include/asm-generic/4level-fixup.h
um: add support for folded p4d page tables
um: remove unused pxx_offset_proc() and addr_pte() functions
sparc32: use pgtable-nopud instead of 4level-fixup
parisc/hugetlb: use pgtable-nopXd instead of 4level-fixup
parisc: use pgtable-nopXd instead of 4level-fixup
nds32: use pgtable-nopmd instead of 4level-fixup
microblaze: use pgtable-nopmd instead of 4level-fixup
m68k: mm: use pgtable-nopXd instead of 4level-fixup
m68k: nommu: use pgtable-nopud instead of 4level-fixup
c6x: use pgtable-nopud instead of 4level-fixup
arm: nommu: use pgtable-nopud instead of 4level-fixup
alpha: use pgtable-nopud instead of 4level-fixup
gpio: pca953x: tighten up indentation
gpio: pca953x: convert to use bitmap API
gpio: pca953x: use input from regs structure in pca953x_irq_pending()
gpio: pca953x: remove redundant variable and check in IRQ handler
lib/bitmap: introduce bitmap_replace() helper
lib/test_bitmap: fix comment about this file
lib/test_bitmap: move exp1 and exp2 upper for others to use
...
There are no architectures that use include/asm-generic/4level-fixup.h
therefore it can be removed along with __ARCH_HAS_4LEVEL_HACK define.
Link: http://lkml.kernel.org/r/1572938135-31886-14-git-send-email-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anatoly Pugachev <matorola@gmail.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Peter Rosin <peda@axentia.se>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Sam Creasey <sammy@sammy.net>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For hugely mapped thp, we use is_huge_zero_pmd() to check if it's zero
page or not.
We do fill ptes with my_zero_pfn() when we split zero thp pmd, but this
is not what we have in vm_normal_page_pmd() -- pmd_trans_huge_lock()
makes sure of it.
This is a trivial fix for /proc/pid/numa_maps, and AFAIK nobody
complains about it.
Gerald Schaefer asked:
: Maybe the description could also mention the symptom of this bug?
: I would assume that it affects anon/dirty accounting in gather_pte_stats(),
: for huge mappings, if zero page mappings are not correctly recognized.
I came across this while I was looking at the code, so I'm not aware of
any symptom.
Link: http://lkml.kernel.org/r/20191108192629.201556-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use common names from vmstat array when possible. This gives not much
difference in code size for now, but should help in keeping interfaces
consistent.
add/remove: 0/2 grow/shrink: 2/0 up/down: 70/-72 (-2)
Function old new delta
memory_stat_format 984 1050 +66
memcg_stat_show 957 961 +4
memcg1_event_names 32 - -32
mem_cgroup_lru_names 40 - -40
Total: Before=14485337, After=14485335, chg -0.00%
Link: http://lkml.kernel.org/r/157113012508.453.80391533767219371.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Statistics in vmstat is combined from counters with different structure,
but names for them are merged into one array.
This patch adds trivial helpers to get name for each item:
const char *zone_stat_name(enum zone_stat_item item);
const char *numa_stat_name(enum numa_stat_item item);
const char *node_stat_name(enum node_stat_item item);
const char *writeback_stat_name(enum writeback_stat_item item);
const char *vm_event_name(enum vm_event_item item);
Names for enum writeback_stat_item are folded in the middle of
vmstat_text so this patch moves declaration into header to calculate
offset of following items.
Also this patch reuses piece of node stat names for lru list names:
const char *lru_list_name(enum lru_list lru);
This returns common lru list names: "inactive_anon", "active_anon",
"inactive_file", "active_file", "unevictable".
[khlebnikov@yandex-team.ru: do not use size of vmstat_text as count of /proc/vmstat items]
Link: http://lkml.kernel.org/r/157152151769.4139.15423465513138349343.stgit@buzz
Link: https://lore.kernel.org/linux-mm/cd1c42ae-281f-c8a8-70ac-1d01d417b2e1@infradead.org/T/#u
Link: http://lkml.kernel.org/r/157113012325.453.562783073839432766.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christian reported a warning like the following obtained during running
some KVM-related tests on s390:
WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
Hardware name: IBM 2964 NC9 712 (LPAR)
Workqueue: events sysfs_slab_remove_workfn
Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0
0000001529746848: 47000700 bc 0,1792
#000000152974684c: a7f40001 brc 15,152974684e
>0000001529746850: a7f4fff2 brc 15,1529746834
0000001529746854: 0707 bcr 0,%r7
0000001529746856: 0707 bcr 0,%r7
0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15)
000000152974685e: a738ffff lhi %r3,-1
Call Trace:
([<000003e009263d00>] 0x3e009263d00)
[<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
[<0000001529b04882>] kobject_put+0xaa/0xe8
[<000000152918cf28>] process_one_work+0x1e8/0x428
[<000000152918d1b0>] worker_thread+0x48/0x460
[<00000015291942c6>] kthread+0x126/0x160
[<0000001529b22344>] ret_from_fork+0x28/0x30
[<0000001529b2234c>] kernel_thread_starter+0x0/0x10
Last Breaking-Event-Address:
[<000000152974684c>] percpu_ref_exit+0x4c/0x58
---[ end trace b035e7da5788eb09 ]---
The problem occurs because kmem_cache_destroy() is called immediately
after deleting of a memcg, so it races with the memcg kmem_cache
deactivation.
flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
supposed to guarantee that all deactivation processes are finished, but
failed to do so. It waits for an rcu grace period, after which all
children kmem_caches should be deactivated. During the deactivation
percpu_ref_kill() is called for non root kmem_cache refcounters, but it
requires yet another rcu grace period to finish the transition to the
atomic (dead) state.
So in a rare case when not all children kmem_caches are destroyed at the
moment when the root kmem_cache is about to be gone, we need to wait
another rcu grace period before destroying the root kmem_cache.
This issue can be triggered only with dynamically created kmem_caches
which are used with memcg accounting. In this case per-memcg child
kmem_caches are created. They are deactivated from the cgroup removing
path. If the destruction of the root kmem_cache is racing with the
removal of the cgroup (both are quite complicated multi-stage
processes), the described issue can occur. The only known way to
trigger it in the real life, is to unload some kernel module which
creates a dedicated kmem_cache, used from different memory cgroups with
GFP_ACCOUNT flag. If the unloading happens immediately after calling
rmdir on the corresponding cgroup, there is some chance to trigger the
issue.
Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com
Fixes: f0a3a24b53 ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I hit the following compile error in arch/x86/
mm/kasan/common.c: In function kasan_populate_vmalloc:
mm/kasan/common.c:797:2: error: implicit declaration of function flush_cache_vmap; did you mean flush_rcu_work? [-Werror=implicit-function-declaration]
flush_cache_vmap(shadow_start, shadow_end);
^~~~~~~~~~~~~~~~
flush_rcu_work
cc1: some warnings being treated as errors
Link: http://lkml.kernel.org/r/1575363013-43761-1-git-send-email-zhongjiang@huawei.com
Fixes: 3c5c3cfb9e ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* small x86 cleanup
* fix for an x86-specific out-of-bounds write on a ioctl (not guest triggerable,
data not attacker-controlled)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJd551cAAoJEL/70l94x66D+JkH/R3eEOyvckPmYmzd0lnV8mQ/
7e0n2G/aD+iLZkcCbUnMaImdmSJmoEEJCPjgPk/5nJ3zUi5b/ABWyidEM5uf19Hl
rzKBg0DR7BiQptPnZv2JMwEVKu3JOTchMykqu9xXChQlICocZ0xjdOA6nQ19p0Lv
FulDw5MUaWrXevIzCBskQ38zJejRQA6CpD1lQkHn7LKS9p3p+BsAOd/Ouy87RfWG
b3ktECNbXyO6KStrrhgm+z8pviWY+kqYklyBlDOOwxWif0x8WvNDpQLoVo+ZuhLU
Me8YJ1BN75vFlxzh6ZK5exBUnm9E3fGVKIaaF+dpuds2x+j4HnYl+lZCm89MdqY=
=Q4v7
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
- PPC secure guest support
- small x86 cleanup
- fix for an x86-specific out-of-bounds write on a ioctl (not guest
triggerable, data not attacker-controlled)
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: vmx: Stop wasting a page for guest_msrs
KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
Documentation: kvm: Fix mention to number of ioctls classes
powerpc: Ultravisor: Add PPC_UV config option
KVM: PPC: Book3S HV: Support reset of secure guest
KVM: PPC: Book3S HV: Handle memory plug/unplug to secure VM
KVM: PPC: Book3S HV: Radix changes for secure guest
KVM: PPC: Book3S HV: Shared pages support for secure guests
KVM: PPC: Book3S HV: Support for running secure guests
mm: ksm: Export ksm_madvise()
KVM x86: Move kvm cpuid support out of svm
If a block device supports rw_page operation, it doesn't submit bios so
the annotation in submit_bio() for refault stall doesn't work. It
happens with zram in android, especially swap read path which could
consume CPU cycle for decompress. It is also a problem for zswap which
uses frontswap.
Annotate swap_readpage() to account the synchronous IO overhead to
prevent underreport memory pressure.
[akpm@linux-foundation.org: add comment, per Johannes]
Link: http://lkml.kernel.org/r/20191010152134.38545-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
End a Kconfig help text sentence with a period (aka full stop).
Link: http://lkml.kernel.org/r/c17f2c75-dc2a-42a4-2229-bb6b489addf2@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several places emphasise the effect of __SetPageUptodate(),
while the comment seems to have a typo in two places.
Link: http://lkml.kernel.org/r/20190926023705.7226-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In 64bit system. sb->s_maxbytes of shmem filesystem is MAX_LFS_FILESIZE,
which equal LLONG_MAX.
If offset > LLONG_MAX - PAGE_SIZE, offset + len < LLONG_MAX in
shmem_fallocate, which will pass the checking in vfs_fallocate.
/* Check for wrap through zero too */
if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
return -EFBIG;
loff_t unmap_start = round_up(offset, PAGE_SIZE) in shmem_fallocate
causes a overflow.
Syzkaller reports a overflow problem in mm/shmem:
UBSAN: Undefined behaviour in mm/shmem.c:2014:10
signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int'
CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1
Hardware name: linux, dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100
show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238
__dump_stack lib/dump_stack.c:15 [inline]
ubsan_epilogue+0x18/0x70 lib/ubsan.c:164
handle_overflow+0x158/0x1b0 lib/ubsan.c:195
shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104
vfs_fallocate+0x238/0x428 fs/open.c:312
SYSC_fallocate fs/open.c:335 [inline]
SyS_fallocate+0x54/0xc8 fs/open.c:239
The highest bit of unmap_start will be appended with sign bit 1
(overflow) when calculate shmem_falloc.start:
shmem_falloc.start = unmap_start >> PAGE_SHIFT.
Fix it by casting the type of unmap_start to u64, when right shifted.
This bug is found in LTS Linux 4.1. It also seems to exist in mainline.
Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.com
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The shmem_writepage() uses GFP_ATOMIC to allocate swap cache. GFP_ATOMIC
used to mean __GFP_HIGH, but now it means __GFP_HIGH | __GFP_ATOMIC |
__GFP_KSWAPD_RECLAIM. However, shmem_writepage() should write out to swap
only in response to memory pressure, so __GFP_KSWAPD_RECLAIM looks useless
since the caller may be kswapd itself or in direct reclaim already.
In addition, XArray node allocations from PF_MEMALLOC contexts could
completely exhaust the page allocator, __GFP_NOMEMALLOC stops emergency
reserves from being allocated.
Here just copy the gfp flags used by add_to_swap().
Hugh:
"a cleanup to make the two calls look the same when they don't need to
be different (whereas the call from __read_swap_cache_async() rightly
uses a lower priority gfp)".
Link: http://lkml.kernel.org/r/1572991351-86061-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't populate the array 'values' on the stack but instead make it static
const. Makes the object code smaller by 111 bytes.
Before:
text data bss dec hex filename
108612 11169 512 120293 1d5e5 mm/shmem.o
After:
text data bss dec hex filename
108437 11233 512 120182 1d576 mm/shmem.o
(gcc version 9.2.1, amd64)
Link: http://lkml.kernel.org/r/20190906143012.28698-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing UFFDIO_COPY, it is necessary to find the correct destination
vma and make sure fault range is in it.
Since there are two places need to do the same task, just wrap those
common check into an inlined function.
Link: http://lkml.kernel.org/r/20190927070032.2129-3-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These warning here is to make sure address(dst_addr) and length(len -
copied) are huge page size aligned.
While this is ensured by:
dst_start and len is huge page size aligned
dst_addr equals to dst_start and increase huge page size each time
copied increase huge page size each time
This means these warnings will never be triggered.
Link: http://lkml.kernel.org/r/20190927070032.2129-2-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __mcopy_atomic_hugetlb() we use two variables to deal with huge page
size: vma_hpagesize and huge_page_size.
Since they are the same, it is not necessary to use two different
mechanism. This patch makes it consistent by all using vma_hpagesize.
Link: http://lkml.kernel.org/r/20190927070032.2129-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_size() is supported after the commit a50b854e07 ("mm: introduce
page_size()").
Use page_size() in madvise_inject_error() for readability.
[akpm@linux-foundation.org: use ulong for `size', per David]
Link: http://lkml.kernel.org/r/29dce60c-38d6-0220-f292-e298f0c78c4d@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Cc: Feilong Lin <linfeilong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Case 1/6, 2/7 and 3/8 have the same pattern and we handle them in the
same logic.
Rearrange the comment to make it a little easy for audience to
understand.
Link: http://lkml.kernel.org/r/20191030012445.16944-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file
operation rather than DEFINE_SIMPLE_ATTRIBUTE.
Link: http://lkml.kernel.org/r/1572403660-44718-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In auto NUMA balancing page table scanning, if the pte_protnone() is
true, the PTE needs not to be changed because it's in target state
already. So other checking on corresponding struct page is unnecessary
too.
So, if we check pte_protnone() firstly for each PTE, we can avoid
unnecessary struct page accessing, so that reduce the cache footprint of
NUMA balancing page table scanning.
In the performance test of pmbench memory accessing benchmark with 80:20
read/write ratio and normal access address distribution on a 2 socket
Intel server with Optance DC Persistent Memory, perf profiling shows
that the autonuma page table scanning time reduces from 1.23% to 0.97%
(that is, reduced 21%) with the patch.
Link: http://lkml.kernel.org/r/20191101075727.26683-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When zone_watermark_ok() is called in migrate_balanced_pgdat() to check
migration target node, the parameter classzone_idx (for requested zone)
is specified as 0 (ZONE_DMA). But when allocating memory for autonuma
in alloc_misplaced_dst_page(), the requested zone from GFP flags is
ZONE_MOVABLE. That is, the requested zone is different. The size of
lowmem_reserve for the different requested zone is different. And this
may cause some issues.
For example, in the zoneinfo of a test machine as below,
Node 0, zone DMA32
pages free 61592
min 29
low 454
high 879
spanned 1044480
present 442306
managed 425921
protection: (0, 0, 62457, 62457, 62457)
The free page number of ZONE_DMA32 is greater than "high watermark +
lowmem_reserve[ZONE_DMA]", but less than "high watermark +
lowmem_reserve[ZONE_MOVABLE]". And because __alloc_pages_node() in
alloc_misplaced_dst_page() requests ZONE_MOVABLE, the
zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat() may always
return true. So, autonuma may not stop even when memory pressure in
node 0 is heavy.
To fix the issue, ZONE_MOVABLE is used as parameter to call
zone_watermark_ok() in migrate_balanced_pgdat(). This makes it same as
requested zone in alloc_misplaced_dst_page(). So that
migrate_balanced_pgdat() returns false when memory pressure is heavy.
Link: http://lkml.kernel.org/r/20191101075727.26683-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file
operation rather than DEFINE_SIMPLE_ATTRIBUTE.
Link: http://lkml.kernel.org/r/1572348687-9951-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Yue Hu <huyue2@yulong.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kzalloc() is used for cma bitmap allocation in cma_activate_area(),
switch to bitmap_zalloc() for clarity.
Link: http://lkml.kernel.org/r/895d4627-f115-c77a-d454-c0a196116426@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Yue Hu <huyue2@yulong.com>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ryohei Suzuki <ryh.szk.cmnty@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For non-shmem file THPs, khugepaged only collapses read only .text
mapping (VM_DENYWRITE). These pages should not be dirty except the case
where the file hasn't been flushed since first write.
Call filemap_flush() in collapse_file() to accelerate the write back in
such cases.
Link: http://lkml.kernel.org/r/20191106060930.2571389-3-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding fully unmapped pages into deferred split queue is not productive:
these pages are about to be freed or they are pinned and cannot be split
anyway.
Link: http://lkml.kernel.org/r/20190913091849.11151-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing migration if the freed page is met, we just return without
migrating it since it is pointless to migrate a freed page. But, the
current code allocates target page unconditionally before handling freed
page, if the page is freed, the newly allocated will be just freed. It
doesn't make too much sense and is just a waste of time although
migrating freed page is rare.
So, handle freed page at the before that to avoid unnecessary page
allocation and free.
Link: http://lkml.kernel.org/r/1573755869-106954-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
split_huge_pages_fops is used for debugfs file. hence, it is more clear
to use DEFINE_DEBUGFS_ATTRIBUTE.
Link: http://lkml.kernel.org/r/1572347674-8111-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When mmapping an existing hugetlbfs file with MAP_POPULATE, we find it
is very time consuming. For example, mmapping a 128GB file takes about
50 milliseconds. Sampling with perfevent shows it spends 99% time in
the same_page loop in follow_hugetlb_page().
samples: 205 of event 'cycles', Event count (approx.): 136686374
- 99.04% test_mmap_huget [kernel.kallsyms] [k] follow_hugetlb_page
follow_hugetlb_page
__get_user_pages
__mlock_vma_pages_range
__mm_populate
vm_mmap_pgoff
sys_mmap_pgoff
sys_mmap
system_call_fastpath
__mmap64
follow_hugetlb_page() is called with pages=NULL and vmas=NULL, so for
each hugepage, we run into the same_page loop for pages_per_huge_page()
times, but doing nothing. With this change, it takes less then 1
millisecond to mmap a 128GB file in hugetlbfs.
Link: http://lkml.kernel.org/r/1567581712-5992-1-git-send-email-totty.lu@gmail.com
Signed-off-by: Zhigang Lu <tonnylu@tencent.com>
Reviewed-by: Haozhong Zhang <hzhongzhang@tencent.com>
Reviewed-by: Zongming Zhang <knightzhang@tencent.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first parameter hstate in function hugetlb_fault_mutex_hash() is not
used anymore.
This patch removes it.
[akpm@linux-foundation.org: various build fixes]
[cai@lca.pw: fix a GCC compilation warning]
Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove duplicated code between region_chg and region_add, and refactor
it into a common function, add_reservation_in_range. This is mostly
done because there is a follow up change in another series that disables
region coalescing in region_add, and I want to make that change in one
place only. It should improve maintainability anyway on its own.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20190919200428.188797-3-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current behavior is that region_chg provides both a cache entry in
resv->region_cache, AND a placeholder entry in resv->regions.
region_add first tries to use the placeholder, and if it finds that the
placeholder has been deleted by a racing region_del call, it uses the
cache entry.
This behavior is completely unnecessary and is removed in this patch for
a couple of reasons:
1. region_add needs to either find a cached file_region entry in
resv->region_cache, or find an entry in resv->regions to expand. It
does not need both.
2. region_chg adding a placeholder entry in resv->regions opens up
a possible race with region_del, where region_chg adds a placeholder
region in resv->regions, and this region is deleted by a racing call
to region_del during region_chg execution or before region_add is
called. Removing the race makes the code easier to reason about and
maintain.
In addition, a follow up patch in another series that disables region
coalescing, which would be further complicated if the race with
region_del exists.
Link: http://lkml.kernel.org/r/20190919200428.188797-2-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A customer with large SMP systems (up to 16 sockets) with application
that uses large amount of static hugepages (~500-1500GB) are
experiencing random multisecond delays. These delays were caused by the
long time it took to scan the VMA interval tree with mmap_sem held.
The sharing of huge PMD does not require changes to the i_mmap at all.
Therefore, we can just take the read lock and let other threads
searching for the right VMA share it in parallel. Once the right VMA is
found, either the PMD lock (2M huge page for x86-64) or the
mm->page_table_lock will be acquired to perform the actual PMD sharing.
Lock contention, if present, will happen in the spinlock. That is much
better than contention in the rwsem where the time needed to scan the
the interval tree is indeterminate.
With this patch applied, the customer is seeing significant performance
improvement over the unpatched kernel.
Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A new clang diagnostic (-Wsizeof-array-div) warns about the calculation
to determine the number of u32's in an array of unsigned longs.
Suppress warning by adding parentheses.
While looking at the above issue, noticed that the 'address' parameter
to hugetlb_fault_mutex_hash is no longer used. So, remove it from the
definition and all callers.
No functional change.
Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Ilie Halip <ilie.halip@gmail.com>
Cc: David Bolvansky <david.bolvansky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparse_buffer_init() use memblock_alloc_try_nid_raw() to allocate memory
for page management structure, if memory allocation fails from specified
node, it will fall back to allocate from other nodes.
Normally, the page management structure will not exceed 2% of the total
memory, but a large continuous block of allocation is needed. In most
cases, memory allocation from the specified node will succeed, but a
node memory become highly fragmented will fail. we expect to allocate
memory base section rather than by allocating a large block of memory
from other NUMA nodes
Add memblock_alloc_exact_nid_raw() for this situation, which allocate
boot memory block on the exact node. If a large contiguous block memory
allocate fail in sparse_buffer_init(), it will fall back to allocate
small block memory base section.
Link: http://lkml.kernel.org/r/66755ea7-ab10-8882-36fd-3e02b03775d5@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change "max_addr" to "end" for less confusion in
memblock_alloc_range_nid comments.
Link: http://lkml.kernel.org/r/20191113051822.3296-1-ruansy.fnst@cn.fujitsu.com
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fix typos for:
elaboarte -> elaborate
architecure -> architecture
compltes -> completes
And, convert the markup :c:func:`foo` to foo() as kernel documentation
toolchain can recognize foo() as a function.
Link: http://lkml.kernel.org/r/20190912123127.8694-1-caoj.fnst@cn.fujitsu.com
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mbind() is required to report EFAULT if range, specified by addr and
len, contains unmapped holes. In current implementation, below rules
are applied for this checking:
1: Unmapped holes at any part of the specified range should be reported
as EFAULT if mbind() for none MPOL_DEFAULT cases;
2: Unmapped holes at any part of the specified range should be ignored
(do not reprot EFAULT) if mbind() for MPOL_DEFAULT case;
3: The whole range in an unmapped hole should be reported as EFAULT;
Note that rule 2 does not fullfill the mbind() API definition, but since
that behavior has existed for long days (the internal flag
MPOL_MF_DISCONTIG_OK is for this purpose), this patch does not plan to
change it.
In current code, application observed inconsistent behavior on rule 1
and rule 2 respectively. That inconsistency is fixed as below details.
Cases of rule 1:
- Hole at head side of range. Current code reprot EFAULT, no change by
this patch.
[ vma ][ hole ][ vma ]
[ range ]
- Hole at middle of range. Current code report EFAULT, no change by
this patch.
[ vma ][ hole ][ vma ]
[ range ]
- Hole at tail side of range. Current code do not report EFAULT, this
patch fixes it.
[ vma ][ hole ][ vma ]
[ range ]
Cases of rule 2:
- Hole at head side of range. Current code reports EFAULT, this patch
fixes it.
[ vma ][ hole ][ vma ]
[ range ]
- Hole at middle of range. Current code does not report EFAULT, no
change by this patch.
[ vma ][ hole ][ vma]
[ range ]
- Hole at tail side of range. Current code does not report EFAULT, no
change by this patch.
[ vma ][ hole ][ vma]
[ range ]
This patch has no changes to rule 3.
The unmapped hole checking can also be handled by using .pte_hole(),
instead of .test_walk(). But .pte_hole() is called for holes inside and
outside vma, which causes more cost, so this patch keeps the original
design with .test_walk().
Link: http://lkml.kernel.org/r/1573218104-11021-3-git-send-email-lixinhai.lxh@gmail.com
Fixes: 6f4576e368 ("mempolicy: apply page table walker on queue_pages_range()")
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-man <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: Fix checking unmapped holes for mbind", v4.
This patchset fix checking unmapped holes for mbind().
First patch makes sure the vma been correctly tracked in .test_walk(),
so each time when .test_walk() is called, the neighborhood of two vma
is correct.
Current problem is that the !vma_migratable() check could cause return
immediately without update tracking to vma.
Second patch fix the inconsistent report of EFAULT when mbind() is
called for MPOL_DEFAULT and non MPOL_DEFAULT cases, so application do
not need to have workaround code to handle this special behavior.
Currently there are two problems, one is that the .test_walk() can not
know there is hole at tail side of range, because .test_walk() only
call for vma not for hole. The other one is that mbind_range() checks
for hole at head side of range but do not consider the
MPOL_MF_DISCONTIG_OK flag as done in .test_walk().
This patch (of 2):
Checking unmapped hole and updating the previous vma must be handled
first, otherwise the unmapped hole could be calculated from a wrong
previous vma.
Several commits were relevant to this error:
- commit 6f4576e368 ("mempolicy: apply page table walker on
queue_pages_range()")
This commit was correct, the VM_PFNMAP check was after updating
previous vma
- commit 48684a65b4 ("mm: pagewalk: fix misbehavior of
walk_page_range for vma(VM_PFNMAP)")
This commit added VM_PFNMAP check before updating previous vma. Then,
there were two VM_PFNMAP check did same thing twice.
- commit acda0c3340 ("mm/mempolicy.c: get rid of duplicated check for
vma(VM_PFNMAP) in queue_page s_range()")
This commit tried to fix the duplicated VM_PFNMAP check, but it
wrongly removed the one which was after updating vma.
Link: http://lkml.kernel.org/r/1573218104-11021-2-git-send-email-lixinhai.lxh@gmail.com
Fixes: acda0c3340 (mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_pages_range())
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-man <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For each page scheduled for compaction (e. g. by z3fold_free()), try to
apply inter-page compaction before running the traditional/ existing
intra-page compaction. That means, if the page has only one buddy, we
treat that buddy as a new object that we aim to place into an existing
z3fold page. If such a page is found, that object is transferred and the
old page is freed completely. The transferred object is named "foreign"
and treated slightly differently thereafter.
Namely, we increase "foreign handle" counter for the new page. Pages with
non-zero "foreign handle" count become unmovable. This patch implements
"foreign handle" detection when a handle is freed to decrement the foreign
handle counter accordingly, so a page may as well become movable again as
the time goes by.
As a result, we almost always have exactly 3 objects per page and
significantly better average compression ratio.
[cai@lca.pw: fix -Wunused-but-set-variable warnings]
Link: http://lkml.kernel.org/r/1570542062-29144-1-git-send-email-cai@lca.pw
[vitalywool@gmail.com: avoid subtle race when freeing slots]
Link: http://lkml.kernel.org/r/20191127152118.6314b99074b0626d4c5a8835@gmail.com
[vitalywool@gmail.com: compact objects more accurately]
Link: http://lkml.kernel.org/r/20191127152216.6ad33745a21ba71c53606acb@gmail.com
[vitalywool@gmail.com: protect handle reads]
Link: http://lkml.kernel.org/r/20191127152345.8059852f60947686674d726d@gmail.com
Link: http://lkml.kernel.org/r/20191006041457.24113-1-vitalywool@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We split the LRU lists into inactive and an active parts to maximize
workingset protection while allowing just enough inactive cache space to
faciltate readahead and writeback for one-off file accesses (e.g. a
linear scan through a file, or logging); or just enough inactive anon to
maintain recent reference information when reclaim needs to swap.
With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, inactive:active size
decisions are done on a per-cgroup level. As a result, we'll reclaim a
cgroup's workingset when it doesn't have cold pages, even when one of its
siblings has plenty of it that should be reclaimed first.
For example: workload A has 50M worth of hot cache but doesn't do any
one-off file accesses; meanwhile, parallel workload B scans files and
rarely accesses the same page twice.
If these workloads were to run in an uncgrouped system, A would be
protected from the high rate of cache faults from B. But if they were put
in parallel cgroups for memory accounting purposes, B's fast cache fault
rate would push out the hot cache pages of A. This is unexpected and
undesirable - the "scan resistance" of the page cache is broken.
This patch moves inactive:active size balancing decisions to the root of
reclaim - the same level where the LRU order is established.
It does this by looking at the recursive size of the inactive and the
active file sets of the cgroup subtree at the beginning of the reclaim
cycle, and then making a decision - scan or skip active pages - that
applies throughout the entire run and to every cgroup involved.
With that in place, in the test above, the VM will recognize that there
are plenty of inactive pages in the combined cache set of workloads A and
B and prefer the one-off cache in B over the hot pages in A. The scan
resistance of the cache is restored.
[cai@lca.pw: fix some -Wenum-conversion warnings]
Link: http://lkml.kernel.org/r/1573848697-29262-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20191107205334.158354-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We use refault information to determine whether the cache workingset is
stable or transitioning, and dynamically adjust the inactive:active file
LRU ratio so as to maximize protection from one-off cache during stable
periods, and minimize IO during transitions.
With cgroups and their nested LRU lists, we currently don't do this
correctly. While recursive cgroup reclaim establishes a relative LRU
order among the pages of all involved cgroups, refaults only affect the
local LRU order in the cgroup in which they are occuring. As a result,
cache transitions can take longer in a cgrouped system as the active pages
of sibling cgroups aren't challenged when they should be.
[ Right now, this is somewhat theoretical, because the siblings, under
continued regular reclaim pressure, should eventually run out of
inactive pages - and since inactive:active *size* balancing is also
done on a cgroup-local level, we will challenge the active pages
eventually in most cases. But the next patch will move that relative
size enforcement to the reclaim root as well, and then this patch
here will be necessary to propagate refault pressure to siblings. ]
This patch moves refault detection to the root of reclaim. Instead of
remembering the cgroup owner of an evicted page, remember the cgroup that
caused the reclaim to happen. When refaults later occur, they'll
correctly influence the cross-cgroup LRU order that reclaim follows.
I.e. if global reclaim kicked out pages in some subgroup A/B/C, the
refault of those pages will challenge the global LRU order, and not just
the local order down inside C.
[hannes@cmpxchg.org: use page_memcg() instead of another lookup]
Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org
Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: fix page aging across multiple cgroups".
When applications are put into unconfigured cgroups for memory accounting
purposes, the cgrouping itself should not change the behavior of the page
reclaim code. We expect the VM to reclaim the coldest pages in the
system. But right now the VM can reclaim hot pages in one cgroup while
there is eligible cold cache in others.
This is because one part of the reclaim algorithm isn't truly cgroup
hierarchy aware: the inactive/active list balancing. That is the part
that is supposed to protect hot cache data from one-off streaming IO.
The recursive cgroup reclaim scheme will scan and rotate the physical LRU
lists of each eligible cgroup at the same rate in a round-robin fashion,
thereby establishing a relative order among the pages of all those
cgroups. However, the inactive/active balancing decisions are made
locally within each cgroup, so when a cgroup is running low on cold pages,
its hot pages will get reclaimed - even when sibling cgroups have plenty
of cold cache eligible in the same reclaim run.
For example:
[root@ham ~]# head -n1 /proc/meminfo
MemTotal: 1016336 kB
[root@ham ~]# ./reclaimtest2.sh
Establishing 50M active files in cgroup A...
Hot pages cached: 12800/12800 workingset-a
Linearly scanning through 18G of file data in cgroup B:
real 0m4.269s
user 0m0.051s
sys 0m4.182s
Hot pages cached: 134/12800 workingset-a
The streaming IO in B, which doesn't benefit from caching at all, pushes
out most of the workingset in A.
Solution
This series fixes the problem by elevating inactive/active balancing
decisions to the toplevel of the reclaim run. This is either a cgroup
that hit its limit, or straight-up global reclaim if there is physical
memory pressure. From there, it takes a recursive view of the cgroup
subtree to decide whether page deactivation is necessary.
In the test above, the VM will then recognize that cgroup B has plenty of
eligible cold cache, and that the hot pages in A can be spared:
[root@ham ~]# ./reclaimtest2.sh
Establishing 50M active files in cgroup A...
Hot pages cached: 12800/12800 workingset-a
Linearly scanning through 18G of file data in cgroup B:
real 0m4.244s
user 0m0.064s
sys 0m4.177s
Hot pages cached: 12800/12800 workingset-a
Implementation
Whether active pages can be deactivated or not is influenced by two
factors: the inactive list dropping below a minimum size relative to the
active list, and the occurence of refaults.
This patch series first moves refault detection to the reclaim root, then
enforces the minimum inactive size based on a recursive view of the cgroup
tree's LRUs.
History
Note that this actually never worked correctly in Linux cgroups. In the
past it worked for global reclaim and leaf limit reclaim only (we used to
have two physical LRU linkages per page), but it never worked for
intermediate limit reclaim over multiple leaf cgroups.
We're noticing this now because 1) we're putting everything into cgroups
for accounting, not just the things we want to control and 2) we're moving
away from leaf limits that invoke reclaim on individual cgroups, toward
large tree reclaim, triggered by high-level limits, or physical memory
pressure that is influenced by local protections such as memory.low and
memory.min instead.
This patch (of 3):
When file pages are lower than the watermark on a node, we try to force
scan anonymous pages to counter-act the balancing algorithms preference
for new file pages when they are likely thrashing. This is a node-level
decision, but it's currently made each time we look at an lruvec. This is
unnecessarily expensive and also a layering violation that makes the code
harder to understand.
Clean this up by making the check once per node and setting a flag in the
scan_control.
Link: http://lkml.kernel.org/r/20191107205334.158354-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current writeback congestion tracking has separate flags for kswapd
reclaim (node level) and cgroup limit reclaim (memcg-node level). This is
unnecessarily complicated: the lruvec is an existing abstraction layer for
that node-memcg intersection.
Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the
reclaim root level, which is either the NUMA node for global reclaim, or
the cgroup-node intersection for cgroup reclaim.
Link: http://lkml.kernel.org/r/20191022144803.302233-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function is getting long and unwieldy, split out the memcg bits.
The updated shrink_node() handles the generic (node) reclaim aspects:
- global vmpressure notifications
- writeback and congestion throttling
- reclaim/compaction management
- kswapd giving up on unreclaimable nodes
It then calls a new shrink_node_memcgs() which handles cgroup specifics:
- the cgroup tree traversal
- memory.low considerations
- per-cgroup slab shrinking callbacks
- per-cgroup vmpressure notifications
[hannes@cmpxchg.org: rename "root" to "target_memcg", per Roman]
Link: http://lkml.kernel.org/r/20191025143640.GA386981@cmpxchg.org
Link: http://lkml.kernel.org/r/20191022144803.302233-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An lruvec holds LRU pages owned by a certain NUMA node and cgroup.
Instead of awkwardly passing around a combination of a pgdat and a memcg
pointer, pass down the lruvec as soon as we can look it up.
Nested callers that need to access node or cgroup properties can look them
them up if necessary, but there are only a few cases.
Link: http://lkml.kernel.org/r/20191022144803.302233-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of the function body is inside a loop, which imposes an additional
indentation and scoping level that makes the code a bit hard to follow and
modify.
The looping only happens in case of reclaim-compaction, which isn't the
common case. So rather than adding yet another function level to the
reclaim path and have every reclaim invocation go through a level that
only exists for one specific cornercase, use a retry goto.
Link: http://lkml.kernel.org/r/20191022144803.302233-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Seven years after introducing the global_reclaim() function, I still have
to double take when reading a callsite. I don't know how others do it,
this is a terrible name.
Invert the meaning and rename it to cgroup_reclaim().
[ After all, "global reclaim" is just regular reclaim invoked from the
page allocator. It's reclaim on behalf of a cgroup limit that is a
special case of reclaim, and should be explicit - not the reverse. ]
sane_reclaim() isn't very descriptive either: it tests whether we can use
the regular writeback throttling - available during regular page reclaim
or cgroup2 limit reclaim - or need to use the broken
wait_on_page_writeback() method. Use "writeback_throttling_sane()".
Link: http://lkml.kernel.org/r/20191022144803.302233-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
inactive_list_is_low() should be about one thing: checking the ratio
between inactive and active list. Kitchensink checks like the one for
swap space makes the function hard to use and modify its callsites.
Luckly, most callers already have an understanding of the swap situation,
so it's easy to clean up.
get_scan_count() has its own, memcg-aware swap check, and doesn't even get
to the inactive_list_is_low() check on the anon list when there is no swap
space available.
shrink_list() is called on the results of get_scan_count(), so that check
is redundant too.
age_active_anon() has its own totalswap_pages check right before it checks
the list proportions.
The shrink_node_memcg() site is the only one that doesn't do its own swap
check. Add it there.
Then delete the swap check from inactive_list_is_low().
Link: http://lkml.kernel.org/r/20191022144803.302233-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
used is somewhat confusing right now, and it's easy to make mistakes -
especially when it comes to global reclaim.
How it works: when memory cgroups are enabled, we always use the
root_mem_cgroup's per-node lruvecs. When memory cgroups are not compiled
in or disabled at runtime, we use pgdat->lruvec.
Document that in a comment.
Due to the way the reclaim code is generalized, all lookups use the
mem_cgroup_lruvec() helper function, and nobody should have to find the
right lruvec manually right now. But to avoid future mistakes, rename the
pgdat->lruvec member to pgdat->__lruvec and delete the convenience wrapper
that suggests it's a commonly accessed member.
While in this area, swap the mem_cgroup_lruvec() argument order. The name
suggests a memcg operation, yet it takes a pgdat first and a memcg second.
I have to double take every time I call this. Fix that.
Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: vmscan: cgroup-related cleanups".
Here are 8 patches that clean up the reclaim code's interaction with
cgroups a bit. They're not supposed to change any behavior, just make
the implementation easier to understand and work with.
This patch (of 8):
This function currently takes the node or lruvec size and subtracts the
zones that are excluded by the classzone index of the allocation. It uses
four different types of counters to do this.
Just add up the eligible zones.
[cai@lca.pw: fix an undefined behavior for zone id]
Link: http://lkml.kernel.org/r/20191108204407.1435-1-cai@lca.pw
[akpm@linux-foundation.org: deal with the MAX_NR_ZONES special case. per Qian Cai]
Link: http://lkml.kernel.org/r/64E60F6F-7582-427B-8DD5-EF97B1656F5A@lca.pw
Link: http://lkml.kernel.org/r/20191022144803.302233-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since lumpy reclaim was removed in v3.5 scan_control is not used by
may_write_to_{queue|inode} and pageout() anymore, remove the unused
parameter.
Link: http://lkml.kernel.org/r/1570124498-19300-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 9092c71bb7 ("mm: use sc->priority for slab shrink targets") the
argument 'unsigned long *lru_pages' passed around with no purpose. Remove
it.
Link: http://lkml.kernel.org/r/20190228083329.31892-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Print nr_reserved_highatomic in show_free_areas, because when alloc_harder
is false, this value will be subtracted from the free_pages in
__zone_watermark_ok. Printing this value can help analyze memory
allocaction failure issues.
Link: http://lkml.kernel.org/r/19515f3de2fb6abe66b52e03e4b676a21e82beda.1573634806.git.lijiazi@xiaomi.com
Signed-off-by: lijiazi <lijiazi@xiaomi.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory hotplug needs to be able to reset and reinit the pcpu allocator
batch and high limits but this action is internal to the VM. Move the
declaration to internal.h
Link: http://lkml.kernel.org/r/20191021094808.28824-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both the percpu_pagelist_fraction sysctl handler and memory hotplug have
a common requirement of updating the pcpu page allocation batch and high
values. Split the relevant helper to share common code.
No functional change.
Link: http://lkml.kernel.org/r/20191021094808.28824-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
HugeTLB helper alloc_gigantic_page() implements fairly generic
allocation method where it scans over various zones looking for a large
contiguous pfn range before trying to allocate it with
alloc_contig_range().
Other than deriving the requested order from 'struct hstate', there is
nothing HugeTLB specific in there. This can be made available for
general use to allocate contiguous memory which could not have been
allocated through the buddy allocator.
alloc_gigantic_page() has been split carving out actual allocation
method which is then made available via new alloc_contig_pages() helper
wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have
been replaced with more generic term 'contig'. Allocated pages here
should be freed with free_contig_range() or by calling __free_page() on
each allocated page.
Link: http://lkml.kernel.org/r/1571300646-32240-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kasan: support backing vmalloc space with real shadow
memory", v11.
Currently, vmalloc space is backed by the early shadow page. This means
that kasan is incompatible with VMAP_STACK.
This series provides a mechanism to back vmalloc space with real,
dynamically allocated memory. I have only wired up x86, because that's
the only currently supported arch I can work with easily, but it's very
easy to wire up other architectures, and it appears that there is some
work-in-progress code to do this on arm64 and s390.
This has been discussed before in the context of VMAP_STACK:
- https://bugzilla.kernel.org/show_bug.cgi?id=202009
- https://lkml.org/lkml/2018/7/22/198
- https://lkml.org/lkml/2019/7/19/822
In terms of implementation details:
Most mappings in vmalloc space are small, requiring less than a full
page of shadow space. Allocating a full shadow page per mapping would
therefore be wasteful. Furthermore, to ensure that different mappings
use different shadow pages, mappings would have to be aligned to
KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
Instead, share backing space across multiple mappings. Allocate a
backing page when a mapping in vmalloc space uses a particular page of
the shadow region. This page can be shared by other vmalloc mappings
later on.
We hook in to the vmap infrastructure to lazily clean up unused shadow
memory.
Testing with test_vmalloc.sh on an x86 VM with 2 vCPUs shows that:
- Turning on KASAN, inline instrumentation, without vmalloc, introuduces
a 4.1x-4.2x slowdown in vmalloc operations.
- Turning this on introduces the following slowdowns over KASAN:
* ~1.76x slower single-threaded (test_vmalloc.sh performance)
* ~2.18x slower when both cpus are performing operations
simultaneously (test_vmalloc.sh sequential_test_order=1)
This is unfortunate but given that this is a debug feature only, not the
end of the world. The benchmarks are also a stress-test for the vmalloc
subsystem: they're not indicative of an overall 2x slowdown!
This patch (of 4):
Hook into vmalloc and vmap, and dynamically allocate real shadow memory
to back the mappings.
Most mappings in vmalloc space are small, requiring less than a full
page of shadow space. Allocating a full shadow page per mapping would
therefore be wasteful. Furthermore, to ensure that different mappings
use different shadow pages, mappings would have to be aligned to
KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
Instead, share backing space across multiple mappings. Allocate a
backing page when a mapping in vmalloc space uses a particular page of
the shadow region. This page can be shared by other vmalloc mappings
later on.
We hook in to the vmap infrastructure to lazily clean up unused shadow
memory.
To avoid the difficulties around swapping mappings around, this code
expects that the part of the shadow region that covers the vmalloc space
will not be covered by the early shadow page, but will be left unmapped.
This will require changes in arch-specific code.
This allows KASAN with VMAP_STACK, and may be helpful for architectures
that do not have a separate module space (e.g. powerpc64, which I am
currently working on). It also allows relaxing the module alignment
back to PAGE_SIZE.
Testing with test_vmalloc.sh on an x86 VM with 2 vCPUs shows that:
- Turning on KASAN, inline instrumentation, without vmalloc, introuduces
a 4.1x-4.2x slowdown in vmalloc operations.
- Turning this on introduces the following slowdowns over KASAN:
* ~1.76x slower single-threaded (test_vmalloc.sh performance)
* ~2.18x slower when both cpus are performing operations
simultaneously (test_vmalloc.sh sequential_test_order=3D1)
This is unfortunate but given that this is a debug feature only, not the
end of the world.
The full benchmark results are:
Performance
No KASAN KASAN original x baseline KASAN vmalloc x baseline x KASAN
fix_size_alloc_test 662004 11404956 17.23 19144610 28.92 1.68
full_fit_alloc_test 710950 12029752 16.92 13184651 18.55 1.10
long_busy_list_alloc_test 9431875 43990172 4.66 82970178 8.80 1.89
random_size_alloc_test 5033626 23061762 4.58 47158834 9.37 2.04
fix_align_alloc_test 1252514 15276910 12.20 31266116 24.96 2.05
random_size_align_alloc_te 1648501 14578321 8.84 25560052 15.51 1.75
align_shift_alloc_test 147 830 5.65 5692 38.72 6.86
pcpu_alloc_test 80732 125520 1.55 140864 1.74 1.12
Total Cycles 119240774314 763211341128 6.40 1390338696894 11.66 1.82
Sequential, 2 cpus
No KASAN KASAN original x baseline KASAN vmalloc x baseline x KASAN
fix_size_alloc_test 1423150 14276550 10.03 27733022 19.49 1.94
full_fit_alloc_test 1754219 14722640 8.39 15030786 8.57 1.02
long_busy_list_alloc_test 11451858 52154973 4.55 107016027 9.34 2.05
random_size_alloc_test 5989020 26735276 4.46 68885923 11.50 2.58
fix_align_alloc_test 2050976 20166900 9.83 50491675 24.62 2.50
random_size_align_alloc_te 2858229 17971700 6.29 38730225 13.55 2.16
align_shift_alloc_test 405 6428 15.87 26253 64.82 4.08
pcpu_alloc_test 127183 151464 1.19 216263 1.70 1.43
Total Cycles 54181269392 308723699764 5.70 650772566394 12.01 2.11
fix_size_alloc_test 1420404 14289308 10.06 27790035 19.56 1.94
full_fit_alloc_test 1736145 14806234 8.53 15274301 8.80 1.03
long_busy_list_alloc_test 11404638 52270785 4.58 107550254 9.43 2.06
random_size_alloc_test 6017006 26650625 4.43 68696127 11.42 2.58
fix_align_alloc_test 2045504 20280985 9.91 50414862 24.65 2.49
random_size_align_alloc_te 2845338 17931018 6.30 38510276 13.53 2.15
align_shift_alloc_test 472 3760 7.97 9656 20.46 2.57
pcpu_alloc_test 118643 132732 1.12 146504 1.23 1.10
Total Cycles 54040011688 309102805492 5.72 651325675652 12.05 2.11
[dja@axtens.net: fixups]
Link: http://lkml.kernel.org/r/20191120052719.7201-1-dja@axtens.net
Link: https://bugzilla.kernel.org/show_bug.cgi?id=3D202009
Link: http://lkml.kernel.org/r/20191031093909.9228-2-dja@axtens.net
Signed-off-by: Mark Rutland <mark.rutland@arm.com> [shadow rework]
Signed-off-by: Daniel Axtens <dja@axtens.net>
Co-developed-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the new allocation approach introduced in the 5.2 kernel, it
becomes possible to get rid of one global spinlock. By doing that we
can further improve the KVA from the performance point of view.
Basically we can have two independent locks, one for allocation part and
another one for deallocation, because of two different entities: "free
data structures" and "busy data structures".
As a result, allocation/deallocation operations can still interfere
between each other in case of running simultaneously on different CPUs,
it means there is still dependency, but with two locks it becomes lower.
Summarizing:
- it reduces the high lock contention
- it allows to perform operations on "free" and "busy"
trees in parallel on different CPUs. Please note it
does not solve scalability issue.
Test results:
In order to evaluate this patch, we can run "vmalloc test driver" to see
how many CPU cycles it takes to complete all test cases running
sequentially. All online CPUs run it so it will cause a high lock
contention.
HiKey 960, ARM64, 8xCPUs, big.LITTLE:
<snip>
sudo ./test_vmalloc.sh sequential_test_order=1
<snip>
<default>
[ 390.950557] All test took CPU0=457126382 cycles
[ 391.046690] All test took CPU1=454763452 cycles
[ 391.128586] All test took CPU2=454539334 cycles
[ 391.222669] All test took CPU3=455649517 cycles
[ 391.313946] All test took CPU4=388272196 cycles
[ 391.410425] All test took CPU5=384036264 cycles
[ 391.492219] All test took CPU6=387432964 cycles
[ 391.578433] All test took CPU7=387201996 cycles
<default>
<patched>
[ 304.721224] All test took CPU0=391521310 cycles
[ 304.821219] All test took CPU1=393533002 cycles
[ 304.917120] All test took CPU2=392243032 cycles
[ 305.008986] All test took CPU3=392353853 cycles
[ 305.108944] All test took CPU4=297630721 cycles
[ 305.196406] All test took CPU5=297548736 cycles
[ 305.288602] All test took CPU6=297092392 cycles
[ 305.381088] All test took CPU7=297293597 cycles
<patched>
~14%-23% patched variant is better.
Link: http://lkml.kernel.org/r/20191022155800.20468-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When fit type is NE_FIT_TYPE there is a need in one extra object.
Usually the "ne_fit_preload_node" per-CPU variable has it and there is
no need in GFP_NOWAIT allocation, but there are exceptions.
This commit just adds more explanations, as a result giving answers on
questions like when it can occur, how often, under which conditions and
what happens if GFP_NOWAIT gets failed.
Link: http://lkml.kernel.org/r/20191016095438.12391-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocation functions should comply with the given gfp_mask as much as
possible. The preallocation code in alloc_vmap_area doesn't follow that
pattern and it is using a hardcoded GFP_KERNEL. Although this doesn't
really make much difference because vmalloc is not GFP_NOWAIT compliant
in general (e.g. page table allocations are GFP_KERNEL) there is no
reason to spread that bad habit and it is good to fix the antipattern.
[mhocko@suse.com: rewrite changelog]
Link: http://lkml.kernel.org/r/20191016095438.12391-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some background. The preemption was disabled before to guarantee that a
preloaded object is available for a CPU, it was stored for. That was
achieved by combining the disabling the preemption and taking the spin
lock while the ne_fit_preload_node is checked.
The aim was to not allocate in atomic context when spinlock is taken
later, for regular vmap allocations. But that approach conflicts with
CONFIG_PREEMPT_RT philosophy. It means that calling spin_lock() with
disabled preemption is forbidden in the CONFIG_PREEMPT_RT kernel.
Therefore, get rid of preempt_disable() and preempt_enable() when the
preload is done for splitting purpose. As a result we do not guarantee
now that a CPU is preloaded, instead we minimize the case when it is
not, with this change, by populating the per cpu preload pointer under
the vmap_area_lock.
This implies that at least each caller that has done the preallocation
will not fallback to an atomic allocation later. It is possible that
the preallocation would be pointless or that no preallocation is done
because of the race but the data shows that this is really rare.
For example i run the special test case that follows the preload pattern
and path. 20 "unbind" threads run it and each does 1000000 allocations.
Only 3.5 times among 1000000 a CPU was not preloaded. So it can happen
but the number is negligible.
[mhocko@suse.com: changelog additions]
Link: http://lkml.kernel.org/r/20191016095438.12391-1-urezki@gmail.com
Fixes: 82dd23e84b ("mm/vmalloc.c: preload a CPU with one object for split purpose")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Daniel Wagner <dwagner@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gfpflags_allow_blocking() does not care about __GFP_HIGHMEM, so
highmem_mask can be removed.
Link: http://lkml.kernel.org/r/1568812319-3467-1-git-send-email-liuxiang_1999@126.com
Signed-off-by: Liu Xiang <liuxiang_1999@126.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vincent has noticed [1] that there is something unusual with the memmap
allocations going on on his platform
: I noticed this because on my ARM64 platform, with 1 GiB of memory the
: first [and only] section is allocated from the zeroing path while with
: 2 GiB of memory the first 1 GiB section is allocated from the
: non-zeroing path.
The underlying problem is that although sparse_buffer_init allocates
enough memory for all sections on the node sparse_buffer_alloc is not
able to consume them due to mismatch in the expected allocation
alignement. While sparse_buffer_init preallocation uses the PAGE_SIZE
alignment the real memmap has to be aligned to section_map_size() this
results in a wasted initial chunk of the preallocated memmap and
unnecessary fallback allocation for a section.
While we are at it also change __populate_section_memmap to align to the
requested size because at least VMEMMAP has constrains to have memmap
properly aligned.
[1] http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com
[akpm@linux-foundation.org: tweak layout, per David]
Link: http://lkml.kernel.org/r/20191119092642.31799-1-mhocko@kernel.org
Fixes: 35fd1eb1e8 ("mm/sparse: abstract sparse buffer allocations")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Debugged-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Building the kernel on s390 with -Og produces the following warning:
WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap()
The function populate_section_memmap() references
the function __meminit __populate_section_memmap().
This is often because populate_section_memmap lacks a __meminit
annotation or the annotation of __populate_section_memmap is wrong.
While -Og is not supported, in theory this might still happen with
another compiler or on another architecture. So fix this by using the
correct section annotations.
[iii@linux.ibm.com: v2]
Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com
Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparsemem without VMEMMAP has two allocation paths to allocate the
memory needed for its memmap (done in sparse_mem_map_populate()).
In one allocation path (sparse_buffer_alloc() succeeds), the memory is
not zeroed (since it was previously allocated with
memblock_alloc_try_nid_raw()).
In the other allocation path (sparse_buffer_alloc() fails and
sparse_mem_map_populate() falls back to memblock_alloc_try_nid()), the
memory is zeroed.
AFAICS this difference does not appear to be on purpose. If the code is
supposed to work with non-initialized memory (__init_single_page() takes
care of zeroing the struct pages which are actually used), we should
consistently not zero the memory, to avoid masking bugs.
( I noticed this because on my ARM64 platform, with 1 GiB of memory the
first [and only] section is allocated from the zeroing path while with
2 GiB of memory the first 1 GiB section is allocated from the
non-zeroing path. )
Michal:
"the main user visible problem is a memory wastage. The overal amount
of memory should be small. I wouldn't call it stable material."
Link: http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes (a range that is not
IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see
add_memory_resource()). All memory blocks that belong to boot memory
are already online.
Note that boot memory can have holes and the memmap of the holes is
marked PG_reserved. However, also memory allocated early during boot is
PG_reserved - basically every page of boot memory that is not given to
the buddy is PG_reserved.
Therefore, when we stop allowing to offline memory blocks with holes, we
implicitly no longer have to deal with onlining memory blocks with
holes. E.g., online_pages() will do a walk_system_ram_range(...,
online_pages_range), whereby online_pages_range() will effectively only
free the memory holes not falling into a hole to the buddy. The other
pages (holes) are kept PG_reserved (via
move_pfn_range_to_zone()->memmap_init_zone()).
This allows to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory. We can stop setting pages PG_reserved completely in
memmap_init_zone().
Offlining memory blocks added during boot is usually not guaranteed to
work either way (unmovable data might have easily ended up on that
memory during boot). So stopping to do that should not really hurt.
Also, people are not even aware of a setup where onlining/offlining of
memory blocks with holes used to work reliably (see [1] and [2]
especially regarding the hotplug path) - I doubt it worked reliably.
For the use case of offlining memory to unplug DIMMs, we should see no
change. (holes on DIMMs would be weird).
Please note that hardware errors (PG_hwpoison) are not memory holes and
are not affected by this change when offlining.
[1] https://lkml.org/lkml/2019/10/22/135
[2] https://lkml.org/lkml/2019/8/14/1365
Link: http://lkml.kernel.org/r/20191119115237.6662-1-david@redhat.com
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have two types of users of page isolation:
1. Memory offlining: Offline memory so it can be unplugged. Memory
won't be touched.
2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to
become the owner of the memory and make use of
it.
For example, in case we want to offline memory, we can ignore (skip
over) PageHWPoison() pages, as the memory won't get used. We can allow
to offline memory. In contrast, we don't want to allow to allocate such
memory.
Let's generalize the approach so we can special case other types of
pages we want to skip over in case we offline memory. While at it, also
pass the same flags to test_pages_isolated().
Link: http://lkml.kernel.org/r/20191021172353.3056-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: Memory offlining + page isolation cleanups", v2.
This patch (of 2):
We call __offline_isolated_pages() from __offline_pages() after all
pages were isolated and are either free (PageBuddy()) or PageHWPoison.
Nothing can stop us from offlining memory at this point.
In __offline_isolated_pages() we first set all affected memory sections
offline (offline_mem_sections(pfn, end_pfn)), to mark the memmap as
invalid (pfn_to_online_page() will no longer succeed), and then walk
over all pages to pull the free pages from the free lists (to the
isolated free lists, to be precise).
Note that re-onlining a memory block will result in the whole memmap
getting reinitialized, overwriting any old state. We already poision
the memmap when offlining is complete to find any access to
stale/uninitialized memmaps.
So, setting the pages PageReserved() is not helpful. The memap is
marked offline and all pageblocks are isolated. As soon as offline, the
memmap is stale either way.
This looks like a leftover from ancient times where we initialized the
memmap when adding memory and not when onlining it (the pages were set
PageReserved so re-onling would work as expected).
Link: http://lkml.kernel.org/r/20191021172353.3056-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's drop the now unused functions.
Link: http://lkml.kernel.org/r/20190909114830.662-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/memory_hotplug: Export generic_online_page()".
Let's replace the __online_page...() functions by generic_online_page().
Hyper-V only wants to delay the actual onlining of un-backed pages, so
we can simpy re-use the generic function.
This patch (of 3):
Let's expose generic_online_page() so online_page_callback users can
simply fall back to the generic implementation when actually deciding to
online the pages.
Link: http://lkml.kernel.org/r/20190909114830.662-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On PowerPC, the address ranges allocated to OpenCAPI LPC memory are
allocated from firmware. These address ranges may be higher than what
older kernels permit, as we increased the maximum permissable address in
commit 4ffe713b75 ("powerpc/mm: Increase the max addressable memory to
2PB"). It is possible that the addressable range may change again in
the future.
In this scenario, we end up with a bogus section returned from
__section_nr (see the discussion on the thread "mm: Trigger bug on if a
section is not found in __section_nr").
Adding a check here means that we fail early and have an opportunity to
handle the error gracefully, rather than rumbling on and potentially
accessing an incorrect section.
Further discussion is also on the thread ("powerpc: Perform a bounds
check in arch_add_memory")
http://lkml.kernel.org/r/20190827052047.31547-1-alastair@au1.ibm.com
Link: http://lkml.kernel.org/r/20191001004617.7536-2-alastair@au1.ibm.com
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently during memory hot add procedure, memory gets into memblock
before calling arch_add_memory() which creates its linear mapping.
add_memory_resource() {
..................
memblock_add_node()
..................
arch_add_memory()
..................
}
But during memory hot remove procedure, removal from memblock happens
first before its linear mapping gets teared down with
arch_remove_memory() which is not consistent. Resource removal should
happen in reverse order as they were added. However this does not pose
any problem for now, unless there is an assumption regarding linear
mapping. One example was a subtle failure on arm64 platform [1].
Though this has now found a different solution.
try_remove_memory() {
..................
memblock_free()
memblock_remove()
..................
arch_remove_memory()
..................
}
This changes the sequence of resource removal including memblock and
linear mapping tear down during memory hot remove which will now be the
reverse order in which they were added during memory hot add. The
changed removal order looks like the following.
try_remove_memory() {
..................
arch_remove_memory()
..................
memblock_free()
memblock_remove()
..................
}
[1] https://patchwork.kernel.org/patch/11127623/
Memory hot remove now works on arm64 without this because a recent
commit 60bb462fc7ad ("drivers/base/node.c: simplify
unregister_memory_block_under_nodes()").
This does not fix a serious problem. It just removes an inconsistency
while freeing resources during memory hot remove which for now does not
pose a real problem.
David mentioned that re-ordering should still make sense for consistency
purpose (removing stuff in the reverse order they were added). This
patch is now detached from arm64 hot-remove series.
Michal:
: I would just a note that the inconsistency doesn't pose any problem now
: but if somebody makes any assumptions about linear mappings then it could
: get subtly broken like your example for arm64 which has found a different
: solution in the meantime.
Link: http://lkml.kernel.org/r/1569380273-7708-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_shift() is supported after the commit 94ad933810 ("mm: introduce
page_shift()").
So replace with page_shift() in add_to_kill() for readability.
Link: http://lkml.kernel.org/r/543d8bc9-f2e7-3023-7c35-2e7ed67c0e82@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently soft_offline_page() receives struct page, and its sibling
memory_failure() receives pfn. This discrepancy looks weird and makes
precheck on pfn validity tricky. So let's align them.
Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
add_to_kill() expects the first 'tk' to be pre-allocated, it makes
subsequent allocations on need basis, this makes the code a bit
difficult to read.
Move all the allocation internal to add_to_kill() and drop the **tk
argument.
Link: http://lkml.kernel.org/r/1565112345-28754-2-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
F_SEAL_FUTURE_WRITE has unexpected behavior when used with MAP_PRIVATE:
A private mapping created after the memfd file that gets sealed with
F_SEAL_FUTURE_WRITE loses the copy-on-write at fork behavior, meaning
children and parent share the same memory, even though the mapping is
private.
The reason for this is due to the code below:
static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
{
struct shmem_inode_info *info = SHMEM_I(file_inode(file));
if (info->seals & F_SEAL_FUTURE_WRITE) {
/*
* New PROT_WRITE and MAP_SHARED mmaps are not allowed when
* "future write" seal active.
*/
if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
return -EPERM;
/*
* Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED
* read-only mapping, take care to not allow mprotect to revert
* protections.
*/
vma->vm_flags &= ~(VM_MAYWRITE);
}
...
}
And for the mm to know if a mapping is copy-on-write:
static inline bool is_cow_mapping(vm_flags_t flags)
{
return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
The patch fixes the issue by making the mprotect revert protection
happen only for shared mappings. For private mappings, using mprotect
will have no effect on the seal behavior.
The F_SEAL_FUTURE_WRITE feature was introduced in v5.1 so v5.3.x stable
kernels would need a backport.
[akpm@linux-foundation.org: reflow comment, per Christoph]
Link: http://lkml.kernel.org/r/20191107195355.80608-1-joel@joelfernandes.org
Fixes: ab3948f58f ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
Signed-off-by: Nicolas Geoffray <ngeoffray@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A huge pud page can theoretically be faulted in racing with pmd_alloc()
in __handle_mm_fault(). That will lead to pmd_alloc() returning an
invalid pmd pointer.
Fix this by adding a pud_trans_unstable() function similar to
pmd_trans_unstable() and check whether the pud is really stable before
using the pmd pointer.
Race:
Thread 1: Thread 2: Comment
create_huge_pud() Fallback - not taken.
create_huge_pud() Taken.
pmd_alloc() Returns an invalid pointer.
This will result in user-visible huge page data corruption.
Note that this was caught during a code audit rather than a real
experienced problem. It looks to me like the only implementation that
currently creates huge pud pagetable entries is dev_dax_huge_fault()
which doesn't appear to care much about private (COW) mappings or
write-tracking which is, I believe, a prerequisite for create_huge_pud()
falling back on thread 1, but not in thread 2.
Link: http://lkml.kernel.org/r/20191115115808.21181-2-thomas_os@shipmail.org
Fixes: a00cc7d9dd ("mm, x86: add support for PUD-sized transparent hugepages")
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The __page_check_anon_rmap() just calls two BUG_ON()s protected by
CONFIG_DEBUG_VM, the #ifdef could be eliminated by using VM_BUG_ON_PAGE().
Link: http://lkml.kernel.org/r/1573157346-111316-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace DESTROY_BY_RCU with SLAB_TYPESAFE_BY_RCU because
SLAB_DESTROY_BY_RCU has been renamed to SLAB_TYPESAFE_BY_RCU by commit
5f0d5a3ae7 ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU")
Link: http://lkml.kernel.org/r/20191017093554.22562-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This came up when removing __ARCH_HAS_5LEVEL_HACK for ARC as code bloat.
With this patch we see the following code reduction.
| bloat-o-meter2 vmlinux-D-elide-p4d_free_tlb vmlinux-E-elide-p?d_clear_bad
| add/remove: 0/2 grow/shrink: 0/0 up/down: 0/-40 (-40)
| function old new delta
| pud_clear_bad 20 - -20
| p4d_clear_bad 20 - -20
| Total: Before=4136930, After=4136890, chg -1.000000%
Link: http://lkml.kernel.org/r/20191016162400.14796-6-vgupta@synopsys.com
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_unmapped_area() returns an address or -errno on failure. Historically
we have checked for the failure by offset_in_page() which is correct but
quite hard to read. Newer code started using IS_ERR_VALUE which is much
easier to read. Convert remaining users of offset_in_page as well.
[mhocko@suse.com: rewrite changelog]
[mhocko@kernel.org: fix mremap.c and uprobes.c sites also]
Link: http://lkml.kernel.org/r/20191012102512.28051-1-pugaowei@gmail.com
Signed-off-by: Gaowei Pu <pugaowei@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __anon_vma_prepare(), we will try to find anon_vma if it is possible to
reuse it. While on fork, the logic is different.
Since commit 5beb493052 ("mm: change anon_vma linking to fix
multi-process server scalability issue"), function anon_vma_clone() tries
to allocate new anon_vma for child process. But the logic here will
allocate a new anon_vma for each vma, even in parent this vma is mergeable
and share the same anon_vma with its sibling. This may do better for
scalability issue, while it is not necessary to do so especially after
interval tree is used.
Commit 7a3ef208e6 ("mm: prevent endless growth of anon_vma hierarchy")
tries to reuse some anon_vma by counting child anon_vma and attached vmas.
While for those mergeable anon_vmas, we can just reuse it and not
necessary to go through the logic.
After this change, kernel build test reduces 20% anon_vma allocation.
Do the same kernel build test, it shows run time in sys reduced 11.6%.
Origin:
real 2m50.467s
user 17m52.002s
sys 1m51.953s
real 2m48.662s
user 17m55.464s
sys 1m50.553s
real 2m51.143s
user 17m59.687s
sys 1m53.600s
Patched:
real 2m39.933s
user 17m1.835s
sys 1m38.802s
real 2m39.321s
user 17m1.634s
sys 1m39.206s
real 2m39.575s
user 17m1.420s
sys 1m38.845s
Link: http://lkml.kernel.org/r/20191011072256.16275-2-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before commit 7a3ef208e6 ("mm: prevent endless growth of anon_vma
hierarchy"), anon_vma_clone() doesn't change dst->anon_vma. While after
this commit, anon_vma_clone() will try to reuse an exist one on forking.
But this commit go a little bit further for the case not forking.
anon_vma_clone() is called from __vma_split(), __split_vma(), copy_vma()
and anon_vma_fork(). For the first three places, the purpose here is
get a copy of src and we don't expect to touch dst->anon_vma even it is
NULL.
While after that commit, it is possible to reuse an anon_vma when
dst->anon_vma is NULL. This is not we intend to have.
This patch stops reuse of anon_vma for non-fork cases.
Link: http://lkml.kernel.org/r/20191011072256.16275-1-richardw.yang@linux.intel.com
Fixes: 7a3ef208e6 ("mm: prevent endless growth of anon_vma hierarchy")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we use rb_parent to get next, while this is not necessary.
When prev is NULL, this means vma should be the first element in the list.
Then next should be current first one (mm->mmap), no matter whether we
have parent or not.
After removing it, the code shows the beauty of symmetry.
Link: http://lkml.kernel.org/r/20190813032656.16625-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just make the code a little easier to read.
Link: http://lkml.kernel.org/r/20191006012636.31521-3-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The third parameter of __vma_unlink_common() could differentiate these two
types. __vma_unlink_prev() is not necessary now.
Link: http://lkml.kernel.org/r/20191006012636.31521-2-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently __vma_unlink_common handles two cases:
* has_prev
* or not
When has_prev is false, it is obvious prev is calculated from
vma->vm_prev in __vma_unlink_common.
When has_prev is true, the prev is passed through from __vma_unlink_prev
in __vma_adjust for non-case 8. And at the beginning next is calculated
from vma->vm_next, which implies vma is next->vm_prev.
The above statement sounds a little complicated, while to think in
another point of view, no matter whether vma and next is swapped, the
mmap link list still preserves its property. It is proper to access
vma->vm_prev.
Link: http://lkml.kernel.org/r/20191006012636.31521-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a very slow operation. Right now POSIX_FADV_DONTNEED is the top
user because it has to freeze page references when removing it from the
cache. invalidate_bdev() calls it for the same reason. Both are
triggered from userspace, so it's easy to generate a storm.
mlock/mlockall no longer calls lru_add_drain_all - I've seen here
serious slowdown on older kernels.
There are some less obvious paths in memory migration/CMA/offlining
which shouldn't call frequently.
The worst case requires a non-trivial workload because
lru_add_drain_all() skips cpus where vectors are empty. Something must
constantly generate a flow of pages for each cpu. Also cpus must be
busy to make scheduling per-cpu works slower. And the machine must be
big enough (64+ cpus in our case).
In our case that was a massive series of mlock calls in map-reduce while
other tasks write logs (and generates flows of new pages in per-cpu
vectors). Mlock calls were serialized by mutex and accumulated latency
up to 10 seconds or more.
The kernel does not call lru_add_drain_all on mlock paths since 4.15,
but the same scenario could be triggered by fadvise(POSIX_FADV_DONTNEED)
or any other remaining user.
There is no reason to do the drain again if somebody else already
drained all the per-cpu vectors while we waited for the lock.
Piggyback on a drain starting and finishing while we wait for the lock:
all pages pending at the time of our entry were drained from the
vectors.
Callers like POSIX_FADV_DONTNEED retry their operations once after
draining per-cpu vectors when pages have unexpected references.
Link: http://lkml.kernel.org/r/157019456205.3142.3369423180908482020.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The upper level of "if" makes sure (end >= next->vm_end), which means
there are only two possibilities:
1) end == next->vm_end
2) end > next->vm_end
remove_next is assigned to be (1 + end > next->vm_end). This means if
remove_next is 1, end must equal to next->vm_end.
The VM_WARN_ON will never trigger.
Link: http://lkml.kernel.org/r/20190912063126.13250-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a process updates the RSS of a different process, the rss_stat
tracepoint appears in the context of the process doing the update. This
can confuse userspace that the RSS of process doing the update is
updated, while in reality a different process's RSS was updated.
This issue happens in reclaim paths such as with direct reclaim or
background reclaim.
This patch adds more information to the tracepoint about whether the mm
being updated belongs to the current process's context (curr field). We
also include a hash of the mm pointer so that the process who the mm
belongs to can be uniquely identified (mm_id field).
Also vsprintf.c is refactored a bit to allow reuse of hashing code.
[akpm@linux-foundation.org: remove unused local `str']
[joelaf@google.com: inline call to ptr_to_hashval]
Link: http://lore.kernel.org/r/20191113153816.14b95acd@gandalf.local.home
Link: http://lkml.kernel.org/r/20191114164622.GC233237@google.com
Link: http://lkml.kernel.org/r/20191106024452.81923-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Ioannis Ilkos <ilkos@google.com>
Acked-by: Petr Mladek <pmladek@suse.com> [lib/vsprintf.c]
Cc: Tim Murray <timmurray@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Carmen Jackson <carmenjackson@google.com>
Cc: Mayank Gupta <mayankgupta@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Useful to track how RSS is changing per TGID to detect spikes in RSS and
memory hogs. Several Android teams have been using this patch in
various kernel trees for half a year now. Many reported to me it is
really useful so I'm posting it upstream.
Initial patch developed by Tim Murray. Changes I made from original
patch: o Prevent any additional space consumed by mm_struct.
Regarding the fact that the RSS may change too often thus flooding the
traces - note that, there is some "hysterisis" with this already. That
is - We update the counter only if we receive 64 page faults due to
SPLIT_RSS_ACCOUNTING. However, during zapping or copying of pte range,
the RSS is updated immediately which can become noisy/flooding. In a
previous discussion, we agreed that BPF or ftrace can be used to rate
limit the signal if this becomes an issue.
Also note that I added wrappers to trace_rss_stat to prevent compiler
errors where linux/mm.h is included from tracing code, causing errors
such as:
CC kernel/trace/power-traces.o
In file included from ./include/trace/define_trace.h:102,
from ./include/trace/events/kmem.h:342,
from ./include/linux/mm.h:31,
from ./include/linux/ring_buffer.h:5,
from ./include/linux/trace_events.h:6,
from ./include/trace/events/power.h:12,
from kernel/trace/power-traces.c:15:
./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type
struct trace_entry ent; \
Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org
Link: http://lkml.kernel.org/r/20191001172817.234886-1-joel@joelfernandes.org
Co-developed-by: Tim Murray <timmurray@google.com>
Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Carmen Jackson <carmenjackson@google.com>
Cc: Mayank Gupta <mayankgupta@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of our services is observing hanging ps/top/etc under heavy write
IO, and the task states show this is an mmap_sem priority inversion:
A write fault is holding the mmap_sem in read-mode and waiting for
(heavily cgroup-limited) IO in balance_dirty_pages():
balance_dirty_pages+0x724/0x905
balance_dirty_pages_ratelimited+0x254/0x390
fault_dirty_shared_page.isra.96+0x4a/0x90
do_wp_page+0x33e/0x400
__handle_mm_fault+0x6f0/0xfa0
handle_mm_fault+0xe4/0x200
__do_page_fault+0x22b/0x4a0
page_fault+0x45/0x50
Somebody tries to change the address space, contending for the mmap_sem in
write-mode:
call_rwsem_down_write_failed_killable+0x13/0x20
do_mprotect_pkey+0xa8/0x330
SyS_mprotect+0xf/0x20
do_syscall_64+0x5b/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The waiting writer locks out all subsequent readers to avoid lock
starvation, and several threads can be seen hanging like this:
call_rwsem_down_read_failed+0x14/0x30
proc_pid_cmdline_read+0xa0/0x480
__vfs_read+0x23/0x140
vfs_read+0x87/0x130
SyS_read+0x42/0x90
do_syscall_64+0x5b/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
To fix this, do what we do for cache read faults already: drop the
mmap_sem before calling into anything IO bound, in this case the
balance_dirty_pages() function, and return VM_FAULT_RETRY.
Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 1ba6fc9af3 ("mm: vmscan: do not share cgroup iteration
between reclaimers"), the memcg reclaim does not bail out earlier based
on sc->nr_reclaimed and will traverse all the nodes. All the
reclaimable pages of the memcg on all the nodes will be scanned relative
to the reclaim priority. So, there is no need to maintain state
regarding which node to start the memcg reclaim from.
This patch effectively reverts the commit 889976dbcb ("memcg: reclaim
memory from nodes in round-robin order") and commit 453a9bf347
("memcg: fix numa scan information update to be triggered by memory
event").
[shakeelb@google.com: v2]
Link: http://lkml.kernel.org/r/20191030204232.139424-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20191029234753.224143-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setting a memory.high limit below the usage makes almost no effort to
shrink the cgroup to the new target size.
While memory.high is a "soft" limit that isn't supposed to cause OOM
situations, we should still try harder to meet a user request through
persistent reclaim.
For example, after setting a 10M memory.high on an 800M cgroup full of
file cache, the usage shrinks to about 350M:
+ cat /cgroup/workingset/memory.current
841568256
+ echo 10M
+ cat /cgroup/workingset/memory.current
355729408
This isn't exactly what the user would expect to happen. Setting the
value a few more times eventually whittles the usage down to what we
are asking for:
+ echo 10M
+ cat /cgroup/workingset/memory.current
104181760
+ echo 10M
+ cat /cgroup/workingset/memory.current
31801344
+ echo 10M
+ cat /cgroup/workingset/memory.current
10440704
To improve this, add reclaim retry loops to the memory.high write()
callback, similar to what we do for memory.max, to make a reasonable
effort that the usage meets the requested size after the call returns.
Afterwards, a single write() to memory.high is enough in all but extreme
cases:
+ cat /cgroup/workingset/memory.current
841609216
+ echo 10M
+ cat /cgroup/workingset/memory.current
10182656
790M is not a reasonable reclaim target to ask of a single reclaim
invocation. And it wouldn't be reasonable to optimize the reclaim code
for it. So asking for the full size but retrying is not a bad choice
here: we express our intent, and benefit if reclaim becomes better at
handling larger requests, but we also acknowledge that some of the
deltas we can encounter in memory_high_write() are just too ridiculously
big for a single reclaim invocation to manage.
Link: http://lkml.kernel.org/r/20191022201518.341216-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the reclaim loop in memory_max_write() is ^C'd or similar, we set err
to -EINTR. But we don't return err. Once the limit is set, we always
return success (nbytes). Delete the dead code.
Link: http://lkml.kernel.org/r/20191022201518.341216-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mem_cgroup_reclaim_cookie is only used in memcg softlimit reclaim now,
and the priority of the reclaim is always 0. We don't need to define the
iter in struct mem_cgroup_per_node as an array any more. That could make
the code more clear and save some space.
Link: http://lkml.kernel.org/r/1569897728-1686-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A zoned block device consists of a number of zones. Zones are either
conventional and accepting random writes or sequential and requiring
that writes be issued in LBA order from each zone write pointer
position. For the write restriction, zoned block devices are not
suitable for a swap device. Disallow swapon on them.
[akpm@linux-foundation.org: reflow and reword comment, per Christoph]
Link: http://lkml.kernel.org/r/20191015085814.637837-1-naohiro.aota@wdc.com
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix comments of __get_user_pages() and get_user_pages_remote(), make
them more clear.
Link: http://lkml.kernel.org/r/1572443533-3118-1-git-send-email-liuxiang_1999@126.com
Signed-off-by: Liu Xiang <liuxiang_1999@126.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
check_and_migrate_cma_pages() was recording the result of
__get_user_pages_locked() in an unsigned "nr_pages" variable.
Because __get_user_pages_locked() returns a signed value that can
include negative errno values, this had the effect of hiding errors.
Change check_and_migrate_cma_pages() implementation so that it uses a
signed variable instead, and propagates the results back to the caller
just as other gup internal functions do.
This was discovered with the help of unsigned_lesser_than_zero.cocci.
Link: http://lkml.kernel.org/r/1571671030-58029-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
generic_file_direct_write() tries to invalidate pagecache after O_DIRECT
write. Unlike to similar code in dio_complete() this silently ignores
error returned from invalidate_inode_pages2_range().
According to comment this code here because not all filesystems call
dio_complete() to do proper invalidation after O_DIRECT write. Noticeable
example is a blkdev_direct_IO().
This patch calls dio_warn_stale_pagecache() if invalidation fails.
Link: http://lkml.kernel.org/r/157270038294.4812.2238891109785106069.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This helper prints warning if direct I/O write failed to invalidate cache,
and set EIO at inode to warn usersapce about possible data corruption.
See also commit 5a9d929d6e ("iomap: report collisions between directio
and buffered writes to userspace").
Direct I/O is supported by non-disk filesystems, for example NFS. Thus
generic code needs this even in kernel without CONFIG_BLOCK.
Link: http://lkml.kernel.org/r/157270038074.4812.7980855544557488880.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
generic_file_direct_write() invalidates cache at entry. Second time this
should be done when request completes. But this function calls second
invalidation at exit unconditionally even for async requests.
This patch skips second invalidation for async requests (-EIOCBQUEUED).
Link: http://lkml.kernel.org/r/157270037850.4812.15036239021726025572.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function doesn't need to return any value, and the check can be done
in one pass.
There is a behavior change: before the patch, we stop at the first invalid
free object; after the patch, we stop at the first invalid object, free or
in use. This shouldn't matter because the original behavior isn't
intended anyway.
Link: http://lkml.kernel.org/r/20191108193958.205102-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The type of local variable *type* of new_kmalloc_cache() should be enum
kmalloc_cache_type instead of int, so correct it.
Link: http://lkml.kernel.org/r/1569241648-26908-4-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The size of kmalloc can be obtained from kmalloc_info[], so remove
kmalloc_size() that will not be used anymore.
Link: http://lkml.kernel.org/r/1569241648-26908-3-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
and KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
generated by kmalloc_cache_name().
Patch1 predefines the names of all types of kmalloc to save
the time spent dynamically generating names.
These changes make sense, and the time spent by new_kmalloc_cache()
has been reduced by approximately 36.3%.
Time spent by new_kmalloc_cache()
(CPU cycles)
5.3-rc7 66264
5.3-rc7+patch 42188
This patch (of 3):
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
kmalloc_cache_name().
This patch predefines the names of all types of kmalloc to save the time
spent dynamically generating names.
Besides, remove the kmalloc_cache_name() that is no longer used.
Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is another round of bug fixing and cleanup. This time the focus is on
the driver pattern to use mmu notifiers to monitor a VA range. This code
is lifted out of many drivers and hmm_mirror directly into the
mmu_notifier core and written using the best ideas from all the driver
implementations.
This removes many bugs from the drivers and has a very pleasing
diffstat. More drivers can still be converted, but that is for another
cycle.
- A shared branch with RDMA reworking the RDMA ODP implementation
- New mmu_interval_notifier API. This is focused on the use case of
monitoring a VA and simplifies the process for drivers
- A common seq-count locking scheme built into the mmu_interval_notifier
API usable by drivers that call get_user_pages() or hmm_range_fault()
with the VA range
- Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen GntDev
drivers to the new API. This deletes a lot of wonky driver code.
- Two improvements for hmm_range_fault(), from testing done by Ralph
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl3cCjQACgkQOG33FX4g
mxpp8xAAiR9iOdT28m/tx1GF31XludrMhRZVIiz0vmCIxIiAkWekWEfAEVm9PDnh
wdrxTJohSs+B65AK3sfToOM3AIuNCuFVWmbbHI5qmOO76vaSvcZa905Z++pNsawO
Bn8mgRCprYoFHcxWLvTvnA5U0g1S2BSSOwBSZI43CbEnVvHjYAR6MnvRqfGMk+NF
bf8fTk/x+fl0DCemhynlBLuJkogzoE2Hgl0yPY5bFna4PktOxdpa1yPaQsiqZ7e6
2s2NtM3pbMBJk0W42q5BU+aPhiqfxFFszasPSLBduXrD2xDsG76HJdHj5VydKmfL
nelG4BvqJozXTEZWvTEePYhCqaZ41eJZ7Asw8BXtmacVqE5mDlTXo/Zdgbz7yEOR
mI5MVyjD5rauZJldUOWXbwrPoWVFRvboauehiSgqvxvT9HvlFp9GKObSuu4gubBQ
mzxs4t48tPhA7bswLmw0/pETSogFuVDfaB7hsyY0gi8EwxMFMpw2qFypm1PEEF+C
BuUxCSShzvNKrraNe5PWaNNFd3AzIwAOWJHE+poH4bCoXQVr5nA+rq2gnHkdY5vq
/xrBCyxkf0U05YoFGYembPVCInMehzp9Xjy8V+SueSvCg2/TYwGDCgGfsbe9dNOP
Bc40JpS7BDn5w9nyLUJmOx7jfruNV6kx1QslA7NDDrB/rzOlsEc=
=Hj8a
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This is another round of bug fixing and cleanup. This time the focus
is on the driver pattern to use mmu notifiers to monitor a VA range.
This code is lifted out of many drivers and hmm_mirror directly into
the mmu_notifier core and written using the best ideas from all the
driver implementations.
This removes many bugs from the drivers and has a very pleasing
diffstat. More drivers can still be converted, but that is for another
cycle.
- A shared branch with RDMA reworking the RDMA ODP implementation
- New mmu_interval_notifier API. This is focused on the use case of
monitoring a VA and simplifies the process for drivers
- A common seq-count locking scheme built into the
mmu_interval_notifier API usable by drivers that call
get_user_pages() or hmm_range_fault() with the VA range
- Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
GntDev drivers to the new API. This deletes a lot of wonky driver
code.
- Two improvements for hmm_range_fault(), from testing done by Ralph"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
mm/hmm: make full use of walk_page_range()
xen/gntdev: use mmu_interval_notifier_insert
mm/hmm: remove hmm_mirror and related
drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
drm/amdgpu: Call find_vma under mmap_sem
nouveau: use mmu_interval_notifier instead of hmm_mirror
nouveau: use mmu_notifier directly for invalidate_range_start
drm/radeon: use mmu_interval_notifier_insert
RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
RDMA/odp: Use mmu_interval_notifier_insert()
mm/hmm: define the pre-processor related parts of hmm.h even if disabled
mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
mm/mmu_notifier: add an interval tree notifier
mm/mmu_notifier: define the header pre-processor parts even if disabled
mm/hmm: allow snapshot of the special zero page
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJd4HA/AAoJEAx081l5xIa+wKUP/RrFElDUUIsYZ0M8i5FGSVdZ
dD8zL3z0NHxWu+UwieAWrIKU69k+wYwsnFFwHsAxOXflmcA1UeBTqcMLgb5tECcq
s2bNqd97XAjbDzreUIeHL3vrSNIeMg8m4Rpo7mdtC3DphoBEodYeRsAez7Ml42Ey
J0oBdWgcBt+sC6iq6f6dEBZmNjYtx/m5L+FX35jRCHAfthjD7AbccK5eK3++Vtni
Af2SZKMQr1uwjP6+ottx89LdpXCrQc2vNHm3nCnDxtLV8PgzXzXiSno0aqc3zk81
J2D/7pb0q561gOBmvW+XfkzMQm/PXtLvhP25iD2dKRDhDjeRgmWLS84J1JePoFQ2
1NALHyuYIRUKjVUEXlq9bx4pMci40/ifM2EaRW+TjhsdmmjA4bbni5O6UW8lhrvb
Ji/bL/LBpmrX0DxKis1xh0iJBdW/aEltVcLqHNPXs0SvkeiMNW0frJss5KP9Yp9G
n8BRS4HahAbNoKUsa9UPcwOAjrY6KwqnKV+PShTly70Po2JhpWISejcTMicIdmgZ
xiXWzq5SB7ZmXzmmEOL5qZzY6iMj0QZYfvVziehwCLiwNVyc2Rpt0tZwLceW/Zfn
Z3sFRUH19pIpQogErtiYiBhw9v481qCUy2ooeEj5YTwiLwGAyJ1BvGwSlD91mdbd
0RiixbdVNBrmaS7WfHCo
=G9TA
-----END PGP SIGNATURE-----
Merge tag 'drm-vmwgfx-coherent-2019-11-29' of git://anongit.freedesktop.org/drm/drm
Pull drm coherent memory support for vmwgfx from Dave Airlie:
"This is a separate pull for the mm pagewalking + drm/vmwgfx work
Thomas did and you were involved in, I've left it separate in case you
don't feel as comfortable with it as the other stuff.
It has mm acks/r-b in the right places from what I can see"
* tag 'drm-vmwgfx-coherent-2019-11-29' of git://anongit.freedesktop.org/drm/drm:
drm/vmwgfx: Add surface dirty-tracking callbacks
drm/vmwgfx: Implement an infrastructure for read-coherent resources
drm/vmwgfx: Use an RBtree instead of linked list for MOB resources
drm/vmwgfx: Implement an infrastructure for write-coherent resources
mm: Add write-protect and clean utilities for address space ranges
mm: Add a walk_page_mapping() function to the pagewalk code
mm: pagewalk: Take the pagetable lock in walk_pte_range()
mm: Remove BUG_ON mmap_sem not held from xxx_trans_huge_lock()
drm/ttm: Convert vm callbacks to helpers
drm/ttm: Remove explicit typecasts of vm_private_data
On PEF-enabled POWER platforms that support running of secure guests,
secure pages of the guest are represented by device private pages
in the host. Such pages needn't participate in KSM merging. This is
achieved by using ksm_madvise() call which need to be exported
since KVM PPC can be a kernel module.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Graphics APIs like OpenGL 4.4 and Vulkan require the graphics driver
to provide coherent graphics memory, meaning that the GPU sees any
content written to the coherent memory on the next GPU operation that
touches that memory, and the CPU sees any content written by the GPU
to that memory immediately after any fence object trailing the GPU
operation is signaled.
Paravirtual drivers that otherwise require explicit synchronization
needs to do this by hooking up dirty tracking to pagefault handlers
and buffer object validation.
Provide mm helpers needed for this and that also allow for huge pmd-
and pud entries (patch 1-3), and the associated vmwgfx code (patch 4-7).
The code has been tested and exercised by a tailored version of mesa
where we disable all explicit synchronization and assume graphics memory
is coherent. The performance loss varies of course; a typical number is
around 5%.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Hellstrom <thomas_os@shipmail.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20191113131639.4653-1-thomas_os@shipmail.org
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- A comprehensive rewrite of the robust/PI futex code's exit handling
to fix various exit races. (Thomas Gleixner et al)
- Rework the generic REFCOUNT_FULL implementation using
atomic_fetch_* operations so that the performance impact of the
cmpxchg() loops is mitigated for common refcount operations.
With these performance improvements the generic implementation of
refcount_t should be good enough for everybody - and this got
confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
REFCOUNT_FULL entirely, leaving the generic implementation enabled
unconditionally. (Will Deacon)
- Other misc changes, fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
lkdtm: Remove references to CONFIG_REFCOUNT_FULL
locking/refcount: Remove unused 'refcount_error_report()' function
locking/refcount: Consolidate implementations of refcount_t
locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
locking/refcount: Move saturation warnings out of line
locking/refcount: Improve performance of generic REFCOUNT_FULL code
locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
locking/refcount: Remove unused refcount_*_checked() variants
locking/refcount: Ensure integer operands are treated as signed
locking/refcount: Define constants for saturation and max refcount values
futex: Prevent exit livelock
futex: Provide distinct return value when owner is exiting
futex: Add mutex around futex exit
futex: Provide state handling for exec() as well
futex: Sanitize exit state handling
futex: Mark the begin of futex exit explicitly
futex: Set task::futex_state to DEAD right after handling futex exit
futex: Split futex_mm_release() for exit/exec
exit/exec: Seperate mm_release()
futex: Replace PF_EXITPIDONE with a state
...
Pull networking updates from David Miller:
"Another merge window, another pull full of stuff:
1) Support alternative names for network devices, from Jiri Pirko.
2) Introduce per-netns netdev notifiers, also from Jiri Pirko.
3) Support MSG_PEEK in vsock/virtio, from Matias Ezequiel Vara
Larsen.
4) Allow compiling out the TLS TOE code, from Jakub Kicinski.
5) Add several new tracepoints to the kTLS code, also from Jakub.
6) Support set channels ethtool callback in ena driver, from Sameeh
Jubran.
7) New SCTP events SCTP_ADDR_ADDED, SCTP_ADDR_REMOVED,
SCTP_ADDR_MADE_PRIM, and SCTP_SEND_FAILED_EVENT. From Xin Long.
8) Add XDP support to mvneta driver, from Lorenzo Bianconi.
9) Lots of netfilter hw offload fixes, cleanups and enhancements,
from Pablo Neira Ayuso.
10) PTP support for aquantia chips, from Egor Pomozov.
11) Add UDP segmentation offload support to igb, ixgbe, and i40e. From
Josh Hunt.
12) Add smart nagle to tipc, from Jon Maloy.
13) Support L2 field rewrite by TC offloads in bnxt_en, from Venkat
Duvvuru.
14) Add a flow mask cache to OVS, from Tonghao Zhang.
15) Add XDP support to ice driver, from Maciej Fijalkowski.
16) Add AF_XDP support to ice driver, from Krzysztof Kazimierczak.
17) Support UDP GSO offload in atlantic driver, from Igor Russkikh.
18) Support it in stmmac driver too, from Jose Abreu.
19) Support TIPC encryption and auth, from Tuong Lien.
20) Introduce BPF trampolines, from Alexei Starovoitov.
21) Make page_pool API more numa friendly, from Saeed Mahameed.
22) Introduce route hints to ipv4 and ipv6, from Paolo Abeni.
23) Add UDP segmentation offload to cxgb4, Rahul Lakkireddy"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1857 commits)
libbpf: Fix usage of u32 in userspace code
mm: Implement no-MMU variant of vmalloc_user_node_flags
slip: Fix use-after-free Read in slip_open
net: dsa: sja1105: fix sja1105_parse_rgmii_delays()
macvlan: schedule bc_work even if error
enetc: add support Credit Based Shaper(CBS) for hardware offload
net: phy: add helpers phy_(un)lock_mdio_bus
mdio_bus: don't use managed reset-controller
ax88179_178a: add ethtool_op_get_ts_info()
mlxsw: spectrum_router: Fix use of uninitialized adjacency index
mlxsw: spectrum_router: After underlay moves, demote conflicting tunnels
bpf: Simplify __bpf_arch_text_poke poke type handling
bpf: Introduce BPF_TRACE_x helper for the tracing tests
bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JIT
bpf, testing: Add various tail call test cases
bpf, x86: Emit patchable direct jump as tail call
bpf: Constant map key tracking for prog array pokes
bpf: Add poke dependency tracking for prog array maps
bpf: Add initial poke descriptor table for jit images
bpf: Move owner type, jited info into array auxiliary data
...
- On ARMv8 CPUs without hardware updates of the access flag, avoid
failing cow_user_page() on PFN mappings if the pte is old. The patches
introduce an arch_faults_on_old_pte() macro, defined as false on x86.
When true, cow_user_page() makes the pte young before attempting
__copy_from_user_inatomic().
- Covert the synchronous exception handling paths in
arch/arm64/kernel/entry.S to C.
- FTRACE_WITH_REGS support for arm64.
- ZONE_DMA re-introduced on arm64 to support Raspberry Pi 4
- Several kselftest cases specific to arm64, together with a MAINTAINERS
update for these files (moved to the ARM64 PORT entry).
- Workaround for a Neoverse-N1 erratum where the CPU may fetch stale
instructions under certain conditions.
- Workaround for Cortex-A57 and A72 errata where the CPU may
speculatively execute an AT instruction and associate a VMID with the
wrong guest page tables (corrupting the TLB).
- Perf updates for arm64: additional PMU topologies on HiSilicon
platforms, support for CCN-512 interconnect, AXI ID filtering in the
IMX8 DDR PMU, support for the CCPI2 uncore PMU in ThunderX2.
- GICv3 optimisation to avoid a heavy barrier when accessing the
ICC_PMR_EL1 register.
- ELF HWCAP documentation updates and clean-up.
- SMC calling convention conduit code clean-up.
- KASLR diagnostics printed during boot
- NVIDIA Carmel CPU added to the KPTI whitelist
- Some arm64 mm clean-ups: use generic free_initrd_mem(), remove stale
macro, simplify calculation in __create_pgd_mapping(), typos.
- Kconfig clean-ups: CMDLINE_FORCE to depend on CMDLINE, choice for
endinanness to help with allmodconfig.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl3YJswACgkQa9axLQDI
XvFwYg//aTGhNLew3ADgW2TYal7LyqetRROixPBrzqHLu2A8No1+QxHMaKxpZVyf
pt25tABuLtPHql3qBzE0ltmfbLVsPj/3hULo404EJb9HLRfUnVGn7gcPkc+p4YAr
IYkYPXJbk6OlJ84vI+4vXmDEF12bWCqamC9qZ+h99qTpMjFXFO17DSJ7xQ8Xic3A
HHgCh4uA7gpTVOhLxaS6KIw+AZNYwvQxLXch2+wj6agbGX79uw9BeMhqVXdkPq8B
RTDJpOdS970WOT4cHWOkmXwsqqGRqgsgyu+bRUJ0U72+0y6MX0qSHIUnVYGmNc5q
Dtox4rryYLvkv/hbpkvjgVhv98q3J1mXt/CalChWB5dG4YwhJKN2jMiYuoAvB3WS
6dR7Dfupgai9gq1uoKgBayS2O6iFLSa4g58vt3EqUBqmM7W7viGFPdLbuVio4ycn
CNF2xZ8MZR6Wrh1JfggO7Hc11EJdSqESYfHO6V/pYB4pdpnqJLDoriYHXU7RsZrc
HvnrIvQWKMwNbqBvpNbWvK5mpBMMX2pEienA3wOqKNH7MbepVsG+npOZTVTtl9tN
FL0ePb/mKJu/2+gW8ntiqYn7EzjKprRmknOiT2FjWWo0PxgJ8lumefuhGZZbaOWt
/aTAeD7qKd/UXLKGHF/9v3q4GEYUdCFOXP94szWVPyLv+D9h8L8=
=TPL9
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Apart from the arm64-specific bits (core arch and perf, new arm64
selftests), it touches the generic cow_user_page() (reviewed by
Kirill) together with a macro for x86 to preserve the existing
behaviour on this architecture.
Summary:
- On ARMv8 CPUs without hardware updates of the access flag, avoid
failing cow_user_page() on PFN mappings if the pte is old. The
patches introduce an arch_faults_on_old_pte() macro, defined as
false on x86. When true, cow_user_page() makes the pte young before
attempting __copy_from_user_inatomic().
- Covert the synchronous exception handling paths in
arch/arm64/kernel/entry.S to C.
- FTRACE_WITH_REGS support for arm64.
- ZONE_DMA re-introduced on arm64 to support Raspberry Pi 4
- Several kselftest cases specific to arm64, together with a
MAINTAINERS update for these files (moved to the ARM64 PORT entry).
- Workaround for a Neoverse-N1 erratum where the CPU may fetch stale
instructions under certain conditions.
- Workaround for Cortex-A57 and A72 errata where the CPU may
speculatively execute an AT instruction and associate a VMID with
the wrong guest page tables (corrupting the TLB).
- Perf updates for arm64: additional PMU topologies on HiSilicon
platforms, support for CCN-512 interconnect, AXI ID filtering in
the IMX8 DDR PMU, support for the CCPI2 uncore PMU in ThunderX2.
- GICv3 optimisation to avoid a heavy barrier when accessing the
ICC_PMR_EL1 register.
- ELF HWCAP documentation updates and clean-up.
- SMC calling convention conduit code clean-up.
- KASLR diagnostics printed during boot
- NVIDIA Carmel CPU added to the KPTI whitelist
- Some arm64 mm clean-ups: use generic free_initrd_mem(), remove
stale macro, simplify calculation in __create_pgd_mapping(), typos.
- Kconfig clean-ups: CMDLINE_FORCE to depend on CMDLINE, choice for
endinanness to help with allmodconfig"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64: Kconfig: add a choice for endianness
kselftest: arm64: fix spelling mistake "contiguos" -> "contiguous"
arm64: Kconfig: make CMDLINE_FORCE depend on CMDLINE
MAINTAINERS: Add arm64 selftests to the ARM64 PORT entry
arm64: kaslr: Check command line before looking for a seed
arm64: kaslr: Announce KASLR status on boot
kselftest: arm64: fake_sigreturn_misaligned_sp
kselftest: arm64: fake_sigreturn_bad_size
kselftest: arm64: fake_sigreturn_duplicated_fpsimd
kselftest: arm64: fake_sigreturn_missing_fpsimd
kselftest: arm64: fake_sigreturn_bad_size_for_magic0
kselftest: arm64: fake_sigreturn_bad_magic
kselftest: arm64: add helper get_current_context
kselftest: arm64: extend test_init functionalities
kselftest: arm64: mangle_pstate_invalid_mode_el[123][ht]
kselftest: arm64: mangle_pstate_invalid_daif_bits
kselftest: arm64: mangle_pstate_invalid_compat_toggle and common utils
kselftest: arm64: extend toplevel skeleton Makefile
drivers/perf: hisi: update the sccl_id/ccl_id for certain HiSilicon platform
arm64: mm: reserve CMA and crashkernel in ZONE_DMA32
...
To fix build with !CONFIG_MMU, implement it for no-MMU configurations as well.
Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20191123220835.1237773-1-andriin@fb.com
These two functions have never been used since they were added.
Link: https://lore.kernel.org/r/20191113134528.21187-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
hmm_range_fault() calls find_vma() and walk_page_range() in a loop. This
is unnecessary duplication since walk_page_range() calls find_vma() in a
loop already.
Simplify hmm_range_fault() by defining a walk_test() callback function to
filter unhandled vmas.
This also fixes a bug where hmm_range_fault() was not checking start >=
vma->vm_start before checking vma->vm_flags so hmm_range_fault() could
return an error based on the wrong vma for the requested range.
It also fixes a bug when the vma has no read access and the caller did not
request a fault, there shouldn't be any error return code.
Link: https://lore.kernel.org/r/20191104222141.5173-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The only two users of this are now converted to use mmu_interval_notifier,
delete all the code and update hmm.rst.
Link: https://lore.kernel.org/r/20191112202231.3856-14-jgg@ziepe.ca
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
hmm_mirror's handling of ranges does not use a sequence count which
results in this bug:
CPU0 CPU1
hmm_range_wait_until_valid(range)
valid == true
hmm_range_fault(range)
hmm_invalidate_range_start()
range->valid = false
hmm_invalidate_range_end()
range->valid = true
hmm_range_valid(range)
valid == true
Where the hmm_range_valid() should not have succeeded.
Adding the required sequence count would make it nearly identical to the
new mmu_interval_notifier. Instead replace the hmm_mirror stuff with
mmu_interval_notifier.
Co-existence of the two APIs is the first step.
Link: https://lore.kernel.org/r/20191112202231.3856-4-jgg@ziepe.ca
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Of the 13 users of mmu_notifiers, 8 of them use only
invalidate_range_start/end() and immediately intersect the
mmu_notifier_range with some kind of internal list of VAs. 4 use an
interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
of some kind (scif_dma, vhost, gntdev, hmm)
And the remaining 5 either don't use invalidate_range_start() or do some
special thing with it.
It turns out that building a correct scheme with an interval tree is
pretty complicated, particularly if the use case is synchronizing against
another thread doing get_user_pages(). Many of these implementations have
various subtle and difficult to fix races.
This approach puts the interval tree as common code at the top of the mmu
notifier call tree and implements a shareable locking scheme.
It includes:
- An interval tree tracking VA ranges, with per-range callbacks
- A read/write locking scheme for the interval tree that avoids
sleeping in the notifier path (for OOM killer)
- A sequence counter based collision-retry locking scheme to tell
device page fault that a VA range is being concurrently invalidated.
This is based on various ideas:
- hmm accumulates invalidated VA ranges and releases them when all
invalidates are done, via active_invalidate_ranges count.
This approach avoids having to intersect the interval tree twice (as
umem_odp does) at the potential cost of a longer device page fault.
- kvm/umem_odp use a sequence counter to drive the collision retry,
via invalidate_seq
- a deferred work todo list on unlock scheme like RTNL, via deferred_list.
This makes adding/removing interval tree members more deterministic
- seqlock, except this version makes the seqlock idea multi-holder on the
write side by protecting it with active_invalidate_ranges and a spinlock
To minimize MM overhead when only the interval tree is being used, the
entire SRCU and hlist overheads are dropped using some simple
branches. Similarly the interval tree overhead is dropped when in hlist
mode.
The overhead from the mandatory spinlock is broadly the same as most of
existing users which already had a lock (or two) of some sort on the
invalidation path.
Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
Acked-by: Christian König <christian.koenig@amd.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock
from commit c8183f5489 ("s390/qeth: fix potential deadlock on
workqueue flush"), removed the code which was removed by commit
9897d583b0 ("s390/qeth: consolidate some duplicated HW cmd code").
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
remove_stable_node() when it races with __mmput() and squeezes in
between ksm_exit() and exit_mmap().
WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150
Call Trace:
remove_all_stable_nodes+0x12b/0x330
run_store+0x4ef/0x7b0
kernfs_fop_write+0x200/0x420
vfs_write+0x154/0x450
ksys_write+0xf9/0x1d0
do_syscall_64+0x99/0x510
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Remove the warning as there is nothing scary going on.
Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
Fixes: cbf86cfe04 ("ksm: remove old stable nodes more thoroughly")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
We should never try to touch the memmap of offline sections where we
could have uninitialized memmaps and could trigger BUGs when calling
page_to_nid() on poisoned pages.
There is no reliable way to distinguish an uninitialized memmap from an
initialized memmap that belongs to ZONE_DEVICE, as we don't have
anything like SECTION_IS_ONLINE we can use similar to
pfn_to_online_section() for !ZONE_DEVICE memory.
E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
and will therefore never set a ZONE_DEVICE zone consecutive. Stopping
to shrink the ZONE_DEVICE therefore results in no observable changes,
besides /proc/zoneinfo indicating different boundaries - something we
can totally live with.
Before commit d0dc12e86b ("mm/memory_hotplug: optimize memory
hotplug"), the memmap was initialized with 0 and the node with the right
value. So the zone might be wrong but not garbage. After that commit,
both the zone and the node will be garbage when touching uninitialized
memmaps.
Toshiki reported a BUG (race between delayed initialization of
ZONE_DEVICE memmaps without holding the memory hotplug lock and
concurrent zone shrinking).
https://lkml.org/lkml/2019/11/14/1040
"Iteration of create and destroy namespace causes the panic as below:
kernel BUG at mm/page_alloc.c:535!
CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
Call Trace:
memmap_init_zone_device+0x165/0x17c
memremap_pages+0x4c1/0x540
devm_memremap_pages+0x1d/0x60
pmem_attach_disk+0x16b/0x600 [nd_pmem]
nvdimm_bus_probe+0x69/0x1c0
really_probe+0x1c2/0x3e0
driver_probe_device+0xb4/0x100
device_driver_attach+0x4f/0x60
bind_store+0xc9/0x110
kernfs_fop_write+0x116/0x190
vfs_write+0xa5/0x1a0
ksys_write+0x59/0xd0
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
While creating a namespace and initializing memmap, if you destroy the
namespace and shrink the zone, it will initialize the memmap outside
the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
pfn), page) in set_pfnblock_flags_mask()."
This BUG is also mitigated by this commit, where we for now stop to
shrink the ZONE_DEVICE zone until we can do it in a safe and clean way.
Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Damian Tometzki <damian.tometzki@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Rich Felker <dalias@libc.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-11-20
The following pull-request contains BPF updates for your *net-next* tree.
We've added 81 non-merge commits during the last 17 day(s) which contain
a total of 120 files changed, 4958 insertions(+), 1081 deletions(-).
There are 3 trivial conflicts, resolve it by always taking the chunk from
196e8ca748:
<<<<<<< HEAD
=======
void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
>>>>>>> 196e8ca748
<<<<<<< HEAD
void *bpf_map_area_alloc(u64 size, int numa_node)
=======
static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
>>>>>>> 196e8ca748
<<<<<<< HEAD
if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
=======
/* kmalloc()'ed memory can't be mmap()'ed */
if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>>>>> 196e8ca748
The main changes are:
1) Addition of BPF trampoline which works as a bridge between kernel functions,
BPF programs and other BPF programs along with two new use cases: i) fentry/fexit
BPF programs for tracing with practically zero overhead to call into BPF (as
opposed to k[ret]probes) and ii) attachment of the former to networking related
programs to see input/output of networking programs (covering xdpdump use case),
from Alexei Starovoitov.
2) BPF array map mmap support and use in libbpf for global data maps; also a big
batch of libbpf improvements, among others, support for reading bitfields in a
relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko.
3) Extend s390x JIT with usage of relative long jumps and loads in order to lift
the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich.
4) Add BPF audit support and emit messages upon successful prog load and unload in
order to have a timeline of events, from Daniel Borkmann and Jiri Olsa.
5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode
(XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson.
6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API
call named bpf_get_link_xdp_info() for retrieving the full set of prog
IDs attached to XDP, from Toke Høiland-Jørgensen.
7) Add BTF support for array of int, array of struct and multidimensional arrays
and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau.
8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo.
9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid
xdping to be run as standalone, from Jiri Benc.
10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song.
11) Fix a memory leak in BPF fentry test run data, from Colin Ian King.
12) Various smaller misc cleanups and improvements mostly all over BPF selftests and
samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ability to memory-map contents of BPF array map. This is extremely useful
for working with BPF global data from userspace programs. It allows to avoid
typical bpf_map_{lookup,update}_elem operations, improving both performance
and usability.
There had to be special considerations for map freezing, to avoid having
writable memory view into a frozen map. To solve this issue, map freezing and
mmap-ing is happening under mutex now:
- if map is already frozen, no writable mapping is allowed;
- if map has writable memory mappings active (accounted in map->writecnt),
map freezing will keep failing with -EBUSY;
- once number of writable memory mappings drops to zero, map freezing can be
performed again.
Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
can't be memory mapped either.
For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
to be mmap()'able. We also need to make sure that array data memory is
page-sized and page-aligned, so we over-allocate memory in such a way that
struct bpf_array is at the end of a single page of memory with array->value
being aligned with the start of the second page. On deallocation we need to
accomodate this memory arrangement to free vmalloc()'ed memory correctly.
One important consideration regarding how memory-mapping subsystem functions.
Memory-mapping subsystem provides few optional callbacks, among them open()
and close(). close() is called for each memory region that is unmapped, so
that users can decrease their reference counters and free up resources, if
necessary. open() is *almost* symmetrical: it's called for each memory region
that is being mapped, **except** the very first one. So bpf_map_mmap does
initial refcnt bump, while open() will do any extra ones after that. Thus
number of close() calls is equal to number of open() calls plus one more.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
This blacklists several compilation units from KCSAN. See the respective
inline comments for the reasoning.
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
PageAnon() and PageKsm() use the low two bits of the page->mapping
pointer to indicate the page type. PageAnon() only checks the LSB while
PageKsm() checks the least significant 2 bits are equal to 3.
Therefore, PageAnon() is true for KSM pages. __dump_page() incorrectly
will never print "ksm" because it checks PageAnon() first. Fix this by
checking PageKsm() first.
Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com
Fixes: 1c6fb1d89e ("mm: print more information about mapping in __dump_page")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When dumping struct page information, __dump_page() prints the page type
with a trailing blank followed by the page flags on a separate line:
anon
flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
It looks like the intent was to use pr_cont() for printing "flags:" but
pr_cont() usage is discouraged so fix this by extending the format to
include the flags into a single line:
anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
If the page is file backed, the name might be long so use two lines:
shmem_aops name:"dev/zero"
flags: 0x10000000008000c(uptodate|dirty|swapbacked)
Eliminate pr_conf() usage as well for appending compound_mapcount.
Link: http://lkml.kernel.org/r/20191112012608.16926-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following race is observed due to which a processes faulting on a
swap entry, finds the page neither in swapcache nor swap. This causes
zram to give a zero filled page that gets mapped to the process,
resulting in a user space crash later.
Consider parent and child processes Pa and Pb sharing the same swap slot
with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
Virtual address 'VA' of Pa and Pb points to the shared swap entry.
Pa Pb
fault on VA fault on VA
do_swap_page do_swap_page
lookup_swap_cache fails lookup_swap_cache fails
Pb scheduled out
swapin_readahead (deletes zram entry)
swap_free (makes swap_count 1)
Pb scheduled in
swap_readpage (swap_count == 1)
Takes SWP_SYNCHRONOUS_IO path
zram enrty absent
zram gives a zero filled page
Fix this by making sure that swap slot is freed only when swap count
drops down to one.
Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
Fixes: aa8d22a11d ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Suggested-by: Minchan Kim <minchan@google.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
try_offline_node() is pretty much broken right now:
- The node span is updated when onlining memory, not when adding it. We
ignore memory that was mever onlined. Bad.
- We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
trigger a kernel panic. Bad for memory that is offline but also bad
for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
first PFN of a section might contain garbage.
- Sections belonging to mixed nodes are not properly considered.
As memory blocks might belong to multiple nodes, we would have to walk
all pageblocks (or at least subsections) within present sections.
However, we don't have a way to identify whether a memmap that is not
online was initialized (relevant for ZONE_DEVICE). This makes things
more complicated.
Luckily, we can piggy pack on the node span and the nid stored in memory
blocks. Currently, the node span is grown when calling
move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
removing memory, before calling try_offline_node(). Sysfs links are
created via link_mem_sections(), e.g., during boot or when adding
memory.
If the node still spans memory or if any memory block belongs to the
nid, we don't set the node offline. As memory blocks that span multiple
nodes cannot get offlined, the nid stored in memory blocks is reliable
enough (for such online memory blocks, the node still spans the memory).
Introduce for_each_memory_block() to efficiently walk all memory blocks.
Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
when removing ZONE_DEVICE memory to fix similar issues (access of
garbage memmaps) - until we have a reliable way to identify whether
these memmaps were properly initialized. This implies later, that once
a node had ZONE_DEVICE memory, we won't be able to set a node offline -
which should be acceptable.
Since commit f1dd2cd13c ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") memory that is added is not
assoziated with a zone/node (memmap not initialized). The introducing
commit 60a5a19e74 ("memory-hotplug: remove sysfs file of node")
already missed that we could have multiple nodes for a section and that
the zone/node span is updated when onlining pages, not when adding them.
I tested this by hotplugging two DIMMs to a memory-less and cpu-less
NUMA node. The node is properly onlined when adding the DIMMs. When
removing the DIMMs, the node is properly offlined.
Masayoshi Mizuma reported:
: Without this patch, memory hotplug fails as panic:
:
: BUG: kernel NULL pointer dereference, address: 0000000000000000
: ...
: Call Trace:
: remove_memory_block_devices+0x81/0xc0
: try_remove_memory+0xb4/0x130
: __remove_memory+0xa/0x20
: acpi_memory_device_remove+0x84/0x100
: acpi_bus_trim+0x57/0x90
: acpi_bus_trim+0x2e/0x90
: acpi_device_hotplug+0x2b2/0x4d0
: acpi_hotplug_work_fn+0x1a/0x30
: process_one_work+0x171/0x380
: worker_thread+0x49/0x3f0
: kthread+0xf8/0x130
: ret_from_fork+0x35/0x40
[david@redhat.com: v3]
Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
Fixes: 60a5a19e74 ("memory-hotplug: remove sysfs file of node")
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visiable after d0dc12e86b
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In collapse_file(), for !is_shmem case, current check cannot guarantee
the locked page is up-to-date. Specifically, xas_unlock_irq() should
not be called before lock_page() and get_page(); and it is necessary to
recheck PageUptodate() after locking the page.
With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
may contain corrupted data. This is because khugepaged mistakenly
collapses some not up-to-date sub pages into a huge page, and assumes
the huge page is up-to-date. This will NOT corrupt data in the disk,
because the page is read-only and never written back. Fix this by
properly checking PageUptodate() after locking the page. This check
replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".
Also, move PageDirty() check after locking the page. Current khugepaged
should not try to collapse dirty file THP, because it is limited to
read-only .text. The only case we hit a dirty page here is when the
page hasn't been written since write. Bail out and retry when this
happens.
syzbot reported bug on previous version of this patch.
Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1b7e816fc8 ("mm: slub: Fix slab walking for init_on_free")
fixed one problem with the slab walking but missed a key detail: When
walking the list, the head and tail pointers need to be updated since we
end up reversing the list as a result. Without doing this, bulk free is
broken.
One way this is exposed is a NULL pointer with slub_debug=F:
=============================================================================
BUG skbuff_head_cache (Tainted: G T): Object already free
-----------------------------------------------------------------------------
INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201
BUG: kernel NULL pointer dereference, address: 0000000000000000
Oops: 0000 [#1] PREEMPT SMP PTI
RIP: 0010:print_trailer+0x70/0x1d5
Call Trace:
<IRQ>
free_debug_processing.cold.37+0xc9/0x149
__slab_free+0x22a/0x3d0
kmem_cache_free_bulk+0x415/0x420
__kfree_skb_flush+0x30/0x40
net_rx_action+0x2dd/0x480
__do_softirq+0xf0/0x246
irq_exit+0x93/0xb0
do_IRQ+0xa0/0x110
common_interrupt+0xf/0xf
</IRQ>
Given we're now almost identical to the existing debugging code which
correctly walks the list, combine with that.
Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net
Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com
Fixes: 1b7e816fc8 ("mm: slub: Fix slab walking for init_on_free")
Signed-off-by: Laura Abbott <labbott@redhat.com>
Reported-by: Thibaut Sautereau <thibaut.sautereau@clip-os.org>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <clipos@ssi.gouv.fr>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An exiting task might belong to an offline cgroup. In this case an
attempt to grab a cgroup reference from the task can end up with an
infinite loop in hugetlb_cgroup_charge_cgroup(), because neither the
cgroup will become online, neither the task will be migrated to a live
cgroup.
Fix this by switching over to css_tryget(). As css_tryget_online()
can't guarantee that the cgroup won't go offline, in most cases the
check doesn't make sense. In this particular case users of
hugetlb_cgroup_charge_cgroup() are not affected by this change.
A similar problem is described by commit 18fa84a2db ("cgroup: Use
css_tryget() instead of css_tryget_online() in task_get_css()").
Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've encountered a rcu stall in get_mem_cgroup_from_mm():
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
(t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
<...>
RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
<...>
__memcg_kmem_charge+0x55/0x140
__alloc_pages_nodemask+0x267/0x320
pipe_write+0x1ad/0x400
new_sync_write+0x127/0x1c0
__kernel_write+0x4f/0xf0
dump_emit+0x91/0xc0
writenote+0xa0/0xc0
elf_core_dump+0x11af/0x1430
do_coredump+0xc65/0xee0
get_signal+0x132/0x7c0
do_signal+0x36/0x640
exit_to_usermode_loop+0x61/0xd0
do_syscall_64+0xd4/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The problem is caused by an exiting task which is associated with an
offline memcg. We're iterating over and over in the do {} while
(!css_tryget_online()) loop, but obviously the memcg won't become online
and the exiting task won't be migrated to a live memcg.
Let's fix it by switching from css_tryget_online() to css_tryget().
As css_tryget_online() cannot guarantee that the memcg won't go offline,
the check is usually useless, except some rare cases when for example it
determines if something should be presented to a user.
A similar problem is described by commit 18fa84a2db ("cgroup: Use
css_tryget() instead of css_tryget_online() in task_get_css()").
Johannes:
: The bug aside, it doesn't matter whether the cgroup is online for the
: callers. It used to matter when offlining needed to evacuate all charges
: from the memcg, and so needed to prevent new ones from showing up, but we
: don't care now.
Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Shakeel Butt <shakeeb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently, I hit the following issue when running upstream.
kernel BUG at mm/vmscan.c:1521!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
Call Trace:
reclaim_pages+0x499/0x800 mm/vmscan.c:2188
madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
walk_pmd_range mm/pagewalk.c:53 [inline]
walk_pud_range mm/pagewalk.c:112 [inline]
walk_p4d_range mm/pagewalk.c:139 [inline]
walk_pgd_range mm/pagewalk.c:166 [inline]
__walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
walk_page_range+0x179/0x310 mm/pagewalk.c:349
madvise_pageout_page_range mm/madvise.c:506 [inline]
madvise_pageout+0x1f0/0x330 mm/madvise.c:542
madvise_vma mm/madvise.c:931 [inline]
__do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
madvise_pageout() accesses the specified range of the vma and isolates
them, then runs shrink_page_list() to reclaim its memory. But it also
isolates the unevictable pages to reclaim. Hence, we can catch the
cases in shrink_page_list().
The root cause is that we scan the page tables instead of specific LRU
list. and so we need to filter out the unevictable lru pages from our
end.
Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
Fixes: 1a4e58cce8 ("mm: introduce MADV_PAGEOUT")
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit d883544515 ("mm: mempolicy: make the behavior consistent when
MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
of mbind() for a couple of corner cases. But, it altered the errno for
some other cases, for example, mbind() should return -EFAULT when part
or all of the memory range specified by nodemask and maxnode points
outside your accessible address space, or there was an unmapped hole in
the specified memory range specified by addr and len.
Fix this by preserving the errno returned by queue_pages_range(). And,
the pagelist may be not empty even though queue_pages_range() returns
error, put the pages back to LRU since mbind_range() is not called to
really apply the policy so those pages should not be migrated, this is
also the old behavior before the problematic commit.
Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: d883544515 ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [4.19 and 5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we have KERNEL_HEADER_TEST all headers are generally compile
tested, so relying on makefile tricks to avoid compiling code that depends
on CONFIG_MMU_NOTIFIER is more annoying.
Instead follow the usual pattern and provide most of the header with only
the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
ensures code compiles no matter what the config setting is.
While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.
Link: https://lore.kernel.org/r/20191112202231.3856-2-jgg@ziepe.ca
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
While upgrading from 4.16 to 5.2, we noticed these allocation errors in
the log of the new kernel:
SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
node 0: slabs: 5, objs: 170, free: 0
slab_out_of_memory+1
___slab_alloc+969
__slab_alloc+14
kmem_cache_alloc+346
inet_twsk_alloc+60
tcp_time_wait+46
tcp_fin+206
tcp_data_queue+2034
tcp_rcv_state_process+784
tcp_v6_do_rcv+405
__release_sock+118
tcp_close+385
inet_release+46
__sock_release+55
sock_close+17
__fput+170
task_work_run+127
exit_to_usermode_loop+191
do_syscall_64+212
entry_SYSCALL_64_after_hwframe+68
accompanied by an increase in machines going completely radio silent
under memory pressure.
One thing that changed since 4.16 is e699e2c6a6 ("net, mm: account
sock objects to kmemcg"), which made these slab caches subject to cgroup
memory accounting and control.
The problem with that is that cgroups, unlike the page allocator, do not
maintain dedicated atomic reserves. As a cgroup's usage hovers at its
limit, atomic allocations - such as done during network rx - can fail
consistently for extended periods of time. The kernel is not able to
operate under these conditions.
We don't want to revert the culprit patch, because it indeed tracks a
potentially substantial amount of memory used by a cgroup.
We also don't want to implement dedicated atomic reserves for cgroups.
There is no point in keeping a fixed margin of unused bytes in the
cgroup's memory budget to accomodate a consumer that is impossible to
predict - we'd be wasting memory and get into configuration headaches,
not unlike what we have going with min_free_kbytes. We do this for
physical mem because we have to, but cgroups are an accounting game.
Instead, account these privileged allocations to the cgroup, but let
them bypass the configured limit if they have to. This way, we get the
benefits of accounting the consumed memory and have it exert pressure on
the rest of the cgroup, but like with the page allocator, we shift the
burden of reclaimining on behalf of atomic allocations onto the regular
allocations that can block.
Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
Fixes: e699e2c6a6 ("net, mm: account sock objects to kmemcg")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [4.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We recently started updating the node span based on the zone span to
avoid touching uninitialized memmaps.
Currently, we will always detect the node span to start at 0, meaning a
node can easily span too many pages. pgdat_is_empty() will still work
correctly if all zones span no pages. We should skip over all zones
without spanned pages and properly handle the first detected zone that
spans pages.
Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
span cannot easily be inspected and tested. The node span gives no real
guarantees when an architecture supports memory hotplug, meaning it can
easily contain holes or span pages of different nodes.
The node span is not really used after init on architectures that
support memory hotplug.
E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
mm/kmemleak.c:kmemleak_scan(). These users seem to be fine.
Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
Fixes: 00d6c019b5 ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
slab pages, because it depends on PgHead AND PgSlab flags to be set to
determine the memory cgroup from the kmem_cache. It's correct for
compound pages, but not for generic small pages. Those don't have PgHead
set, so it ends up returning zero.
Fix this by replacing the condition to PageSlab() && !PageTail().
Before this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 38 0 _______S___________________________________ slab
After this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 147 0 _______S___________________________________ slab
Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
to filter error injection events based on memcg. So if
page_cgroup_ino() fails to return memcg pointer, we just fail to inject
memory error. Considering that hwpoison filter is for testing, affected
users are limited and the impact should be marginal.
[n-horiguchi@ah.jp.nec.com: changelog additions]
Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While investigating a bug related to higher atomic allocation failures,
we noticed the failure warnings positively drowning the console, and in
our case trigger lockup warnings because of a serial console too slow to
handle all that output.
But even if we had a faster console, it's unclear what additional
information the current level of repetition provides.
Allocation failures happen for three reasons: The machine is OOM, the VM
is failing to handle reasonable requests, or somebody is making
unreasonable requests (and didn't acknowledge their opportunism with
__GFP_NOWARN). Having the memory dump, a callstack, and the ratelimit
stats on skipped failure warnings should provide enough information to
let users/admins/developers know whether something is wrong and point
them in the right direction for debugging, bpftracing etc.
Limit allocation failure warnings to one spew every ten seconds.
Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I got some khugepaged spew on a 32bit x86:
BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
INFO: lockdep is turned off.
CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203 07/08/2009
Call Trace:
dump_stack+0x66/0x8e
___might_sleep.cold.96+0x95/0xa6
__might_sleep+0x2e/0x80
collapse_huge_page.isra.51+0x5ac/0x1360
khugepaged+0x9a9/0x20f0
kthread+0xf5/0x110
ret_from_fork+0x2e/0x38
Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
vs. mmu_notifier_invalidate_range_start(). Let's do the naive approach
and just reorder the two operations.
Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
Fixes: 810e24e009 ("mm/mmu_notifiers: annotate with might_sleep()")
Signed-off-by: Ville Syrjl <ville.syrjala@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator. On large machines this might even trigger
the hard lockup detector.
Considering the pagetypeinfo is a debugging tool we do not really need
exact numbers here. The primary reason to look at the outuput is to see
how pageblocks are spread among different migratetypes and low number of
pages is much more interesting therefore putting a bound on the number
of pages on the free_list sounds like a reasonable tradeoff.
The new output will simply tell
[...]
Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648
instead of
Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648
The limit has been chosen arbitrary and it is a subject of a future
change should there be a need for that.
While we are at it, also drop the zone lock after each free_list
iteration which will help with the IRQ and page allocator responsiveness
even further as the IRQ lock held time is always bound to those 100k
pages.
[akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state wrt to fragmentation. It is not very useful for any
other use so normal users really do not need to read this file.
Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that a)
interferes with the page allocator and its users and b) can lead to hard
lockups on large machines which have very long free_list.
Reduce both issues by simply not exporting the file to regular users.
Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
Fixes: 467c996c1e ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Waiman Long <longman@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Jann Horn <jannh@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return code from the op callback is actually in _ret, while the
WARN_ON was checking ret which causes it to misfire.
Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
Fixes: 8402ce61be ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Deferred memory initialisation updates zone->managed_pages during the
initialisation phase but before that finishes, the per-cpu page
allocator (pcpu) calculates the number of pages allocated/freed in
batches as well as the maximum number of pages allowed on a per-cpu
list. As zone->managed_pages is not up to date yet, the pcpu
initialisation calculates inappropriately low batch and high values.
This increases zone lock contention quite severely in some cases with
the degree of severity depending on how many CPUs share a local zone and
the size of the zone. A private report indicated that kernel build
times were excessive with extremely high system CPU usage. A perf
profile indicated that a large chunk of time was lost on zone->lock
contention.
This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system. It was
tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.
mmtests configuration: config-workload-kernbench-max Configuration was
modified to build on a fresh XFS partition.
kernbench
5.4.0-rc3 5.4.0-rc3
vanilla resetpcpu-v2
Amean user-256 13249.50 ( 0.00%) 16401.31 * -23.79%*
Amean syst-256 14760.30 ( 0.00%) 4448.39 * 69.86%*
Amean elsp-256 162.42 ( 0.00%) 119.13 * 26.65%*
Stddev user-256 42.97 ( 0.00%) 19.15 ( 55.43%)
Stddev syst-256 336.87 ( 0.00%) 6.71 ( 98.01%)
Stddev elsp-256 2.46 ( 0.00%) 0.39 ( 84.03%)
5.4.0-rc3 5.4.0-rc3
vanilla resetpcpu-v2
Duration User 39766.24 49221.79
Duration System 44298.10 13361.67
Duration Elapsed 519.11 388.87
The patch reduces system CPU usage by 69.86% and total build time by
26.65%. The variance of system CPU usage is also much reduced.
Before, this was the breakdown of batch and high values over all zones
was:
256 batch: 1
256 batch: 63
512 batch: 7
256 high: 0
256 high: 378
512 high: 42
512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After
the patch:
256 batch: 1
768 batch: 63
256 high: 0
768 high: 378
[mgorman@techsingularity.net: fix merge/linkage snafu]
Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.netLink: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add two utilities to 1) write-protect and 2) clean all ptes pointing into
a range of an address space.
The utilities are intended to aid in tracking dirty pages (either
driver-allocated system memory or pci device memory).
The write-protect utility should be used in conjunction with
page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
accesses. Typically one would want to use this on sparse accesses into
large memory regions. The clean utility should be used to utilize
hardware dirtying functionality and avoid the overhead of page-faults,
typically on large accesses into small memory regions.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
For users that want to travers all page table entries pointing into a
region of a struct address_space mapping, introduce a walk_page_mapping()
function.
The walk_page_mapping() function will be initially be used for dirty-
tracking in virtual graphics drivers.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Without the lock, anybody modifying a pte from within this function might
have it concurrently modified by someone else.
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Add two new probe_kernel_read_strict() and strncpy_from_unsafe_strict()
helpers which by default alias to the __probe_kernel_read() and the
__strncpy_from_unsafe(), respectively, but can be overridden by archs
which have non-overlapping address ranges for kernel space and user
space in order to bail out with -EFAULT when attempting to probe user
memory including non-canonical user access addresses [0]:
4-level page tables:
user-space mem: 0x0000000000000000 - 0x00007fffffffffff
non-canonical: 0x0000800000000000 - 0xffff7fffffffffff
5-level page tables:
user-space mem: 0x0000000000000000 - 0x00ffffffffffffff
non-canonical: 0x0100000000000000 - 0xfeffffffffffffff
The idea is that these helpers are complementary to the probe_user_read()
and strncpy_from_unsafe_user() which probe user-only memory. Both added
helpers here do the same, but for kernel-only addresses.
Both set of helpers are going to be used for BPF tracing. They also
explicitly avoid throwing the splat for non-canonical user addresses from
00c42373d3 ("x86-64: add warning for non-canonical user access address
dereferences").
For compat, the current probe_kernel_read() and strncpy_from_unsafe() are
left as-is.
[0] Documentation/x86/x86_64/mm.txt
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: x86@kernel.org
Link: https://lore.kernel.org/bpf/eefeefd769aa5a013531f491a71f0936779e916b.1572649915.git.daniel@iogearbox.net
Commit 3d7081822f ("uaccess: Add non-pagefault user-space read functions")
missed to add probe write function, therefore factor out a probe_write_common()
helper with most logic of probe_kernel_write() except setting KERNEL_DS, and
add a new probe_user_write() helper so it can be used from BPF side.
Again, on some archs, the user address space and kernel address space can
co-exist and be overlapping, so in such case, setting KERNEL_DS would mean
that the given address is treated as being in kernel address space.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/9df2542e68141bfa3addde631441ee45503856a8.1572649915.git.daniel@iogearbox.net
If a device driver like nouveau tries to use hmm_range_fault() to access
the special shared zero page in system memory, hmm_range_fault() will
return -EFAULT and kill the process.
Allow hmm_range_fault() to return success (0) when the CPU pagetable entry
points to the special shared zero page.
page_to_pfn() and pfn_to_page() are defined on the zero page so just
handle it like any other page.
Link: https://lore.kernel.org/r/20191023195515.13168-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Once a THP is added to the page cache, it cannot be dropped via
/proc/sys/vm/drop_caches. Fix this issue with proper handling in
invalidate_mapping_pages().
Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__remove_mapping() assumes that pages can only be either base pages or
HPAGE_PMD_SIZE. Ask the page what size it is.
Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure split_huge_page_to_list() handles the state of shmem THP and
file THP properly.
Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com
Fixes: 60fbf0ab5d ("mm,thp: stats for file backed THP")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm_init.c needs to include <linux/mman.h> for the definition of
vm_committed_as_batch. Fixes the following sparse warning:
mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generic_file_vm_ops is defined in <linux/ramfs.h> so include it to
fix the following warning:
mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack
to fix the following sparse warning:
mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mapped, dirty and writeback pages are also counted in per-lruvec stats.
These counters needs update when page is moved between cgroups.
Currently is nobody *consuming* the lruvec versions of these counters and
that there is no user-visible effect.
Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz
Fixes: 00f3ca2c2d ("mm: memcontrol: per-lruvec stats infrastructure")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uninitialized memmaps contain garbage and in the worst case trigger
kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get
touched.
Let's make sure that we only consider online memory (managed by the
buddy) that has initialized memmaps. ZONE_DEVICE is not applicable.
page_zone() will call page_to_nid(), which will trigger
VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This
can be the case when an offline memory block (e.g., never onlined) is
spanned by a zone.
Note: As explained by Michal in [1], alloc_contig_range() will verify
the range. So it boils down to the wrong access in this function.
[1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Until commit 92d12f9544 ("memblock: refactor internal allocation
functions") the maximal address for memblock allocations was forced to
memblock.current_limit only for the allocation functions returning
virtual address. The changes introduced by that commit moved the limit
enforcement into the allocation core and as a result the allocation
functions returning physical address also started to limit allocations
to memblock.current_limit.
This caused breakage of etnaviv GPU driver:
etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
etnaviv-gpu 130000.gpu: command buffer outside valid memory window
etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
etnaviv-gpu 134000.gpu: command buffer outside valid memory window
etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0
Restore the behaviour of memblock_phys* family so that these functions
will not enforce memblock.current_limit.
Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org
Fixes: 92d12f9544 ("memblock: refactor internal allocation functions")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Adam Ford <aford173@gmail.com>
Tested-by: Adam Ford <aford173@gmail.com> [imx6q-logicpd]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1a61ab8038 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") has made lruvec_page_state to use per-cpu counters
instead of calculating it directly from lru_zone_size with an idea that
this would be more effective.
Tim has reported that this is not really the case for their database
benchmark which is showing an opposite results where lruvec_page_state
is taking up a huge chunk of CPU cycles (about 25% of the system time
which is roughly 7% of total cpu cycles) on 5.3 kernels. The workload
is running on a larger machine (96cpus), it has many cgroups (500) and
it is heavily direct reclaim bound.
Tim Chen said:
: The problem can also be reproduced by running simple multi-threaded
: pmbench benchmark with a fast Optane SSD swap (see profile below).
:
:
: 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size
: |
: |--3.07%--lruvec_lru_size
: | |
: | |--2.11%--cpumask_next
: | | |
: | | --1.66%--find_next_bit
: | |
: | --0.57%--call_function_interrupt
: | |
: | --0.55%--smp_call_function_interrupt
: |
: |--1.59%--0x441f0fc3d009
: | _ops_rdtsc_init_base_freq
: | access_histogram
: | page_fault
: | __do_page_fault
: | handle_mm_fault
: | __handle_mm_fault
: | |
: | --1.54%--do_swap_page
: | swapin_readahead
: | swap_cluster_readahead
: | |
: | --1.53%--read_swap_cache_async
: | __read_swap_cache_async
: | alloc_pages_vma
: | __alloc_pages_nodemask
: | __alloc_pages_slowpath
: | try_to_free_pages
: | do_try_to_free_pages
: | shrink_node
: | shrink_node_memcg
: | |
: | |--0.77%--lruvec_lru_size
: | |
: | --0.76%--inactive_list_is_low
: | |
: | --0.76%--lruvec_lru_size
: |
: --1.50%--measure_read
: page_fault
: __do_page_fault
: handle_mm_fault
: __handle_mm_fault
: do_swap_page
: swapin_readahead
: swap_cluster_readahead
: |
: --1.48%--read_swap_cache_async
: __read_swap_cache_async
: alloc_pages_vma
: __alloc_pages_nodemask
: __alloc_pages_slowpath
: try_to_free_pages
: do_try_to_free_pages
: shrink_node
: shrink_node_memcg
: |
: |--0.75%--inactive_list_is_low
: | |
: | --0.75%--lruvec_lru_size
: |
: --0.73%--lruvec_lru_size
The likely culprit is the cache traffic the lruvec_page_state_local
generates. Dave Hansen says:
: I was thinking purely of the cache footprint. If it's reading
: pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
: bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
:
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles
Fix the regression by partially reverting the said commit and calculate
the lru size explicitly.
Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: <stable@vger.kernel.org> [5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In several routines, the "flags" argument is incorrectly named "write".
Change it to "flags".
Also, in one place, the misnaming led to an actual bug:
"flags & FOLL_WRITE" is required, rather than just "flags".
(That problem was flagged by krobot, in v1 of this patch.)
Also, change the flags argument from int, to unsigned int.
You can see that this was a simple oversight, because the
calling code passes "flags" to the fifth argument:
gup_pgd_range():
...
if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
PGDIR_SHIFT, next, flags, pages, nr))
...which, until this patch, the callees referred to as "write".
Also, change two lines to avoid checkpatch line length
complaints, and another line to fix another oversight
that checkpatch called out: missing "int" on pdshift.
Link: http://lkml.kernel.org/r/20191014184639.1512873-3-jhubbard@nvidia.com
Fixes: b798bec474 ("mm/gup: change write parameter to flags in fast walk")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: kbuild test robot <lkp@intel.com>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Karsten reported the following panic in __free_slab() happening on a s390x
machine:
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
Krnl Code: 00000000003cada6: e310a1400004 lg %r1,320(%r10)
00000000003cadac: c0e50046c286 brasl %r14,ca32b8
#00000000003cadb2: a7f4fe36 brc 15,3caa1e
>00000000003cadb6: e32060800024 stg %r2,128(%r6)
00000000003cadbc: a7f4fd9e brc 15,3ca8f8
00000000003cadc0: c0e50046790c brasl %r14,c99fd8
00000000003cadc6: a7f4fe2c brc 15,3caa
00000000003cadc6: a7f4fe2c brc 15,3caa1e
00000000003cadca: ecb1ffff00d9 aghik %r11,%r1,-1
Call Trace:
(<00000000003cabcc> __free_slab+0x49c/0x6b0)
<00000000001f5886> rcu_core+0x5a6/0x7e0
<0000000000ca2dea> __do_softirq+0xf2/0x5c0
<0000000000152644> irq_exit+0x104/0x130
<000000000010d222> do_IRQ+0x9a/0xf0
<0000000000ca2344> ext_int_handler+0x130/0x134
<0000000000103648> enabled_wait+0x58/0x128
(<0000000000103634> enabled_wait+0x44/0x128)
<0000000000103b00> arch_cpu_idle+0x40/0x58
<0000000000ca0544> default_idle_call+0x3c/0x68
<000000000018eaa4> do_idle+0xec/0x1c0
<000000000018ee0e> cpu_startup_entry+0x36/0x40
<000000000122df34> arch_call_rest_init+0x5c/0x88
<0000000000000000> 0x0
INFO: lockdep is turned off.
Last Breaking-Event-Address:
<00000000003ca8f4> __free_slab+0x1c4/0x6b0
Kernel panic - not syncing: Fatal exception in interrupt
The kernel panics on an attempt to dereference the NULL memcg pointer.
When shutdown_cache() is called from the kmem_cache_destroy() context, a
memcg kmem_cache might have empty slab pages in a partial list, which are
still charged to the memory cgroup.
These pages are released by free_partial() at the beginning of
shutdown_cache(): either directly or by scheduling a RCU-delayed work
(if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag). The latter case
is when the reported panic can happen: memcg_unlink_cache() is called
immediately after shrinking partial lists, without waiting for scheduled
RCU works. It sets the kmem_cache->memcg_params.memcg pointer to NULL,
and the following attempt to dereference it by __free_slab() from the
RCU work context causes the panic.
To fix the issue, let's postpone the release of the memcg pointer to
destroy_memcg_params(). It's called from a separate work context by
slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
This guarantees that all scheduled page release RCU works will complete
before the memcg pointer will be zeroed.
Big thanks for Karsten for the perfect report containing all necessary
information, his help with the analysis of the problem and testing of the
fix.
Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
Fixes: fb2f2b0adb ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Karsten Graul <kgraul@linux.ibm.com>
Tested-by: Karsten Graul <kgraul@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/memory_hotplug: Shrink zones before removing memory",
v6.
This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory. Also, it contains all fixes for
crashes that can be triggered when removing certain namespace using
memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect
the ZONE_DEVICE as contiguous).
We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount
of code to a minimum. Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of
DIMMs at zone boundaries.
--------------------------------------------------------------------------
Zones are now properly shrunk when offlining memory blocks or when
onlining failed. This allows to properly shrink zones on memory unplug
even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.
Example:
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 98304
present 65536
managed 65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 32768
present 32768
managed 32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone Movable
spanned 0
present 0
managed 0
This patch (of 10):
With an altmap, the memmap falling into the reserved altmap space are not
initialized and, therefore, contain a garbage NID and a garbage zone.
Make sure to read the NID/zone from a memmap that was initialized.
This fixes a kernel crash that is observed when destroying a namespace:
kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
pc: c0000000004b9728: memunmap_pages+0x238/0x340
lr: c0000000004b9724: memunmap_pages+0x234/0x340
...
pid = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
devm_action_release+0x30/0x50
release_nodes+0x268/0x2d0
device_release_driver_internal+0x174/0x240
unbind_store+0x13c/0x190
drv_attr_store+0x44/0x60
sysfs_kf_write+0x70/0xa0
kernfs_fop_write+0x1ac/0x290
__vfs_write+0x3c/0x70
vfs_write+0xe4/0x200
ksys_write+0x7c/0x140
system_call+0x5c/0x68
The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f48 ("mm,
devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
think we will never have driver reserved memory with
MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
[david@redhat.com: minimze code changes, rephrase description]
Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
Fixes: 2c2a5af6fe ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Damian Tometzki <damian.tometzki@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Rich Felker <dalias@libc.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We might use the nid of memmaps that were never initialized. For
example, if the memmap was poisoned, we will crash the kernel in
pfn_to_nid() right now. Let's use the calculated boundaries of the
separate zones instead. This now also avoids having to iterate over a
whole bunch of subsections again, after shrinking one zone.
Before commit d0dc12e86b ("mm/memory_hotplug: optimize memory
hotplug"), the memmap was initialized to 0 and the node was set to the
right value. After that commit, the node might be garbage.
We'll have to fix shrink_zone_span() next.
Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [d0dc12e86b]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Damian Tometzki <damian.tometzki@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Rich Felker <dalias@libc.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uninitialized memmaps contain garbage and in the worst case trigger
kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get
touched.
For example, when not onlining a memory block that is spanned by a zone
and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
:/# echo 1 > /sys/devices/system/memory/memory40/online
:/# echo 1 > /sys/devices/system/memory/memory42/online
:/# cat /proc/pagetypeinfo > test.file
page:fffff2c585200000 is uninitialized and poisoned
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
There is not page extension available.
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:1107!
invalid opcode: 0000 [#1] SMP NOPTI
Please note that this change does not affect ZONE_DEVICE, because
pagetypeinfo_showmixedcount_print() is called from
mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
ZONE_DEVICE is never populated (zone->present_pages always 0).
[david@redhat.com: move check to outer loop, add comment, rephrase description]
Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should check for pfn_to_online_page() to not access uninitialized
memmaps. Reshuffle the code so we don't have to duplicate the error
message.
Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b]
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we tested pmdk unit test [1] vmmalloc_fork TEST3 on arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.
To reproduce the bug, the cmd is as follows after you deployed everything:
make -C src/test/vmmalloc_fork/ TEST_TIME=60m check
Below call trace is from arm64 do_page_fault for debugging purpose:
[ 110.016195] Call trace:
[ 110.016826] do_page_fault+0x5a4/0x690
[ 110.017812] do_mem_abort+0x50/0xb0
[ 110.018726] el1_da+0x20/0xc4
[ 110.019492] __arch_copy_from_user+0x180/0x280
[ 110.020646] do_wp_page+0xb0/0x860
[ 110.021517] __handle_mm_fault+0x994/0x1338
[ 110.022606] handle_mm_fault+0xe8/0x180
[ 110.023584] do_page_fault+0x240/0x690
[ 110.024535] do_mem_abort+0x50/0xb0
[ 110.025423] el0_da+0x20/0x24
The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
pmd=000000023d4b3003, pte=360000298607bd3
As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."
This patch fixes it by calling pte_mkyoung. Also, the parameter is
changed because vmf should be passed to cow_user_page()
Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
in case there can be some obscure use-case (by Kirill).
[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
Signed-off-by: Jia He <justin.he@arm.com>
Reported-by: Yibo Cai <Yibo.Cai@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mmap /dev/dax more than once, then read the poison location using
address from one of the mappings. The other mappings due to not having
the page mapped in will cause SIGKILLs delivered to the process.
SIGKILL succeeds over SIGBUS, so user process loses the opportunity to
handle the UE.
Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
isn't always an option.
Details -
ndctl inject-error --block=10 --count=1 namespace6.0
./read_poison -x dax6.0 -o 5120 -m 2
mmaped address 0x7f5bb6600000
mmaped address 0x7f3cf3600000
doing local read at address 0x7f3cf3601400
Killed
Console messages in instrumented kernel -
mce: Uncorrected hardware memory error in user-access at edbe201400
Memory failure: tk->addr = 7f5bb6601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
dev_pagemap_mapping_shift: page edbe201: no PUD
Memory failure: tk->size_shift == 0
Memory failure: Unable to find user space address edbe201 in read_poison
Memory failure: tk->addr = 7f3cf3601000
Memory failure: address edbe201: call dev_pagemap_mapping_shift
Memory failure: tk->size_shift = 21
Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
=> to deliver SIGKILL
Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
=> to deliver SIGBUS
Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warning in mm/slab.c:
mm/slab.c:4215: warning: Function parameter or member 'objp' not described in '__ksize'
Also add Return: documentation section for this function.
Link: http://lkml.kernel.org/r/68c9fd7d-f09e-d376-e292-c7b2bdf1774d@infradead.org
Fixes: 10d1f8cb39 ("mm/slab: refactor common ksize KASAN logic into slab_common.c")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Florian and Dave reported [1] a NULL pointer dereference in
__reset_isolation_pfn(). While the exact cause is unclear, staring at
the code revealed two bugs, which might be related.
One bug is that if zone starts in the middle of pageblock, block_page
might correspond to different pfn than block_pfn, and then the
pfn_valid_within() checks will check different pfn's than those accessed
via struct page. This might result in acessing an unitialized page in
CONFIG_HOLES_IN_ZONE configs.
The other bug is that end_page refers to the first page of next
pageblock and not last page of current pageblock. The online and valid
check is then wrong and with sections, the while (page < end_page) loop
might wander off actual struct page arrays.
[1] https://lore.kernel.org/linux-xfs/87o8z1fvqu.fsf@mid.deneb.enyo.de/
Link: http://lkml.kernel.org/r/20191008152915.24704-1-vbabka@suse.cz
Fixes: 6b0868c820 ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Florian Weimer <fw@deneb.enyo.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b39d0ee263 ("mm, page_alloc: avoid expensive reclaim when
compaction may not succeed") has chnaged the allocator to bail out from
the allocator early to prevent from a potentially excessive memory
reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation,
reclaim and compaction loop as long as there is a reasonable chance to
make forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
The most obvious affected subsystem is hugetlbfs which allocates huge
pages based on an admin request (or via admin configured overcommit). I
have done a simple test which tries to allocate half of the memory for
hugetlb pages while the memory is full of a clean page cache. This is
not an unusual situation because we try to cache as much of the memory
as possible and sysctl/sysfs interface to allocate huge pages is there
for flexibility to allocate hugetlb pages at any time.
System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
after the memory is prefilled by a clean page cache:
root@test1:~# cat hugetlb_test.sh
set -x
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
echo 256 > /proc/sys/vm/nr_hugepages
cat /proc/sys/vm/nr_hugepages
The results for 2 consecutive runs on clean 5.3
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
+ date +%s
+ TS=1569905284
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
+ date +%s
+ TS=1569905311
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
Now with b39d0ee263 applied
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
+ date +%s
+ TS=1569905516
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
11
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
+ date +%s
+ TS=1569905541
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
12
The success rate went down by factor of 20!
Although hugetlb allocation requests might fail and it is reasonable to
expect them to under extremely fragmented memory or when the memory is
under a heavy pressure but the above situation is not that case.
Fix the regression by reverting back to the previous behavior for
__GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for
those requests.
Mike said:
: hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
: boot where this may not be as much of an issue. However, I am aware of at
: least three use cases where allocations are made after the system has been
: up and running for quite some time:
:
: - DB reconfiguration. If sysctl/sysfs fails to get required number of
: huge pages, system is rebooted to perform allocation after boot.
:
: - VM provisioning. If unable get required number of huge pages, fall
: back to base pages.
:
: - An application that does not preallocate pool, but rather allocates
: pages at fault time for optimal NUMA locality.
:
: In all cases, I would expect b39d0ee263 to cause regressions and
: noticable behavior changes.
:
: My quick/limited testing in
: https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
: was insufficient. It was also mentioned that if something like
: b39d0ee263 went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
: requests as in this patch.
[mhocko@suse.com: reworded changelog]
Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
Fixes: b39d0ee263 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slab_alloc_node() already zeroed out the freelist pointer if
init_on_free was on. Thibaut Sautereau noticed that the same needs to
be done for kmem_cache_alloc_bulk(), which performs the allocations
separately.
kmem_cache_alloc_bulk() is currently used in two places in the kernel,
so this change is unlikely to have a major performance impact.
SLAB doesn't require a similar change, as auto-initialization makes the
allocator store the freelist pointers off-slab.
Link: http://lkml.kernel.org/r/20191007091605.30530-1-glider@google.com
Fixes: 6471384af2 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Thibaut Sautereau <thibaut@sautereau.fr>
Reported-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Laura Abbott <labbott@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A long time ago we fixed a similar deadlock in show_slab_objects() [1].
However, it is apparently due to the commits like 01fb58bcba ("slab:
remove synchronous synchronize_sched() from memcg cache deactivation
path") and 03afc0e25f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back by
just reading files in /sys/kernel/slab which will generate a lockdep
splat below.
Since the "mem_hotplug_lock" here is only to obtain a stable online node
mask while racing with NUMA node hotplug, in the worst case, the results
may me miscalculated while doing NUMA node hotplug, but they shall be
corrected by later reads of the same files.
WARNING: possible circular locking dependency detected
------------------------------------------------------
cat/5224 is trying to acquire lock:
ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
show_slab_objects+0x94/0x3a8
but task is already holding lock:
b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (kn->count#45){++++}:
lock_acquire+0x31c/0x360
__kernfs_remove+0x290/0x490
kernfs_remove+0x30/0x44
sysfs_remove_dir+0x70/0x88
kobject_del+0x50/0xb0
sysfs_slab_unlink+0x2c/0x38
shutdown_cache+0xa0/0xf0
kmemcg_cache_shutdown_fn+0x1c/0x34
kmemcg_workfn+0x44/0x64
process_one_work+0x4f4/0x950
worker_thread+0x390/0x4bc
kthread+0x1cc/0x1e8
ret_from_fork+0x10/0x18
-> #1 (slab_mutex){+.+.}:
lock_acquire+0x31c/0x360
__mutex_lock_common+0x16c/0xf78
mutex_lock_nested+0x40/0x50
memcg_create_kmem_cache+0x38/0x16c
memcg_kmem_cache_create_func+0x3c/0x70
process_one_work+0x4f4/0x950
worker_thread+0x390/0x4bc
kthread+0x1cc/0x1e8
ret_from_fork+0x10/0x18
-> #0 (mem_hotplug_lock.rw_sem){++++}:
validate_chain+0xd10/0x2bcc
__lock_acquire+0x7f4/0xb8c
lock_acquire+0x31c/0x360
get_online_mems+0x54/0x150
show_slab_objects+0x94/0x3a8
total_objects_show+0x28/0x34
slab_attr_show+0x38/0x54
sysfs_kf_seq_show+0x198/0x2d4
kernfs_seq_show+0xa4/0xcc
seq_read+0x30c/0x8a8
kernfs_fop_read+0xa8/0x314
__vfs_read+0x88/0x20c
vfs_read+0xd8/0x10c
ksys_read+0xb0/0x120
__arm64_sys_read+0x54/0x88
el0_svc_handler+0x170/0x240
el0_svc+0x8/0xc
other info that might help us debug this:
Chain exists of:
mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(kn->count#45);
lock(slab_mutex);
lock(kn->count#45);
lock(mem_hotplug_lock.rw_sem);
*** DEADLOCK ***
3 locks held by cat/5224:
#0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
#1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
#2: b8ff009693eee398 (kn->count#45){++++}, at:
kernfs_seq_start+0x44/0xf0
stack backtrace:
Call trace:
dump_backtrace+0x0/0x248
show_stack+0x20/0x2c
dump_stack+0xd0/0x140
print_circular_bug+0x368/0x380
check_noncircular+0x248/0x250
validate_chain+0xd10/0x2bcc
__lock_acquire+0x7f4/0xb8c
lock_acquire+0x31c/0x360
get_online_mems+0x54/0x150
show_slab_objects+0x94/0x3a8
total_objects_show+0x28/0x34
slab_attr_show+0x38/0x54
sysfs_kf_seq_show+0x198/0x2d4
kernfs_seq_show+0xa4/0xcc
seq_read+0x30c/0x8a8
kernfs_fop_read+0xa8/0x314
__vfs_read+0x88/0x20c
vfs_read+0xd8/0x10c
ksys_read+0xb0/0x120
__arm64_sys_read+0x54/0x88
el0_svc_handler+0x170/0x240
el0_svc+0x8/0xc
I think it is important to mention that this doesn't expose the
show_slab_objects to use-after-free. There is only a single path that
might really race here and that is the slab hotplug notifier callback
__kmem_cache_shrink (via slab_mem_going_offline_callback) but that path
doesn't really destroy kmem_cache_node data structures.
[1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html
[akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
Fixes: 01fb58bcba ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
Fixes: 03afc0e25f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 37389167a2 ("mm, page_owner: keep owner info when freeing the
page") has introduced a flag PAGE_EXT_OWNER_ACTIVE to indicate that page
is tracked as being allocated. Kirril suggested naming it
PAGE_EXT_OWNER_ALLOCATED to make it more clear, as "active is somewhat
loaded term for a page".
Link: http://lkml.kernel.org/r/20190930122916.14969-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8974558f49 ("mm, page_owner, debug_pagealloc: save and dump
freeing stack trace") enhanced page_owner to also store freeing stack
trace, when debug_pagealloc is also enabled. KASAN would also like to
do this [1] to improve error reports to debug e.g. UAF issues.
Kirill has suggested that the freeing stack trace saving should be also
possible to be enabled separately from KASAN or debug_pagealloc, i.e.
with an extra boot option. Qian argued that we have enough options
already, and avoiding the extra overhead is not worth the complications
in the case of a debugging option. Kirill noted that the extra stack
handle in struct page_owner requires 0.1% of memory.
This patch therefore enables free stack saving whenever page_owner is
enabled, regardless of whether debug_pagealloc or KASAN is also enabled.
KASAN kernels booted with page_owner=on will thus benefit from the
improved error reports.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=203967
[vbabka@suse.cz: v3]
Link: http://lkml.kernel.org/r/20191007091808.7096-3-vbabka@suse.cz
Link: http://lkml.kernel.org/r/20190930122916.14969-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Qian Cai <cai@lca.pw>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Walter Wu <walter-zh.wu@mediatek.com>
Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "followups to debug_pagealloc improvements through
page_owner", v3.
These are followups to [1] which made it to Linus meanwhile. Patches 1
and 3 are based on Kirill's review, patch 2 on KASAN request [2]. It
would be nice if all of this made it to 5.4 with [1] already there (or
at least Patch 1).
This patch (of 3):
As noted by Kirill, commit 7e2f2a0cd1 ("mm, page_owner: record page
owner for each subpage") has introduced an off-by-one error in
__set_page_owner_handle() when looking up page_ext for subpages. As a
result, the head page page_owner info is set twice, while for the last
tail page, it's not set at all.
Fix this and also make the code more efficient by advancing the page_ext
pointer we already have, instead of calling lookup_page_ext() for each
subpage. Since the full size of struct page_ext is not known at compile
time, we can't use a simple page_ext++ statement, so introduce a
page_ext_next() inline function for that.
Link: http://lkml.kernel.org/r/20190930122916.14969-2-vbabka@suse.cz
Fixes: 7e2f2a0cd1 ("mm, page_owner: record page owner for each subpage")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Reported-by: Miles Chen <miles.chen@mediatek.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case of an error (e.g. memory pool too small), kmemleak disables
itself and cleans up the already allocated metadata objects. However, if
this happens early before the RCU callback mechanism is available,
put_object() skips call_rcu() and frees the object directly. This is not
safe with the RCU list traversal in __kmemleak_do_cleanup().
Change the list traversal in __kmemleak_do_cleanup() to
list_for_each_entry_safe() and remove the rcu_read_{lock,unlock} since
the kmemleak is already disabled at this point. In addition, avoid an
unnecessary metadata object rb-tree look-up since it already has the
struct kmemleak_object pointer.
Fixes: c566586818 ("mm: kmemleak: use the memory pool for early allocations")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Reported-by: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull mount fixes from Al Viro:
"A couple of regressions from the mount series"
* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: add missing blkdev_put() in get_tree_bdev()
shmem: fix LSM options parsing
In most configurations, kmalloc() happens to return naturally aligned
(i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
developers have to devise workaround such as own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment. For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it. That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations. What this means for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The
implicitly provided alignment could be compromised with
CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
caches with alignment larger than unsigned long long. Practically on at
least x86 this includes kmalloc caches as they use cache line alignment,
which is larger than that. Still, this patch ensures alignment on all
arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through
CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
With this patch, explicit alignment is guaranteed with redzoning as
well. This will result in more memory being wasted, but that should be
acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for
kmalloc(). The potential downside is increased fragmentation. While
pathological allocation scenarios are certainly possible, in my testing,
after booting a x86_64 kernel+userspace with virtme, around 16MB memory
was consumed by slab pages both before and after the patch, with
difference in the noise.
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "guarantee natural alignment for kmalloc()", v2.
This patch (of 2):
SLOB currently doesn't account its pages at all, so in /proc/meminfo the
Slab field shows zero. Modifying a counter on page allocation and
freeing should be acceptable even for the small system scenarios SLOB is
intended for. Since reclaimable caches are not separated in SLOB,
account everything as unreclaimable.
SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
larger than order-1 page, that are passed directly to the page
allocator. As they also don't appear in /proc/slabinfo, it might look
like a memory leak. For consistency, account them as well. (SLAB
doesn't actually use page allocator directly, so no change there).
Ideally SLOB and SLUB would be handled in separate patches, but due to
the shared kmalloc_order() function and different kfree()
implementations, it's easier to patch both at once to prevent
inconsistencies.
Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is an incremental improvement on the existing
memory.{low,min} relative reclaim work to base its scan pressure
calculations on how much protection is available compared to the current
usage, rather than how much the current usage is over some protection
threshold.
This change doesn't change the experience for the user in the normal
case too much. One benefit is that it replaces the (somewhat arbitrary)
100% cutoff with an indefinite slope, which makes it easier to ballpark
a memory.low value.
As well as this, the old methodology doesn't quite apply generically to
machines with varying amounts of physical memory. Let's say we have a
top level cgroup, workload.slice, and another top level cgroup,
system-management.slice. We want to roughly give 12G to
system-management.slice, so on a 32GB machine we set memory.low to 20GB
in workload.slice, and on a 64GB machine we set memory.low to 52GB.
However, because these are relative amounts to the total machine size,
while the amount of memory we want to generally be willing to yield to
system.slice is absolute (12G), we end up putting more pressure on
system.slice just because we have a larger machine and a larger workload
to fill it, which seems fairly unintuitive. With this new behaviour, we
don't end up with this unintended side effect.
Previously the way that memory.low protection works is that if you are
50% over a certain baseline, you get 50% of your normal scan pressure.
This is certainly better than the previous cliff-edge behaviour, but it
can be improved even further by always considering memory under the
currently enforced protection threshold to be out of bounds. This means
that we can set relatively low memory.low thresholds for variable or
bursty workloads while still getting a reasonable level of protection,
whereas with the previous version we may still trivially hit the 100%
clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
one is more concretely based on the currently enforced protection
threshold, which is likely easier to reason about.
There is also a subtle issue with the way that proportional reclaim
worked previously -- it promotes having no memory.low, since it makes
pressure higher during low reclaim. This happens because we base our
scan pressure modulation on how far memory.current is between memory.min
and memory.low, but if memory.low is unset, we only use the overage
method. In most cromulent configurations, this then means that we end
up with *more* pressure than with no memory.low at all when we're in low
reclaim, which is not really very usable or expected.
With this patch, memory.low and memory.min affect reclaim pressure in a
more understandable and composable way. For example, from a user
standpoint, "protected" memory now remains untouchable from a reclaim
aggression standpoint, and users can also have more confidence that
bursty workloads will still receive some amount of guaranteed
protection.
Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman points out that when when we do the low reclaim pass, we scale the
reclaim pressure relative to position between 0 and the maximum
protection threshold.
However, if the maximum protection is based on memory.elow, and
memory.emin is above zero, this means we still may get binary behaviour
on second-pass low reclaim. This is because we scale starting at 0, not
starting at memory.emin, and since we don't scan at all below emin, we
end up with cliff behaviour.
This should be a fairly uncommon case since usually we don't go into the
second pass, but it makes sense to scale our low reclaim pressure
starting at emin.
You can test this by catting two large sparse files, one in a cgroup
with emin set to some moderate size compared to physical RAM, and
another cgroup without any emin. In both cgroups, set an elow larger
than 50% of physical RAM. The one with emin will have less page
scanning, as reclaim pressure is lower.
Rebase on top of and apply the same idea as what was applied to handle
cgroup_memory=disable properly for the original proportional patch
http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
memcg: Handle cgroup_disable=memory when getting memcg protection").
Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Suggested-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection). While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: that cliff behaviour often
manifests when they become eligible for reclaim. This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.
This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information. Imagine the following
timeline, with the numbers the lruvec size in this zone:
1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
scanned. (?!)
* Of course, we won't usually scan all available pages in the zone even
without this patch because of scan control priority, over-reclaim
protection, etc. However, as shown by the tests at the end, these
techniques don't sufficiently throttle such an extreme change in input,
so cliff-like behaviour isn't really averted by their existence alone.
Here's an example of how this plays out in practice. At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study). In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating. This isn't an exact science -- memory usage deemed
"comfortable" will vary over time due to user behaviour, differences in
composition of work, etc, etc. As such we need to ballpark memory.low,
but doing this is currently problematic:
1. If we end up setting it too low for the workload, it won't have
*any* effect (see discussion above). The group will receive the full
weight of reclaim and won't have any priority while competing with the
less important system software, as if we had no memory.low configured
at all.
2. Because of this behaviour, we end up erring on the side of setting
it too high, such that the comfort range is reliably covered. However,
protected memory is completely unavailable to the rest of the system,
so we might cause undue memory and IO pressure there when we *know* we
have some elasticity in the workload.
3. Even if we get the value totally right, smack in the middle of the
comfort zone, we get extreme jumps between no pressure and full
pressure that cause unpredictable pressure spikes in the workload due
to the current binary reclaim behaviour.
With this patch, we can set it to our ballpark estimation without too much
worry. Any undesirable behaviour, such as too much or too little reclaim
pressure on the workload or system will be proportional to how far our
estimation is off. This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.
As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance. Having to set these thresholds so high wastes
resources and generally works against the principle of work conservation.
In addition, having proportional memory reclaim behaviour has other
benefits. Most notably, before this patch it's basically mandatory to set
memory.low to a higher than desirable value because otherwise as soon as
you exceed memory.low, all protection is lost, and all pages are eligible
to scan again. By contrast, having a gradual ramp in reclaim pressure
means that you now still get some protection when thresholds are exceeded,
which means that one can now be more comfortable setting memory.low to
lower values without worrying that all protection will be lost. This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.
Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.
In testing these changes, I intended to verify that:
1. Changes in page scanning become gradual and proportional instead of
binary.
To test this, I experimented stepping further and further down
memory.low protection on a workload that floats around 19G workingset
when under memory.low protection, watching page scan rates for the
workload cgroup:
+------------+-----------------+--------------------+--------------+
| memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
+------------+-----------------+--------------------+--------------+
| 21G | 0 | 0 | N/A |
| 17G | 867 | 3799 | 23% |
| 12G | 1203 | 3543 | 34% |
| 8G | 2534 | 3979 | 64% |
| 4G | 3980 | 4147 | 96% |
| 0 | 3799 | 3980 | 95% |
+------------+-----------------+--------------------+--------------+
As you can see, the test kernel (with a kernel containing this
patch) ramps up page scanning significantly more gradually than the
control kernel (without this patch).
2. More gradual ramp up in reclaim aggression doesn't result in
premature OOMs.
To test this, I wrote a script that slowly increments the number of
pages held by stress(1)'s --vm-keep mode until a production system
entered severe overall memory contention. This script runs in a highly
protected slice taking up the majority of available system memory.
Watching vmstat revealed that page scanning continued essentially
nominally between test and control, without causing forward reclaim
progress to become arrested.
[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "mode" and "level" variables are enums and in this context GCC will
treat them as unsigned ints so the error handling is never triggered.
I also removed the bogus initializer because it isn't required any more
and it's sort of confusing.
[akpm@linux-foundation.org: reduce implicit and explicit typecasting]
[akpm@linux-foundation.org: fix return value, add comment, per Matthew]
Link: http://lkml.kernel.org/r/20190925110449.GO3264@mwanda
Fixes: 3cadfa2b94 ("mm/vmpressure.c: convert to use match_string() helper")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Enrico Weigelt <info@metux.net>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a really hard to reproduce race in z3fold between z3fold_free()
and z3fold_reclaim_page(). z3fold_reclaim_page() can claim the page
after z3fold_free() has checked if the page was claimed and
z3fold_free() will then schedule this page for compaction which may in
turn lead to random page faults (since that page would have been
reclaimed by then).
Fix that by claiming page in the beginning of z3fold_free() and not
forgetting to clear the claim in the end.
[vitalywool@gmail.com: v2]
Link: http://lkml.kernel.org/r/20190928113456.152742cf@bigdell
Link: http://lkml.kernel.org/r/20190926104844.4f0c6efa1366b8f5741eaba9@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reported-by: Markus Linnala <markus.linnala@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Markus Linnala <markus.linnala@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We get two warnings when build kernel W=1:
mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes]
mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes]
Make the functions static to fix this.
Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SECTION_SIZE and SECTION_MASK macros are not getting used anymore. But
they do conflict with existing definitions on arm64 platform causing
following warning during build. Lets drop these unused macros.
mm/memremap.c:16: warning: "SECTION_MASK" redefined
#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of the previous definition
#define SECTION_MASK (~(SECTION_SIZE-1))
mm/memremap.c:17: warning: "SECTION_SIZE" redefined
#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of the previous definition
#define SECTION_SIZE (_AC(1, UL) << SECTION_SHIFT)
Link: http://lkml.kernel.org/r/1569312010-31313-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A removable block device, such as NVMe or SSD connected over Thunderbolt
can be hot-removed any time including when the system is suspended. When
device is hot-removed during suspend and the system gets resumed, kernel
first resumes devices and then thaws the userspace including freezable
workqueues. What happens in that case is that the NVMe driver notices
that the device is unplugged and removes it from the system. This ends
up calling bdi_unregister() for the gendisk which then schedules
wb_workfn() to be run one more time.
However, since the bdi_wq is still frozen flush_delayed_work() call in
wb_shutdown() blocks forever halting system resume process. User sees
this as hang as nothing is happening anymore.
Triggering sysrq-w reveals this:
Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme]
Call Trace:
? __schedule+0x2c5/0x630
? wait_for_completion+0xa4/0x120
schedule+0x3e/0xc0
schedule_timeout+0x1c9/0x320
? resched_curr+0x1f/0xd0
? wait_for_completion+0xa4/0x120
wait_for_completion+0xc3/0x120
? wake_up_q+0x60/0x60
__flush_work+0x131/0x1e0
? flush_workqueue_prep_pwqs+0x130/0x130
bdi_unregister+0xb9/0x130
del_gendisk+0x2d2/0x2e0
nvme_ns_remove+0xed/0x110 [nvme_core]
nvme_remove_namespaces+0x96/0xd0 [nvme_core]
nvme_remove+0x5b/0x160 [nvme]
pci_device_remove+0x36/0x90
device_release_driver_internal+0xdf/0x1c0
nvme_remove_dead_ctrl_work+0x14/0x30 [nvme]
process_one_work+0x1c2/0x3f0
worker_thread+0x48/0x3e0
kthread+0x100/0x140
? current_work+0x30/0x30
? kthread_park+0x80/0x80
ret_from_fork+0x35/0x40
This is not limited to NVMes so exactly same issue can be reproduced by
hot-removing SSD (over Thunderbolt) while the system is suspended.
Prevent this from happening by removing WQ_FREEZABLE from bdi_wq.
Reported-by: AceLan Kao <acelan.kao@canonical.com>
Link: https://marc.info/?l=linux-kernel&m=138695698516487
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385
Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#t
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge hugepage allocation updates from David Rientjes:
"We (mostly Linus, Andrea, and myself) have been discussing offlist how
to implement a sane default allocation strategy for hugepages on NUMA
platforms.
With these reverts in place, the page allocator will happily allocate
a remote hugepage immediately rather than try to make a local hugepage
available. This incurs a substantial performance degradation when
memory compaction would have otherwise made a local hugepage
available.
This series reverts those reverts and attempts to propose a more sane
default allocation strategy specifically for hugepages. Andrea
acknowledges this is likely to fix the swap storms that he originally
reported that resulted in the patches that removed __GFP_THISNODE from
hugepage allocations.
The immediate goal is to return 5.3 to the behavior the kernel has
implemented over the past several years so that remote hugepages are
not immediately allocated when local hugepages could have been made
available because the increased access latency is untenable.
The next goal is to introduce a sane default allocation strategy for
hugepages allocations in general regardless of the configuration of
the system so that we prevent thrashing of local memory when
compaction is unlikely to succeed and can prefer remote hugepages over
remote native pages when the local node is low on memory."
Note on timing: this reverts the hugepage VM behavior changes that got
introduced fairly late in the 5.3 cycle, and that fixed a huge
performance regression for certain loads that had been around since
4.18.
Andrea had this note:
"The regression of 4.18 was that it was taking hours to start a VM
where 3.10 was only taking a few seconds, I reported all the details
on lkml when it was finally tracked down in August 2018.
https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/
__GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
workload degrade like in the "current upstream" above. And it still
would have been that bad as above until 5.3-rc5"
where the bad behavior ends up happening as you fill up a local node,
and without that change, you'd get into the nasty swap storm behavior
due to compaction working overtime to make room for more memory on the
nodes.
As a result 5.3 got the two performance fix reverts in rc5.
However, David Rientjes then noted that those performance fixes in turn
regressed performance for other loads - although not quite to the same
degree. He suggested reverting the reverts and instead replacing them
with two small changes to how hugepage allocations are done (patch
descriptions rephrased by me):
- "avoid expensive reclaim when compaction may not succeed": just admit
that the allocation failed when you're trying to allocate a huge-page
and compaction wasn't successful.
- "allow hugepage fallback to remote nodes when madvised": when that
node-local huge-page allocation failed, retry without forcing the
local node.
but by then I judged it too late to replace the fixes for a 5.3 release.
So 5.3 was released with behavior that harked back to the pre-4.18 logic.
But now we're in the merge window for 5.4, and we can see if this
alternate model fixes not just the horrendous swap storm behavior, but
also restores the performance regression that the late reverts caused.
Fingers crossed.
* emailed patches from David Rientjes <rientjes@google.com>:
mm, page_alloc: allow hugepage fallback to remote nodes when madvised
mm, page_alloc: avoid expensive reclaim when compaction may not succeed
Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
Revert "Revert "mm, thp: restore node-local hugepage allocations""
For systems configured to always try hard to allocate transparent
hugepages (thp defrag setting of "always") or for memory that has been
explicitly madvised to MADV_HUGEPAGE, it is often better to fallback to
remote memory to allocate the hugepage if the local allocation fails
first.
The point is to allow the initial call to __alloc_pages_node() to attempt
to defragment local memory to make a hugepage available, if possible,
rather than immediately fallback to remote memory. Local hugepages will
always have a better access latency than remote (huge)pages, so an attempt
to make a hugepage available locally is always preferred.
If memory compaction cannot be successful locally, however, it is likely
better to fallback to remote memory. This could take on two forms: either
allow immediate fallback to remote memory or do per-zone watermark checks.
It would be possible to fallback only when per-zone watermarks fail for
order-0 memory, since that would require local reclaim for all subsequent
faults so remote huge allocation is likely better than thrashing the local
zone for large workloads.
In this case, it is assumed that because the system is configured to try
hard to allocate hugepages or the vma is advised to explicitly want to try
hard for hugepages that remote allocation is better when local allocation
and memory compaction have both failed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory compaction has a couple significant drawbacks as the allocation
order increases, specifically:
- isolate_freepages() is responsible for finding free pages to use as
migration targets and is implemented as a linear scan of memory
starting at the end of a zone,
- failing order-0 watermark checks in memory compaction does not account
for how far below the watermarks the zone actually is: to enable
migration, there must be *some* free memory available. Per the above,
watermarks are not always suffficient if isolate_freepages() cannot
find the free memory but it could require hundreds of MBs of reclaim to
even reach this threshold (read: potentially very expensive reclaim with
no indication compaction can be successful), and
- if compaction at this order has failed recently so that it does not even
run as a result of deferred compaction, looping through reclaim can often
be pointless.
For hugepage allocations, these are quite substantial drawbacks because
these are very high order allocations (order-9 on x86) and falling back to
doing reclaim can potentially be *very* expensive without any indication
that compaction would even be successful.
Reclaim itself is unlikely to free entire pageblocks and certainly no
reliance should be put on it to do so in isolation (recall lumpy reclaim).
This means we should avoid reclaim and simply fail hugepage allocation if
compaction is deferred.
It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory. If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation rather than thrashing the zone.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 92717d429b.
Since commit a8282608c8 ("Revert "mm, thp: restore node-local hugepage
allocations"") is reverted in this series, it is better to restore the
previous 5.2 behavior between the thp allocation and the page allocator
rather than to attempt any consolidation or cleanup for a policy that is
now reverted. It's less risky during an rc cycle and subsequent patches
in this series further modify the same policy that the pre-5.3 behavior
implements.
Consolidation and cleanup can be done subsequent to a sane default page
allocation strategy, so this patch reverts a cleanup done on a strategy
that is now reverted and thus is the least risky option.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit a8282608c8.
The commit references the original intended semantic for MADV_HUGEPAGE
which has subsequently taken on three unique purposes:
- enables or disables thp for a range of memory depending on the system's
config (is thp "enabled" set to "always" or "madvise"),
- determines the synchronous compaction behavior for thp allocations at
fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
and
- reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
clear previous hugepage advice).
These are the three purposes that currently exist in 5.2 and over the
past several years that userspace has been written around. Adding a
NUMA locality preference adds a fourth dimension to an already conflated
advice mode.
Based on the semantic that MADV_HUGEPAGE has provided over the past
several years, there exist workloads that use the tunable based on these
principles: specifically that the allocation should attempt to
defragment a local node before falling back. It is agreed that remote
hugepages typically (but not always) have a better access latency than
remote native pages, although on Naples this is at parity for
intersocket.
The revert commit that this patch reverts allows hugepage allocation to
immediately allocate remotely when local memory is fragmented. This is
contrary to the semantic of MADV_HUGEPAGE over the past several years:
that is, memory compaction should be attempted locally before falling
back.
The performance degradation of remote hugepages over local hugepages on
Rome, for example, is 53.5% increased access latency. For this reason,
the goal is to revert back to the 5.2 and previous behavior that would
attempt local defragmentation before falling back. With the patch that
is reverted by this patch, we see performance degradations at the tail
because the allocator happily allocates the remote hugepage rather than
even attempting to make a local hugepage available.
zone_reclaim_mode is not a solution to this problem since it does not
only impact hugepage allocations but rather changes the memory
allocation strategy for *all* page allocations.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many common parts between MADV_COLD and MADV_PAGEOUT.
This patch factor them out to save code duplication.
Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a process expects no accesses to a certain memory range for a long
time, it could hint kernel that the pages can be reclaimed instantly but
data should be preserved for future use. This could reduce workingset
eviction so it ends up increasing performance.
This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
MADV_PAGEOUT can be used by a process to mark a memory range as not
expected to be used for a long time so that kernel reclaims *any LRU*
pages instantly. The hint can help kernel in deciding which pages to
evict proactively.
A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
intentionally because it's automatically bounded by PMD size. If PMD
size(e.g., 256) makes some trouble, we could fix it later by limit it to
SWAP_CLUSTER_MAX[1].
- man-page material
MADV_PAGEOUT (since Linux x.x)
Do not expect access in the near future so pages in the specified
regions could be reclaimed instantly regardless of memory pressure.
Thus, access in the range after successful operation could cause
major page fault but never lose the up-to-date contents unlike
MADV_DONTNEED. Pages belonging to a shared mapping are only processed
if a write access is allowed for the calling process.
MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
VM_PFNMAP pages.
[1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
[minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
as default. It is for preventing to reclaim dirty pages when CMA try to
migrate pages. Strictly speaking, we don't need it because CMA didn't
allow to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
Moreover, it has a problem to prevent anonymous pages's swap out even
though force_reclaim = true in shrink_page_list on upcoming patch. So
this patch makes references's default value to PAGEREF_RECLAIM and rename
force_reclaim with ignore_references to make it more clear.
This is a preparatory work for next patch.
Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
- Background
The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start. While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.
To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon). They are likely to be killed by
lmkd if the system has to reclaim memory. In that sense they are similar
to entries in any other cache. Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.
- Problem
Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap. Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed(we measured performance swapping out vs.
zapping the memory by killing a process. Unsurprisingly, zapping is 10x
times faster even though we use zram which is much faster than real
storage) so kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state. Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.
To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly. These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space. MADV_PAGEOUT is similar to
MADV_DONTNEED in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed when memory pressure rises.
This patch (of 5):
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use. This could reduce
workingset eviction so it ends up increasing performance.
This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.
It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
active file page -> inactive file LRU
active anon page -> inacdtive anon LRU
Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because the
content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault. Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages. It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied. Even, it could give a bonus to make
them be reclaimed on swapless system. However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost. Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU. Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device. Let's start simpler way without adding
complexity at this moment. However, keep in mind, too that it's a caveat
that workloads with a lot of pages cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.
* man-page material
MADV_COLD (since Linux x.x)
Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies. In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.
MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There isn't a good reason to differentiate between the user address space
layout modification syscalls and the other memory permission/attributes
ones (e.g. mprotect, madvise) w.r.t. the tagged address ABI. Untag the
user addresses on entry to these functions.
Link: http://lkml.kernel.org/r/20190821164730.47450-2-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Dave P Martin <Dave.Martin@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.
get_vaddr_frames uses provided user pointers for vma lookups, which can
only by done with untagged pointers. Instead of locating and changing all
callers of this function, perform untagging in it.
Link: http://lkml.kernel.org/r/28f05e49c92b2a69c4703323d6c12208f3d881fe.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.
mm/gup.c provides a kernel interface that accepts user addresses and
manipulates user pages directly (for example get_user_pages, that is used
by the futex syscall). Since a user can provided tagged addresses, we
need to handle this case.
Add untagging to gup.c functions that use user addresses for vma lookups.
Link: http://lkml.kernel.org/r/4731bddba3c938658c10ff4ed55cc01c60f4c8f8.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory
syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
mremap, msync, munlock, move_pages.
The mmap and mremap syscalls do not currently accept tagged addresses.
Architectures may interpret the tag as a background colour for the
corresponding vma.
Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add RB_DECLARE_CALLBACKS_MAX, which generates augmented rbtree callbacks
for the case where the augmented value is a scalar whose definition
follows a max(f(node)) pattern. This actually covers all present uses of
RB_DECLARE_CALLBACKS, and saves some (source) code duplication in the
various RBCOMPUTE function definitions.
[walken@google.com: fix mm/vmalloc.c]
Link: http://lkml.kernel.org/r/CANN689FXgK13wDYNh1zKxdipeTuALG4eKvKpsdZqKFJ-rvtGiQ@mail.gmail.com
[walken@google.com: re-add check to check_augmented()]
Link: http://lkml.kernel.org/r/20190727022027.GA86863@google.com
Link: http://lkml.kernel.org/r/20190703040156.56953-3-walken@google.com
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
set_zspage_inuse() was introduced in the commit 4f42047bbd ("zsmalloc:
use accessor") but all the users of it were removed later by the commits,
bdb0af7ca8 ("zsmalloc: factor page chain functionality out")
3783689a1a ("zsmalloc: introduce zspage structure")
so the function can be safely removed now.
Link: http://lkml.kernel.org/r/1568658408-19374-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zswap_writeback_entry() maps a handle to read swpentry first, and
then in the most common case it would map the same handle again.
This is ok when zbud is the backend since its mapping callback is
plain and simple, but it slows things down for z3fold.
Since there's hardly a point in unmapping a handle _that_ fast as
zswap_writeback_entry() does when it reads swpentry, the
suggestion is to keep the handle mapped till the end.
Link: http://lkml.kernel.org/r/20190916004640.b453167d3556c4093af4cf7d@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As a zpool_driver, zsmalloc can allocate movable memory because it support
migate pages. But zbud and z3fold cannot allocate movable memory.
Add malloc_support_movable to zpool_driver. If a zpool_driver support
allocate movable memory, set it to true. And add
zpool_malloc_support_movable check malloc_support_movable to make sure if
a zpool support allocate movable memory.
Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.com
Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
madvise_behavior() converts -ENOMEM to -EAGAIN in several places using
identical code.
Move that code to a common error handling path.
No functional changes.
Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The AF_XDP sockets umem mapping interface uses XDP_UMEM_PGOFF_FILL_RING
and XDP_UMEM_PGOFF_COMPLETION_RING offsets. These offsets are
established already and are part of the configuration interface.
But for 32-bit systems, using AF_XDP socket configuration, these values
are too large to pass the maximum allowed file size verification. The
offsets can be tuned off, but instead of changing the existing
interface, let's extend the max allowed file size for sockets.
No one has been using this until this patch with 32 bits as without
this fix af_xdp sockets can't be used at all, so it unblocks af_xdp
socket usage for 32bit systems.
All list of mmap cbs for sockets was verified for side effects and all
of them contain dummy cb - sock_no_mmap() at this moment, except the
following:
xsk_mmap() - it's what this fix is needed for.
tcp_mmap() - doesn't have obvious issues with pgoff - no any references on it.
packet_mmap() - return -EINVAL if it's even set.
Link: http://lkml.kernel.org/r/20190812124326.32146-1-ivan.khoronzhuk@linaro.org
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When addr is out of range of the whole rb_tree, pprev will point to the
right-most node. rb_tree facility already provides a helper function,
rb_last(), to do this task. We can leverage this instead of
reimplementing it.
This patch refines find_vma_prev() with rb_last() to make it a little
nicer to read.
[akpm@linux-foundation.org: little cleanup, per Vlastimil]
Link: http://lkml.kernel.org/r/20190809001928.4950-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commits selects ARCH_HAS_ELF_RANDOMIZE when an arch uses the generic
topdown mmap layout functions so that this security feature is on by
default.
Note that this commit also removes the possibility for arm64 to have elf
randomization and no MMU: without MMU, the security added by randomization
is worth nothing.
Link: http://lkml.kernel.org/r/20190730055113.23635-6-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arm64 handles top-down mmap layout in a way that can be easily reused by
other architectures, so make it available in mm. It then introduces a new
config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other
architectures to benefit from those functions. Note that this new config
depends on MMU being enabled, if selected without MMU support, a warning
will be thrown.
Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Hogan <jhogan@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Provide generic top-down mmap layout functions", v6.
This series introduces generic functions to make top-down mmap layout
easily accessible to architectures, in particular riscv which was the
initial goal of this series. The generic implementation was taken from
arm64 and used successively by arm, mips and finally riscv.
Note that in addition the series fixes 2 issues:
- stack randomization was taken into account even if not necessary.
- [1] fixed an issue with mmap base which did not take into account
randomization but did not report it to arm and mips, so by moving arm64
into a generic library, this problem is now fixed for both
architectures.
This work is an effort to factorize architecture functions to avoid code
duplication and oversights as in [1].
[1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html
This patch (of 14):
This preparatory commit moves this function so that further introduction
of generic topdown mmap layout is contained only in mm/util.c.
Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
khugepaged needs exclusive mmap_sem to access page table. When it fails
to lock mmap_sem, the page will fault in as pte-mapped THP. As the page
is already a THP, khugepaged will not handle this pmd again.
This patch enables the khugepaged to retry collapse the page table.
struct mm_slot (in khugepaged.c) is extended with an array, containing
addresses of pte-mapped THPs. We use array here for simplicity. We can
easily replace it with more advanced data structures when needed.
In khugepaged_scan_mm_slot(), if the mm contains pte-mapped THP, we try to
collapse the page table.
Since collapse may happen at an later time, some pages may already fault
in. collapse_pte_mapped_thp() is added to properly handle these pages.
collapse_pte_mapped_thp() also double checks whether all ptes in this pmd
are mapping to the same THP. This is necessary because some subpage of
the THP may be replaced, for example by uprobe. In such cases, it is not
possible to collapse the pmd.
[kirill.shutemov@linux.intel.com: add comments for retract_page_tables()]
Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box
Link: http://lkml.kernel.org/r/20190815164525.1848545-6-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new foll_flag: FOLL_SPLIT_PMD. As the name says
FOLL_SPLIT_PMD splits huge pmd for given mm_struct, the underlining huge
page stays as-is.
FOLL_SPLIT_PMD is useful for cases where we need to use regular pages, but
would switch back to huge page and huge pmd on. One of such example is
uprobe. The following patches use FOLL_SPLIT_PMD in uprobe.
Link: http://lkml.kernel.org/r/20190815164525.1848545-4-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "THP aware uprobe", v13.
This patchset makes uprobe aware of THPs.
Currently, when uprobe is attached to text on THP, the page is split by
FOLL_SPLIT. As a result, uprobe eliminates the performance benefit of
THP.
This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduces
FOLL_SPLIT_PMD, which only split PMD for uprobe.
After all uprobes within the THP are removed, the PTE-mapped pages are
regrouped as huge PMD.
This set (plus a few THP patches) is also available at
https://github.com/liu-song-6/linux/tree/uprobe-thp
This patch (of 6):
Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we
can use them in other files.
Link: http://lkml.kernel.org/r/20190815164525.1848545-2-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <matthew.wilcox@oracle.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently THP deferred split shrinker is not memcg aware, this may cause
premature OOM with some configuration. For example the below test would
run into premature OOM easily:
$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000
transhuge-stress comes from kernel selftest.
It is easy to hit OOM, but there are still a lot THP on the deferred split
queue, memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.
Convert deferred split shrinker memcg aware by introducing per memcg
deferred split queue. The THP should be on either per node or per memcg
deferred split queue if it belongs to a memcg. When the page is
immigrated to the other memcg, it will be immigrated to the target memcg's
deferred split queue too.
Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.
[yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently shrinker is just allocated and can work when memcg kmem is
enabled. But, THP deferred split shrinker is not slab shrinker, it
doesn't make too much sense to have such shrinker depend on memcg kmem.
It should be able to reclaim THP even though memcg kmem is disabled.
Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinker.
When memcg kmem is disabled, just such shrinkers can be called in
shrinking memcg slab.
[yang.shi@linux.alibaba.com: add comment]
Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A later patch makes THP deferred split shrinker memcg aware, but it needs
page->mem_cgroup information in THP destructor, which is called after
mem_cgroup_uncharge() now.
So move mem_cgroup_uncharge() from __page_cache_release() to compound page
destructor, which is called by both THP and other compound pages except
HugeTLB. And call it in __put_single_page() for single order page.
Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Make deferred split shrinker memcg aware", v6.
Currently THP deferred split shrinker is not memcg aware, this may cause
premature OOM with some configuration. For example the below test would
run into premature OOM easily:
$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000
transhuge-stress comes from kernel selftest.
It is easy to hit OOM, but there are still a lot THP on the deferred split
queue, memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.
Convert deferred split shrinker memcg aware by introducing per memcg
deferred split queue. The THP should be on either per node or per memcg
deferred split queue if it belongs to a memcg. When the page is
immigrated to the other memcg, it will be immigrated to the target memcg's
deferred split queue too.
Reuse the second tail page's deferred_list for per memcg list since the
same THP can't be on multiple deferred split queues.
Make deferred split shrinker not depend on memcg kmem since it is not
slab. It doesn't make sense to not shrink THP even though memcg kmem is
disabled.
With the above change the test demonstrated above doesn't trigger OOM even
though with cgroup.memory=nokmem.
This patch (of 4):
Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.
Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In previous patch, an application could put part of its text section in
THP via madvise(). These THPs will be protected from writes when the
application is still running (TXTBSY). However, after the application
exits, the file is available for writes.
This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write. A new counter nr_thps is added to struct
address_space. In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.
Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is (hopefully) the first step to enable THP for non-shmem
filesystems.
This patch enables an application to put part of its text sections to THP
via madvise, for example:
madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);
We tried to reuse the logic for THP on tmpfs.
Currently, write is not supported for non-shmem THP. khugepaged will only
process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
(see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
execve(). This requirement limits non-shmem THP to text sections.
The next patch will handle writes, which would only happen when the all
the vmas with VM_DENYWRITE are unmapped.
An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
feature.
[songliubraving@fb.com: fix build without CONFIG_SHMEM]
Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
[songliubraving@fb.com: fix double unlock in collapse_file()]
Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Next patch will add khugepaged support of non-shmem files. This patch
renames these two functions to reflect the new functionality:
collapse_shmem() => collapse_file()
khugepaged_scan_shmem() => khugepaged_scan_file()
Link: http://lkml.kernel.org/r/20190801184244.3169074-6-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for non-shmem THP, this patch adds a few stats and exposes
them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and
/proc/<pid>/task/<tid>/smaps.
This patch is mostly a rewrite of Kirill A. Shutemov's earlier version:
https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/
Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With THP, current check of offset:
VM_BUG_ON_PAGE(page->index != offset, page);
is no longer accurate. Update it to:
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
Link: http://lkml.kernel.org/r/20190801184244.3169074-4-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to previous patch, pagecache_get_page() avoids race condition with
truncate by checking page->mapping == mapping. This does not work for
compound pages. This patch let it check compound_head(page)->mapping
instead.
Link: http://lkml.kernel.org/r/20190801184244.3169074-3-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Enable THP for text section of non-shmem files", v10;
This patchset follows up discussion at LSF/MM 2019. The motivation is to
put text section of an application in THP, and thus reduces iTLB miss rate
and improves performance. Both Facebook and Oracle showed strong
interests to this feature.
To make reviews easier, this set aims a mininal valid product. Current
version of the work does not have any changes to file system specific
code. This comes with some limitations (discussed later).
This set enables an application to "hugify" its text section by simply
running something like:
madvise(0x600000, 0x80000, MADV_HUGEPAGE);
Before this call, the /proc/<pid>/maps looks like:
00400000-074d0000 r-xp 00000000 00:27 2006927 app
After this call, part of the text section is split out and mapped to
THP:
00400000-00425000 r-xp 00000000 00:27 2006927 app
00600000-00e00000 r-xp 00200000 00:27 2006927 app <<< on THP
00e00000-074d0000 r-xp 00a00000 00:27 2006927 app
Limitations:
1. This only works for text section (vma with VM_DENYWRITE).
2. Original limitation #2 is removed in v3.
We gated this feature with an experimental config, READ_ONLY_THP_FOR_FS.
Once we get better support on the write path, we can remove the config and
enable it by default.
Tested cases:
1. Tested with btrfs and ext4.
2. Tested with real work application (memcache like caching service).
3. Tested with "THP aware uprobe":
https://patchwork.kernel.org/project/linux-mm/list/?series=131339
This patch (of 7):
Currently, filemap_fault() avoids race condition with truncate by checking
page->mapping == mapping. This does not work for compound pages. This
patch let it check compound_head(page)->mapping instead.
Link: http://lkml.kernel.org/r/20190801184244.3169074-2-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages, the
pages will be interleaved between all nodes of the system. If nodes are
not equal, it is quite possible for one node to fill up before the others.
When this happens, the code still attempts to allocate pages from the
full node. This results in calls to direct reclaim and compaction which
slow things down considerably.
When allocating pool pages, note the state of the previous allocation for
each node. If previous allocation failed, do not use the aggressive retry
algorithm on successive attempts. The allocation will still succeed if
there is memory available, but it will not try as hard to free up memory.
Link: http://lkml.kernel.org/r/20190806014744.15446-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz reports that "hugetlb allocations could stall for minutes or
hours when should_compact_retry() would return true more often then it
should. Specifically, this was in the case where compact_result was
COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
made."
The problem is that the compaction_withdrawn() test in
should_compact_retry() includes compaction outcomes that are only possible
on low compaction priority, and results in a retry without increasing the
priority. This may result in furter reclaim, and more incomplete
compaction attempts.
With this patch, compaction priority is raised when possible, or
should_compact_retry() returns false.
The COMPACT_SKIPPED result doesn't really fit together with the other
outcomes in compaction_withdrawn(), as that's a result caused by
insufficient order-0 pages, not due to low compaction priority. With this
patch, it is moved to a new compaction_needs_reclaim() function, and for
that outcome we keep the current logic of retrying if it looks like
reclaim will be able to help.
Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com
Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit "mm, reclaim: make should_continue_reclaim perform dryrun
detection", closer look at the function shows, that nr_reclaimed == 0
means the function will always return false. And since non-zero
nr_reclaimed implies non_zero nr_scanned, testing nr_scanned serves no
purpose, and so does the testing for __GFP_RETRY_MAYFAIL.
This patch thus cleans up the function to test only !nr_reclaimed upfront,
and remove the __GFP_RETRY_MAYFAIL test and nr_scanned parameter
completely. Comment is also updated, explaining that approximating "full
LRU list has been scanned" with nr_scanned == 0 didn't really work.
Link: http://lkml.kernel.org/r/20190806014744.15446-3-mike.kravetz@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "address hugetlb page allocation stalls", v2.
Allocation of hugetlb pages via sysctl or procfs can stall for minutes or
hours. A simple example on a two node system with 8GB of memory is as
follows:
echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 4096 > /proc/sys/vm/nr_hugepages
Obviously, both allocation attempts will fall short of their 8GB goal.
However, one or both of these commands may stall and not be interruptible.
The issues were initially discussed in mail thread [1] and RFC code at
[2].
This series addresses the issues causing the stalls. There are two
distinct fixes, a cleanup, and an optimization. The reclaim patch by
Hillf and compaction patch by Vlasitmil address corner cases in their
respective areas. hugetlb page allocation could stall due to either of
these issues. Vlasitmil added a cleanup patch after Hillf's
modifications. The hugetlb patch by Mike is an optimization suggested
during the debug and development process.
[1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com
[2] http://lkml.kernel.org/r/20190724175014.9935-1-mike.kravetz@oracle.com
This patch (of 4):
Address the issue of should_continue_reclaim returning true too often for
__GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned. This was
observed during hugetlb page allocation causing stalls for minutes or
hours.
We can stop reclaiming pages if compaction reports it can make a progress.
There might be side-effects for other high-order allocations that would
potentially benefit from reclaiming more before compaction so that they
would be faster and less likely to stall. However, the consequences of
premature/over-reclaim are considered worse.
We can also bail out of reclaiming pages if we know that there are not
enough inactive lru pages left to satisfy the costly allocation.
We can give up reclaiming pages too if we see dryrun occur, with the
certainty of plenty of inactive pages. IOW with dryrun detected, we are
sure we have reclaimed as many pages as we could.
Link: http://lkml.kernel.org/r/20190806014744.15446-2-mike.kravetz@oracle.com
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cgroup v1 memcg controller has exposed a dedicated kmem limit to users
which turned out to be really a bad idea because there are paths which
cannot shrink the kernel memory usage enough to get below the limit (e.g.
because the accounted memory is not reclaimable). There are cases when
the failure is even not allowed (e.g. __GFP_NOFAIL). This means that the
kmem limit is in excess to the hard limit without any way to shrink and
thus completely useless. OOM killer cannot be invoked to handle the
situation because that would lead to a premature oom killing.
As a result many places might see ENOMEM returning from kmalloc and result
in unexpected errors. E.g. a global OOM killer when there is a lot of
free memory because ENOMEM is translated into VM_FAULT_OOM in #PF path and
therefore pagefault_out_of_memory would result in OOM killer.
Please note that the kernel memory is still accounted to the overall limit
along with the user memory so removing the kmem specific limit should
still allow to contain kernel memory consumption. Unlike the kmem one,
though, it invokes memory reclaim and targeted memcg oom killing if
necessary.
Start the deprecation process by crying to the kernel log. Let's see
whether there are relevant usecases and simply return to EINVAL in the
second stage if nobody complains in few releases.
[akpm@linux-foundation.org: tweak documentation text]
Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_id_get() was introduced in commit 73f576c04b ("mm:memcontrol:
fix cgroup creation failure after many small jobs").
Later, it no longer has any user since the commits,
1f47b61fb4 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
58fa2a5512 ("mm: memcontrol: add sanity checks for memcg->id.ref on get/put")
so safe to remove it.
Link: http://lkml.kernel.org/r/1568648453-5482-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
constrained_alloc() calculates the size of the oom domain by using
node_spanned_pages which is incorrect because this is the full range of
the physical memory range that the numa node occupies rather than the
memory that backs that range which is represented by node_present_pages.
Sparsely populated nodes (e.g. after memory hot remove or simply sparse
due to memory layout) can have really a large difference between the two.
This shouldn't really cause any real user observable problems because the
oom calculates a ratio against totalpages and used memory cannot exceed
present pages but it is confusing and wrong from code point of view.
Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ac311a14c6 ("oom: decouple mems_allowed from
oom_unkillable_task") changed has_intersects_mems_allowed() to
oom_cpuset_eligible(), but didn't change the comment.
Link: http://lkml.kernel.org/r/1566959929-10638-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For an OOM event: print oom_score_adj value for the OOM Killed process to
document what the oom score adjust value was at the time the process was
OOM Killed. The adjustment value can be set by user code and it affects
the resulting oom_score so it is used to influence kill process selection.
When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing
this value is the only documentation of the value for the process being
killed. Having this value on the Killed process message is useful to
document if a miscconfiguration occurred or to confirm that the
oom_score_adj configuration applies as expected.
An example which illustates both misconfiguration and validation that the
oom_score_adj was applied as expected is:
Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
(systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
shmem-rss:0kB pgtables:22kB oom_score_adj:1000
The systemd-udevd is a critical system application that should have an
oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000
making it a highly favored OOM kill target process. The output documents
both the misconfiguration and the fact that the process was correctly
targeted by OOM due to the miconfiguration. This can be quite helpful for
triage and problem determination.
The addition of the pgtables_bytes shows page table usage by the process
and is a useful measure of the memory size of the process.
Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com
Signed-off-by: Edward Chron <echron@arista.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the event of an oom kill, useful information about the killed process
is printed to dmesg. Users, especially system administrators, will find
it useful to immediately see the UID of the process.
We already print uid when dumping eligible tasks so it is not overly hard
to find that information in the oom report. However this information is
unavailable when dumping of eligible tasks is disabled.
In the following example, abuse_the_ram is the name of a program that
attempts to iteratively allocate all available memory until it is stopped
by force.
Current message:
Out of memory: Killed process 35389 (abuse_the_ram)
total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
shmem-rss:0kB
Patched message:
Out of memory: Killed process 2739 (abuse_the_ram),
total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
shmem-rss:0kB, UID:0
[akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk]
Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1) task_nodes = cpuset_mems_allowed(current);
-> cpuset_mems_allowed() guaranteed to return some non-empty
subset of node_states[N_MEMORY].
2) nodes_and(*new, *new, task_nodes);
-> after nodes_and(), the 'new' should be empty or appropriate
nodemask(online node and with memory).
After 1) and 2), we could remove unnecessary check whether the 'new'
AND node_states[N_MEMORY] is empty.
Link: http://lkml.kernel.org/r/20190806023634.55356-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and
COMPACTFREE_SCANNED in compact_zone(). We should clear them before
scanning a new zone. In the proc triggered compaction, we forgot clearing
them.
[laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
[akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
[vbabka@suse.cz: squash compact_zone() list_head init as well]
Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
[akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
Fixes: 7f354a548d ("mm, compaction: add vmstats for kcompactd work")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there is a leak in init_z3fold_page() -- it allocates handles
from kmem cache even for headless pages, but then they are never used and
never freed, so eventually kmem cache may get exhausted. This patch
provides a fix for that.
Link: http://lkml.kernel.org/r/20190917185352.44cf285d3ebd9e64548de5de@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reported-by: Markus Linnala <markus.linnala@gmail.com>
Tested-by: Markus Linnala <markus.linnala@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When compiling a kernel with W=1, there are several of those warnings due
to arm64 overriding a field on purpose. Just disable those warnings for
both GCC and Clang of this file, so it will help dig "gems" hidden in the
W=1 warnings by reducing some noises.
mm/init-mm.c:39:2: warning: initializer overrides prior initialization
of this subobject [-Winitializer-overrides]
INIT_MM_CONTEXT(init_mm)
^~~~~~~~~~~~~~~~~~~~~~~~
./arch/arm64/include/asm/mmu.h:133:9: note: expanded from macro
'INIT_MM_CONTEXT'
.pgd = init_pg_dir,
^~~~~~~~~~~
mm/init-mm.c:30:10: note: previous initialization is here
.pgd = swapper_pg_dir,
^~~~~~~~~~~~~~
Note: there is a side project trying to support explicitly allowing
specific initializer overrides in Clang, but there is no guarantee it
will happen or not.
https://github.com/ClangBuiltLinux/linux/issues/639
Link: http://lkml.kernel.org/r/1566920867-27453-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace open-coded bitmap array initialization of init_mm.cpu_bitmask with
neat CPU_BITS_NONE macro.
And, since init_mm.cpu_bitmask is statically set to zero, there is no way
to clear it again in start_kernel().
Link: http://lkml.kernel.org/r/1565703815-8584-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If !area->pages statement is true where memory allocation fails, area is
freed.
In this case 'area->pages = pages' should not executed. So move
'area->pages = pages' after if statement.
[akpm@linux-foundation.org: give area->pages the same treatment]
Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Objective
---------
The current implementation of struct vmap_area wasted space.
After applying this commit, sizeof(struct vmap_area) has been
reduced from 11 words to 8 words.
Description
-----------
1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem
because
A) "subtree_max_size" is only used when vmap_area is in "free" tree
B) "vm" is only used when vmap_area is in "busy" tree
C) "purge_list" is only used when vmap_area is in vmap_purge_list
2) Eliminate "flags".
;Since only one flag VM_VM_AREA is being used, and the same thing can be
done by judging whether "vm" is NULL, then the "flags" can be eliminated.
Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The busy tree can be quite big, even though the area is freed or unmapped
it still stays there until "purge" logic removes it.
1) Optimize and reduce the size of "busy" tree by removing a node from
it right away as soon as user triggers free paths. It is possible to
do so, because the allocation is done using another augmented tree.
The vmalloc test driver shows the difference, for example the
"fix_size_alloc_test" is ~11% better comparing with default configuration:
sudo ./test_vmalloc.sh performance
<default>
Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec
<default>
<this patch>
Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec
<this patch>
2) Since the busy tree now contains allocated areas only and does not
interfere with lazily free nodes, introduce the new function
show_purge_info() that dumps "unpurged" areas that is propagated
through "/proc/vmallocinfo".
3) Eliminate VM_LAZY_FREE flag.
Link: http://lkml.kernel.org/r/20190716152656.12255-2-lpf.vector@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no possibility for memmap to be NULL in the current codebase.
This check was added in commit 95a4774d05 ("memory-hotplug: update
mce_bad_pages when removing the memory") where memmap was originally
inited to NULL, and only conditionally given a value.
The code that could have passed a NULL has been removed by commit
ba72b4c8cf ("mm/sparsemem: support sub-section hotplug"), so there is no
longer a possibility that memmap can be NULL.
Link: http://lkml.kernel.org/r/20190829035151.20975-1-alastair@d-silva.org
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the function written to do it instead.
Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.com
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__pfn_to_section is defined as __nr_to_section(pfn_to_section_nr(pfn)).
Since we already get section_nr, it is not necessary to get mem_section
from start_pfn. By doing so, we reduce one redundant operation.
Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Tested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The size argument passed into sparse_buffer_alloc() has already been
aligned with PAGE_SIZE or PMD_SIZE.
If the size after aligned is not power of 2 (e.g. 0x480000), the
PTR_ALIGN() will return wrong value. Use roundup to round sparsemap_buf
up to next multiple of size.
Link: http://lkml.kernel.org/r/20190705114826.28586-1-lecopzer.chen@mediatek.com
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparse_buffer_alloc(xsize) gets the size of memory from sparsemap_buf
after being aligned with the size. However, the size is at least
PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) and usually larger
than PAGE_SIZE.
Also, sparse_buffer_fini() only frees memory between sparsemap_buf and
sparsemap_buf_end, since sparsemap_buf may be changed by PTR_ALIGN()
first, the aligned space before sparsemap_buf is wasted and no one will
touch it.
In our ARM32 platform (without SPARSEMEM_VMEMMAP)
Sparse_buffer_init
Reserve d359c000 - d3e9c000 (9M)
Sparse_buffer_alloc
Alloc d3a00000 - d3E80000 (4.5M)
Sparse_buffer_fini
Free d3e80000 - d3e9c000 (~=100k)
The reserved memory between d359c000 - d3a00000 (~=4.4M) is unfreed.
In ARM64 platform (with SPARSEMEM_VMEMMAP)
sparse_buffer_init
Reserve ffffffc07d623000 - ffffffc07f623000 (32M)
Sparse_buffer_alloc
Alloc ffffffc07d800000 - ffffffc07f600000 (30M)
Sparse_buffer_fini
Free ffffffc07f600000 - ffffffc07f623000 (140K)
The reserved memory between ffffffc07d623000 - ffffffc07d800000
(~=1.9M) is unfreed.
Let's explicit free redundant aligned memory.
[arnd@arndb.de: mark sparse_buffer_free as __meminit]
Link: http://lkml.kernel.org/r/20190709185528.3251709-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20190705114730.28534-1-lecopzer.chen@mediatek.com
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
walk_system_ram_range() will fail with -EINVAL in case
online_pages_range() was never called (== no resource applicable in the
range). Otherwise, we will always call online_pages_range() with nr_pages
> 0 and, therefore, have online_pages > 0.
Remove that special handling.
Link: http://lkml.kernel.org/r/20190814154109.3448-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a9cd410a3d ("mm/page_alloc.c: memory hotplug: free pages as
higher order") assumed that any PFN we get via memory resources is aligned
to to MAX_ORDER - 1, I am not convinced that is always true. Let's play
safe, check the alignment and fallback to single pages.
akpm: warn in this situation so we get to find out if and why this ever
occurs.
[akpm@linux-foundation.org: add WARN_ON_ONCE()]
Link: http://lkml.kernel.org/r/20190814154109.3448-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
online_pages always corresponds to nr_pages. Simplify the code, getting
rid of online_pages_blocks(). Add some comments.
Link: http://lkml.kernel.org/r/20190814154109.3448-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
move_pfn_range_to_zone() will set all pages to PG_reserved via
memmap_init_zone(). The only way a page could no longer be reserved would
be if a MEM_GOING_ONLINE notifier would clear PG_reserved - which is not
done (the online_page callback is used for that purpose by e.g., Hyper-V
instead). walk_system_ram_range() will never call online_pages_range()
with duplicate PFNs, so drop the PageReserved() check.
This seems to be a leftover from ancient times where the memmap was
initialized when adding memory and we wanted to check for already onlined
memory.
Link: http://lkml.kernel.org/r/20190814154109.3448-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When offlining a node in try_offline_node(), pgdat is not released. So
that pgdat could be reused in hotadd_new_pgdat(). While we reallocate
pgdat->per_cpu_nodestats if this pgdat is reused.
This patch prevents the memory leak by just allocating per_cpu_nodestats
when it is a new pgdat.
Link: http://lkml.kernel.org/r/20190813020608.10194-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <OSalvador@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each memory block spans the same amount of sections/pages/bytes. The size
is determined before the first memory block is created. No need to store
what we can easily calculate - and the calculations even look simpler now.
Michal brought up the idea of variable-sized memory blocks. However, if
we ever implement something like this, we will need an API compatibility
switch and reworks at various places (most code assumes a fixed memory
block size). So let's cleanup what we have right now.
While at it, fix the variable naming in register_mem_sect_under_node() -
we no longer talk about a single section.
Link: http://lkml.kernel.org/r/20190809110200.2746-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's remove this indirection. We need the zone in the caller either way,
so let's just detect it there. Add some documentation for
move_pfn_range_to_zone() instead.
[akpm@linux-foundation.org: restore newline, per David]
Link: http://lkml.kernel.org/r/20190724142324.3686-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using %px to show the actual address in print_bad_pte()
to help us to debug issue.
Link: http://lkml.kernel.org/r/20190831011816.141002-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: remove quicklist page table caches".
A while ago Nicholas proposed to remove quicklist page table caches [1].
I've rebased his patch on the curren upstream and switched ia64 and sh to
use generic versions of PTE allocation.
[1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com
This patch (of 3):
Remove page table allocator "quicklists". These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.
The numbers in the initial commit look interesting but probably don't
apply anymore. If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.
Also it might be better to instead make more general improvements to page
allocator if this is still so slow.
Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In our testing (camera recording), Miguel and Wei found
unmap_page_range() takes above 6ms with preemption disabled easily.
When I see that, the reason is it holds page table spinlock during
entire 512 page operation in a PMD. 6.2ms is never trivial for user
experince if RT task couldn't run in the time because it could make
frame drop or glitch audio problem.
I had a time to benchmark it via adding some trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range. The testing
device is 2018 premium mobile device.
I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
task runs on little core even though it doesn't have any IPI and LRU
lock contention. It's already too heavy.
If I remove activate_page, 35-40% overhead of zap_pte_range is gone so
most of overhead(about 0.7ms) comes from activate_page via
mark_page_accessed. Thus, if there are LRU contention, that 0.7ms could
accumulate up to several ms.
So this patch adds a check for need_resched() in the loop, and a
preemption point if necessary.
Link: http://lkml.kernel.org/r/20190731061440.GC155569@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Miguel de Dios <migueldedios@google.com>
Reported-by: Wei Wang <wvw@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since ptent will not be changed after previous assignment of entry, it is
not necessary to do the assignment again.
Link: http://lkml.kernel.org/r/20190708082740.21111-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[11~From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()",
v3.
There are about 50+ patches in my tree [2], and I'll be sending out the
remaining ones in a few more groups:
* The block/bio related changes (Jerome mostly wrote those, but I've had
to move stuff around extensively, and add a little code)
* mm/ changes
* other subsystem patches
* an RFC that shows the current state of the tracking patch set. That
can only be applied after all call sites are converted, but it's good to
get an early look at it.
This is part a tree-wide conversion, as described in fc1d8e7cca ("mm:
introduce put_user_page*(), placeholder versions").
This patch (of 3):
Provide more capable variation of put_user_pages_dirty_lock(), and delete
put_user_pages_dirty(). This is based on the following:
1. Lots of call sites become simpler if a bool is passed into
put_user_page*(), instead of making the call site choose which
put_user_page*() variant to call.
2. Christoph Hellwig's observation that set_page_dirty_lock() is
usually correct, and set_page_dirty() is usually a bug, or at least
questionable, within a put_user_page*() calling chain.
This leads to the following API choices:
* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to
hand code that, in the rare case that it's
required.
[jhubbard@nvidia.com: remove unused variable in siw_free_plist()]
Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of our services observed a high rate of cgroup OOM kills in the
presence of large amounts of clean cache. Debugging showed that the
culprit is the shared cgroup iteration in page reclaim.
Under high allocation concurrency, multiple threads enter reclaim at the
same time. Fearing overreclaim when we first switched from the single
global LRU to cgrouped LRU lists, we introduced a shared iteration state
for reclaim invocations - whether 1 or 20 reclaimers are active
concurrently, we only walk the cgroup tree once: the 1st reclaimer
reclaims the first cgroup, the second the second one etc. With more
reclaimers than cgroups, we start another walk from the top.
This sounded reasonable at the time, but the problem is that reclaim
concurrency doesn't scale with allocation concurrency. As reclaim
concurrency increases, the amount of memory individual reclaimers get to
scan gets smaller and smaller. Individual reclaimers may only see one
cgroup per cycle, and that may not have much reclaimable memory. We see
individual reclaimers declare OOM when there is plenty of reclaimable
memory available in cgroups they didn't visit.
This patch does away with the shared iterator, and every reclaimer is
allowed to scan the full cgroup tree and see all of reclaimable memory,
just like it would on a non-cgrouped system. This way, when OOM is
declared, we know that the reclaimer actually had a chance.
To still maintain fairness in reclaim pressure, disallow cgroup reclaim
from bailing out of the tree walk early. Kswapd and regular direct
reclaim already don't bail, so it's not clear why limit reclaim would have
to, especially since it only walks subtrees to begin with.
This change completely eliminates the OOM kills on our service, while
showing no signs of overreclaim - no increased scan rates, %sys time, or
abrupt free memory spikes. I tested across 100 machines that have 64G of
RAM and host about 300 cgroups each.
[ It's possible overreclaim never was a *practical* issue to begin
with - it was simply a concern we had on the mailing lists at the
time, with no real data to back it up. But we have also added more
bail-out conditions deeper inside reclaim (e.g. the proportional
exit in shrink_node_memcg) since. Regardless, now we have data that
suggests full walks are more reliable and scale just fine. ]
Link: http://lkml.kernel.org/r/20190812192316.13615-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 72f0184c8a ("mm, memcg: remove hotplug locking from try_charge")
introduced css_tryget()/css_put() calls in drain_all_stock(), which are
supposed to protect the target memory cgroup from being released during
the mem_cgroup_is_descendant() call.
However, it's not completely safe. In theory, memcg can go away between
reading stock->cached pointer and calling css_tryget().
This can happen if drain_all_stock() races with drain_local_stock()
performed on the remote cpu as a result of a work, scheduled by the
previous invocation of drain_all_stock().
The race is a bit theoretical and there are few chances to trigger it, but
the current code looks a bit confusing, so it makes sense to fix it
anyway. The code looks like as if css_tryget() and css_put() are used to
protect stocks drainage. It's not necessary because stocked pages are
holding references to the cached cgroup. And it obviously won't work for
works, scheduled on other cpus.
So, let's read the stock->cached pointer and evaluate the memory cgroup
inside a rcu read section, and get rid of css_tryget()/css_put() calls.
Link: http://lkml.kernel.org/r/20190802192241.3253165-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're trying to use memory.high to limit workloads, but have found that
containment can frequently fail completely and cause OOM situations
outside of the cgroup. This happens especially with swap space -- either
when none is configured, or swap is full. These failures often also don't
have enough warning to allow one to react, whether for a human or for a
daemon monitoring PSI.
Here is output from a simple program showing how long it takes in usec
(column 2) to allocate a megabyte of anonymous memory (column 1) when a
cgroup is already beyond its memory high setting, and no swap is
available:
[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1035
96 1038
97 1000
98 1036
99 1048
100 1590
101 1968
102 1776
103 1863
104 1757
105 1921
106 1893
107 1760
108 1748
109 1843
110 1716
111 1924
112 1776
113 1831
114 1766
115 1836
116 1588
117 1912
118 1802
119 1857
120 1731
[...]
[System OOM in 2-3 seconds]
The delay does go up extremely marginally past the 100MB memory.high
threshold, as now we spend time scanning before returning to usermode, but
it's nowhere near enough to contain growth. It also doesn't get worse the
more pages you have, since it only considers nr_pages.
The current situation goes against both the expectations of users of
memory.high, and our intentions as cgroup v2 developers. In
cgroup-v2.txt, we claim that we will throttle and only under "extreme
conditions" will memory.high protection be breached. Likewise, cgroup v2
users generally also expect that memory.high should throttle workloads as
they exceed their high threshold. However, as seen above, this isn't
always how it works in practice -- even on banal setups like those with no
swap, or where swap has become exhausted, we can end up with memory.high
being breached and us having no weapons left in our arsenal to combat
runaway growth with, since reclaim is futile.
It's also hard for system monitoring software or users to tell how bad the
situation is, as "high" events for the memcg may in some cases be benign,
and in others be catastrophic. The current status quo is that we fail
containment in a way that doesn't provide any advance warning that things
are about to go horribly wrong (for example, we are about to invoke the
kernel OOM killer).
This patch introduces explicit throttling when reclaim is failing to keep
memcg size contained at the memory.high setting. It does so by applying
an exponential delay curve derived from the memcg's overage compared to
memory.high. In the normal case where the memcg is either below or only
marginally over its memory.high setting, no throttling will be performed.
This composes well with system health monitoring and remediation, as these
allocator delays are factored into PSI's memory pressure calculations.
This both creates a mechanism system administrators or applications
consuming the PSI interface to trivially see that the memcg in question is
struggling and use that to make more reasonable decisions, and permits
them enough time to act. Either of these can act with significantly more
nuance than that we can provide using the system OOM killer.
This is a similar idea to memory.oom_control in cgroup v1 which would put
the cgroup to sleep if the threshold was violated, but it's also
significantly improved as it results in visible memory pressure, and also
doesn't schedule indefinitely, which previously made tracing and other
introspection difficult (ie. it's clamped at 2*HZ per allocation through
MEMCG_MAX_HIGH_DELAY_JIFFIES).
Contrast the previous results with a kernel with this patch:
[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1002
96 1000
97 1002
98 1003
99 1000
100 1043
101 84724
102 330628
103 610511
104 1016265
105 1503969
106 2391692
107 2872061
108 3248003
109 4791904
110 5759832
111 6912509
112 8127818
113 9472203
114 12287622
115 12480079
116 14144008
117 15808029
118 16384500
119 16383242
120 16384979
[...]
As you can see, in the normal case, memory allocation takes around 1000
usec. However, as we exceed our memory.high, things start to increase
exponentially, but fairly leniently at first. Our first megabyte over
memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
next is almost an entire second. This gets worse until we reach our
eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
However, this is still making forward progress, so permits tracing or
further analysis with programs like GDB.
We use an exponential curve for our delay penalty for a few reasons:
1. We run mem_cgroup_handle_over_high to potentially do reclaim after
we've already performed allocations, which means that temporarily
going over memory.high by a small amount may be perfectly legitimate,
even for compliant workloads. We don't want to unduly penalise such
cases.
2. An exponential curve (as opposed to a static or linear delay) allows
ramping up memory pressure stats more gradually, which can be useful
to work out that you have set memory.high too low, without destroying
application performance entirely.
This patch expands on earlier work by Johannes Weiner. Thanks!
[akpm@linux-foundation.org: fix max() warning]
[akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
[akpm@linux-foundation.org: fix it even more]
[chris@chrisdown.name: fix 64-bit divide even more]
Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.
Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/
Kirill and Huang Ying contributed several fixes.
[willy@infradead.org: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kirill Shutemov <kirill@shutemov.name>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Tested-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Tested-by: Qian Cai <cai@lca.pw>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This actually checks that writeback is needed or in progress.
Link: http://lkml.kernel.org/r/156378817069.1087.1302816672037672488.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Functions like filemap_write_and_wait_range() should do nothing if inode
has no dirty pages or pages currently under writeback. But they anyway
construct struct writeback_control and this does some atomic operations if
CONFIG_CGROUP_WRITEBACK=y - on fast path it locks inode->i_lock and
updates state of writeback ownership, on slow path might be more work.
Current this path is safely avoided only when inode mapping has no pages.
For example generic_file_read_iter() calls filemap_write_and_wait_range()
at each O_DIRECT read - pretty hot path.
This patch skips starting new writeback if mapping has no dirty tags set.
If writeback is already in progress filemap_write_and_wait_range() will
wait for it.
Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The debug_pagealloc functionality is useful to catch buggy page allocator
users that cause e.g. use after free or double free. When page
inconsistency is detected, debugging is often simpler by knowing the call
stack of process that last allocated and freed the page. When page_owner
is also enabled, we record the allocation stack trace, but not freeing.
This patch therefore adds recording of freeing process stack trace to page
owner info, if both page_owner and debug_pagealloc are configured and
enabled. With only page_owner enabled, this info is not useful for the
memory leak debugging use case. dump_page() is adjusted to print the
info. An example result of calling __free_pages() twice may look like
this (note the page last free stack trace):
BUG: Bad page state in process bash pfn:13d8f8
page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x1affff800000000()
raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
page dumped because: nonzero _refcount
page_owner tracks the page as freed
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL)
prep_new_page+0x143/0x150
get_page_from_freelist+0x289/0x380
__alloc_pages_nodemask+0x13c/0x2d0
khugepaged+0x6e/0xc10
kthread+0xf9/0x130
ret_from_fork+0x3a/0x50
page last free stack trace:
free_pcp_prepare+0x134/0x1e0
free_unref_page+0x18/0x90
khugepaged+0x7b/0xc10
kthread+0xf9/0x130
ret_from_fork+0x3a/0x50
Modules linked in:
CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x85/0xc0
bad_page.cold+0xba/0xbf
rmqueue_pcplist.isra.0+0x6c5/0x6d0
rmqueue+0x2d/0x810
get_page_from_freelist+0x191/0x380
__alloc_pages_nodemask+0x13c/0x2d0
__get_free_pages+0xd/0x30
__pud_alloc+0x2c/0x110
copy_page_range+0x4f9/0x630
dup_mmap+0x362/0x480
dup_mm+0x68/0x110
copy_process+0x19e1/0x1b40
_do_fork+0x73/0x310
__x64_sys_clone+0x75/0x80
do_syscall_64+0x6e/0x1e0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f10af854a10
...
Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For debugging purposes it might be useful to keep the owner info even
after page has been freed, and include it in e.g. dump_page() when
detecting a bad page state. For that, change the PAGE_EXT_OWNER flag
meaning to "page owner info has been set at least once" and add new
PAGE_EXT_OWNER_ACTIVE for tracking whether page is supposed to be
currently tracked allocated or free. Adjust dump_page() accordingly,
distinguishing free and allocated pages. In the page_owner debugfs file,
keep printing only allocated pages so that existing scripts are not
confused, and also because free pages are irrelevant for the memory
statistics or leak detection that's the typical use case of the file,
anyway.
Link: http://lkml.kernel.org/r/20190820131828.22684-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "debug_pagealloc improvements through page_owner", v2.
The debug_pagealloc functionality serves a similar purpose on the page
allocator level that slub_debug does on the kmalloc level, which is to
detect bad users. One notable feature that slub_debug has is storing
stack traces of who last allocated and freed the object. On page level we
track allocations via page_owner, but that info is discarded when freeing,
and we don't track freeing at all. This series improves those aspects.
With both debug_pagealloc and page_owner enabled, we can then get bug
reports such as the example in Patch 4.
SLUB debug tracking additionally stores cpu, pid and timestamp. This could
be added later, if deemed useful enough to justify the additional page_ext
structure size.
This patch (of 3):
Currently, page owner info is only recorded for the first page of a
high-order allocation, and copied to tail pages in the event of a split
page. With the plan to keep previous owner info after freeing the page,
it would be benefical to record page owner for each subpage upon
allocation. This increases the overhead for high orders, but that should
be acceptable for a debugging option.
The order stored for each subpage is the order of the whole allocation.
This makes it possible to calculate the "head" pfn and to recognize "tail"
pages (quoted because not all high-order allocations are compound pages
with true head and tail pages). When reading the page_owner debugfs file,
keep skipping the "tail" pages so that stats gathered by existing scripts
don't get inflated.
Link: http://lkml.kernel.org/r/20190820131828.22684-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a cleanup patch that replaces two historical uses of
list_move_tail() with relatively recent add_page_to_lru_list_tail().
Link: http://lkml.kernel.org/r/20190716212436.7137-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace 1 << compound_order(page) with compound_nr(page). Minor
improvements in readability.
Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Make working with compound pages easier", v2.
These three patches add three helpers and convert the appropriate
places to use them.
This patch (of 3):
It's unnecessarily hard to find out the size of a potentially huge page.
Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes gcc '-Wunused-but-set-variable' warning:
mm/rmap.c: In function page_mkclean_one:
mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable]
It is not used any more since
commit cdb07bdea2 ("mm/rmap.c: remove redundant variable cend")
Link: http://lkml.kernel.org/r/20190724141453.38536-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add memory corruption identification at bug report for software tag-based
mode. The report shows whether it is "use-after-free" or "out-of-bound"
error instead of "invalid-access" error. This will make it easier for
programmers to see the memory corruption problem.
We extend the slab to store five old free pointer tag and free backtrace,
we can check if the tagged address is in the slab record and make a good
guess if the object is more like "use-after-free" or "out-of-bound".
therefore every slab memory corruption can be identified whether it's
"use-after-free" or "out-of-bound".
[aryabinin@virtuozzo.com: simplify & clenup code]
Link: https://lkml.kernel.org/r/3318f9d7-a760-3cc8-b700-f06108ae745f@virtuozzo.com]
Link: http://lkml.kernel.org/r/20190821180332.11450-1-aryabinin@virtuozzo.com
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only way to obtain the current memory pool size for a running kernel
is to check the kernel config file which is inconvenient. Record it in
the kernel messages.
[akpm@linux-foundation.org: s/memory pool size/memory pool/available/, per Catalin]
Link: http://lkml.kernel.org/r/1565809631-28933-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently kmemleak uses a static early_log buffer to trace all memory
allocation/freeing before the slab allocator is initialised. Such early
log is replayed during kmemleak_init() to properly initialise the kmemleak
metadata for objects allocated up that point. With a memory pool that
does not rely on the slab allocator, it is possible to skip this early log
entirely.
In order to remove the early logging, consider kmemleak_enabled == 1 by
default while the kmem_cache availability is checked directly on the
object_cache and scan_area_cache variables. The RCU callback is only
invoked after object_cache has been initialised as we wouldn't have any
concurrent list traversal before this.
In order to reduce the number of callbacks before kmemleak is fully
initialised, move the kmemleak_init() call to mm_init().
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: remove WARN_ON(), per Catalin]
Link: http://lkml.kernel.org/r/20190812160642.52134-4-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a memory pool for struct kmemleak_object in case the normal
kmem_cache_alloc() fails under the gfp constraints passed by the caller.
The mem_pool[] array size is currently fixed at 16000.
We are not using the existing mempool kernel API since this requires
the slab allocator to be available (for pool->elements allocation). A
subsequent kmemleak patch will replace the static early log buffer with
the pool allocation introduced here and this functionality is required
to be available before the slab was initialised.
Link: http://lkml.kernel.org/r/20190812160642.52134-3-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: kmemleak: Use a memory pool for kmemleak object
allocations", v3.
Following the discussions on v2 of this patch(set) [1], this series takes
slightly different approach:
- it implements its own simple memory pool that does not rely on the
slab allocator
- drops the early log buffer logic entirely since it can now allocate
metadata from the memory pool directly before kmemleak is fully
initialised
- CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE option is renamed to
CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE
- moves the kmemleak_init() call earlier (mm_init())
- to avoid a separate memory pool for struct scan_area, it makes the
tool robust when such allocations fail as scan areas are rather an
optimisation
[1] http://lkml.kernel.org/r/20190727132334.9184-1-catalin.marinas@arm.com
This patch (of 3):
Object scan areas are an optimisation aimed to decrease the false
positives and slightly improve the scanning time of large objects known to
only have a few specific pointers. If a struct scan_area fails to
allocate, kmemleak can still function normally by scanning the full
object.
Introduce an OBJECT_FULL_SCAN flag and mark objects as such when scan_area
allocation fails.
Link: http://lkml.kernel.org/r/20190812160642.52134-2-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure()
when SLUB_DEBUG_CMPXCHG=y, so when SLUB_DEBUG_CMPXCHG=n by default, Clang
will complain that those unused functions.
Link: http://lkml.kernel.org/r/1568752232-5094-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcg_cache_params structure is only embedded into the kmem_cache of
slab and slub allocators as defined in slab_def.h and slub_def.h and used
internally by mm code. There is no needed to expose it in a public
header. So move it from include/linux/slab.h to mm/slab.h. It is just a
refactoring patch with no code change.
In fact both the slub_def.h and slab_def.h should be moved into the mm
directory as well, but that will probably cause many merge conflicts.
Link: http://lkml.kernel.org/r/20190718180827.18758-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink
file to shrink the slab by flushing out all the per-cpu slabs and free
slabs in partial lists. This can be useful to squeeze out a bit more
memory under extreme condition as well as making the active object counts
in /proc/slabinfo more accurate.
This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
option is usually not enabled and "slub_memcg_sysfs=1" not set. Even if
memcg sysfs is turned on, it is too cumbersome and impractical to manage
all those per-memcg sysfs files in a real production system.
So there is no practical way to shrink memcg caches. Fix this by enabling
a proper write to the shrink sysfs file of the root cache to scan all the
available memcg caches and shrink them as well. For a non-root memcg
cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
cache will be shrunk when written.
On a 2-socket 64-core 256-thread arm64 system with 64k page after
a parallel kernel build, the the amount of memory occupied by slabs
before shrinking slabs were:
# grep task_struct /proc/slabinfo
task_struct 53137 53192 4288 61 4 : tunables 0 0
0 : slabdata 872 872 0
# grep "^S[lRU]" /proc/meminfo
Slab: 3936832 kB
SReclaimable: 399104 kB
SUnreclaim: 3537728 kB
After shrinking slabs (by echoing "1" to all shrink files):
# grep "^S[lRU]" /proc/meminfo
Slab: 1356288 kB
SReclaimable: 263296 kB
SUnreclaim: 1092992 kB
# grep task_struct /proc/slabinfo
task_struct 2764 6832 4288 61 4 : tunables 0 0
0 : slabdata 112 112 0
Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
z3fold_page_reclaim()'s retry mechanism is broken: on a second iteration
it will have zhdr from the first one so that zhdr is no longer in line
with struct page. That leads to crashes when the system is stressed.
Fix that by moving zhdr assignment up.
While at it, protect against using already freed handles by using own
local slots structure in z3fold_page_reclaim().
Link: http://lkml.kernel.org/r/20190908162919.830388dc7404d1e2c80f4095@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reported-by: Markus Linnala <markus.linnala@gmail.com>
Reported-by: Chris Murphy <bugzilla@colorremedies.com>
Reported-by: Agustin Dall'Alba <agustin@dallalba.com.ar>
Cc: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the original commit applied, z3fold_zpool_destroy() may get blocked
on wait_event() for indefinite time. Revert this commit for the time
being to get rid of this problem since the issue the original commit
addresses is less severe.
Link: http://lkml.kernel.org/r/20190910123142.7a9c8d2de4d0acbc0977c602@gmail.com
Fixes: d776aaa989 ("mm/z3fold.c: fix race between migration and destruction")
Reported-by: Agustín Dall'Alba <agustin@dallalba.com.ar>
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is more cleanup and consolidation of the hmm APIs and the very
strongly related mmu_notifier interfaces. Many places across the tree
using these interfaces are touched in the process. Beyond that a cleanup
to the page walker API and a few memremap related changes round out the
series:
- General improvement of hmm_range_fault() and related APIs, more
documentation, bug fixes from testing, API simplification &
consolidation, and unused API removal
- Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and
make them internal kconfig selects
- Hoist a lot of code related to mmu notifier attachment out of drivers by
using a refcount get/put attachment idiom and remove the convoluted
mmu_notifier_unregister_no_release() and related APIs.
- General API improvement for the migrate_vma API and revision of its only
user in nouveau
- Annotate mmu_notifiers with lockdep and sleeping region debugging
Two series unrelated to HMM or mmu_notifiers came along due to
dependencies:
- Allow pagemap's memremap_pages family of APIs to work without providing
a struct device
- Make walk_page_range() and related use a constant structure for function
pointers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl1/nnkACgkQOG33FX4g
mxqaRg//c6FqowV1pQlLutvAOAgMdpzfZ9eaaDKngy9RVQxz+k/MmJrdRH/p/mMA
Pq93A1XfwtraGKErHegFXGEDk4XhOustVAVFwvjyXO41dTUdoFVUkti6ftbrl/rS
6CT+X90jlvrwdRY7QBeuo7lxx7z8Qkqbk1O1kc1IOracjKfNJS+y6LTamy6weM3g
tIMHI65PkxpRzN36DV9uCN5dMwFzJ73DWHp1b0acnDIigkl6u5zp6orAJVWRjyQX
nmEd3/IOvdxaubAoAvboNS5CyVb4yS9xshWWMbH6AulKJv3Glca1Aa7QuSpBoN8v
wy4c9+umzqRgzgUJUe1xwN9P49oBNhJpgBSu8MUlgBA4IOc3rDl/Tw0b5KCFVfkH
yHkp8n6MP8VsRrzXTC6Kx0vdjIkAO8SUeylVJczAcVSyHIo6/JUJCVDeFLSTVymh
EGWJ7zX2iRhUbssJ6/izQTTQyCH3YIyZ5QtqByWuX2U7ZrfkqS3/EnBW1Q+j+gPF
Z2yW8iT6k0iENw6s8psE9czexuywa/Lttz94IyNlOQ8rJTiQqB9wLaAvg9hvUk7a
kuspL+JGIZkrL3ouCeO/VA6xnaP+Q7nR8geWBRb8zKGHmtWrb5Gwmt6t+vTnCC2l
olIDebrnnxwfBQhEJ5219W+M1pBpjiTpqK/UdBd92A4+sOOhOD0=
=FRGg
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This is more cleanup and consolidation of the hmm APIs and the very
strongly related mmu_notifier interfaces. Many places across the tree
using these interfaces are touched in the process. Beyond that a
cleanup to the page walker API and a few memremap related changes
round out the series:
- General improvement of hmm_range_fault() and related APIs, more
documentation, bug fixes from testing, API simplification &
consolidation, and unused API removal
- Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
and make them internal kconfig selects
- Hoist a lot of code related to mmu notifier attachment out of
drivers by using a refcount get/put attachment idiom and remove the
convoluted mmu_notifier_unregister_no_release() and related APIs.
- General API improvement for the migrate_vma API and revision of its
only user in nouveau
- Annotate mmu_notifiers with lockdep and sleeping region debugging
Two series unrelated to HMM or mmu_notifiers came along due to
dependencies:
- Allow pagemap's memremap_pages family of APIs to work without
providing a struct device
- Make walk_page_range() and related use a constant structure for
function pointers"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
libnvdimm: Enable unit test infrastructure compile checks
mm, notifier: Catch sleeping/blocking for !blockable
kernel.h: Add non_block_start/end()
drm/radeon: guard against calling an unpaired radeon_mn_unregister()
csky: add missing brackets in a macro for tlb.h
pagewalk: use lockdep_assert_held for locking validation
pagewalk: separate function pointers from iterator data
mm: split out a new pagewalk.h header from mm.h
mm/mmu_notifiers: annotate with might_sleep()
mm/mmu_notifiers: prime lockdep
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
mm/hmm: hmm_range_fault() infinite loop
mm/hmm: hmm_range_fault() NULL pointer bug
mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
mm/mmu_notifiers: remove unregister_no_release
RDMA/odp: remove ib_ucontext from ib_umem
RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
...
- add dma-mapping and block layer helpers to take care of IOMMU
merging for mmc plus subsequent fixups (Yoshihiro Shimoda)
- rework handling of the pgprot bits for remapping (me)
- take care of the dma direct infrastructure for swiotlb-xen (me)
- improve the dma noncoherent remapping infrastructure (me)
- better defaults for ->mmap, ->get_sgtable and ->get_required_mask (me)
- cleanup mmaping of coherent DMA allocations (me)
- various misc cleanups (Andy Shevchenko, me)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl2CSucLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPfrhAAgXZA/EdFPvkkCoDrmgtf3XkudX9gajeCd9g4NZy6
ZBQElTVvm4S0sQj7IXgALnMumDMbbTibW5SQLX5GwQDe+XXBpZ8ajpAnJAXc8a5T
qaFQ4SInr4CgBZf9nZKDkbSBZ1Tu3AQm1c0QI8riRCkrVTuX4L06xpCef4Yh4mgO
rwWEjIioYpQiKZMmu98riXh3ZNfFG3mVJRhKt8B6XJbBgnUnjDOPYGgaUwp6CU20
tFBKL2GaaV0vdLJ5wYhIGXT4DJ8tp9T5n3IYGZv1Ux889RaZEHlCrMxzelYeDbCT
KhZbhcSECGnddsh73t/UX7/KhytuqnfKa9n+Xo6AWuA47xO4c36quOOcTk9M0vE5
TfGDmewgL6WIv4lzokpRn5EkfDhyL33j8eYJrJ8e0ldcOhSQIFk4ciXnf2stWi6O
JrlzzzSid+zXxu48iTfoPdnMr7psTpiMvvRvKfEeMp2FX9Fg6EdMzJYLTEl+COHB
0WwNacZmY3P01+b5EZXEgqKEZevIIdmPKbyM9rPtTjz8BjBwkABHTpN3fWbVBf7/
Ax6OPYyW40xp1fnJuzn89m3pdOxn88FpDdOaeLz892Zd+Qpnro1ayulnFspVtqGM
mGbzA9whILvXNRpWBSQrvr2IjqMRjbBxX3BVACl3MMpOChgkpp5iANNfSDjCftSF
Zu8=
=/wGv
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- add dma-mapping and block layer helpers to take care of IOMMU merging
for mmc plus subsequent fixups (Yoshihiro Shimoda)
- rework handling of the pgprot bits for remapping (me)
- take care of the dma direct infrastructure for swiotlb-xen (me)
- improve the dma noncoherent remapping infrastructure (me)
- better defaults for ->mmap, ->get_sgtable and ->get_required_mask
(me)
- cleanup mmaping of coherent DMA allocations (me)
- various misc cleanups (Andy Shevchenko, me)
* tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping: (41 commits)
mmc: renesas_sdhi_internal_dmac: Add MMC_CAP2_MERGE_CAPABLE
mmc: queue: Fix bigger segments usage
arm64: use asm-generic/dma-mapping.h
swiotlb-xen: merge xen_unmap_single into xen_swiotlb_unmap_page
swiotlb-xen: simplify cache maintainance
swiotlb-xen: use the same foreign page check everywhere
swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtable
xen: remove the exports for xen_{create,destroy}_contiguous_region
xen/arm: remove xen_dma_ops
xen/arm: simplify dma_cache_maint
xen/arm: use dev_is_dma_coherent
xen/arm: consolidate page-coherent.h
xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainance
arm: remove wrappers for the generic dma remap helpers
dma-mapping: introduce a dma_common_find_pages helper
dma-mapping: always use VM_DMA_COHERENT for generic DMA remap
vmalloc: lift the arm flag for coherent mappings to common code
dma-mapping: provide a better default ->get_required_mask
dma-mapping: remove the dma_declare_coherent_memory export
remoteproc: don't allow modular build
...
Pull misc mount API conversions from Al Viro:
"Conversions to new API for shmem and friends and for mount_mtd()-using
filesystems.
As for the rest of the mount API conversions in -next, some of them
belong in the individual trees (e.g. binderfs one should definitely go
through android folks, after getting redone on top of their changes).
I'm going to drop those and send the rest (trivial ones + stuff ACKed
by maintainers) in a separate series - by that point they are
independent from each other.
Some stuff has already migrated into individual trees (NFS conversion,
for example, or FUSE stuff, etc.); those presumably will go through
the regular merges from corresponding trees."
* 'work.mount2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Make fs_parse() handle fs_param_is_fd-type params better
vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API
shmem_parse_one(): switch to use of fs_parse()
shmem_parse_options(): take handling a single option into a helper
shmem_parse_options(): don't bother with mpol in separate variable
shmem_parse_options(): use a separate structure to keep the results
make shmem_fill_super() static
make ramfs_fill_super() static
devtmpfs: don't mix {ramfs,shmem}_fill_super() with mount_single()
vfs: Convert squashfs to use the new mount API
mtd: Kill mount_mtd()
vfs: Convert jffs2 to use the new mount API
vfs: Convert cramfs to use the new mount API
vfs: Convert romfs to use the new mount API
vfs: Add a single-or-reconfig keying to vfs_get_super()
- Remove KM_SLEEP/KM_NOSLEEP.
- Ensure that memory buffers for IO are properly sector-aligned to avoid
problems that the block layer doesn't check.
- Make the bmap scrubber more efficient in its record checking.
- Don't crash xfs_db when superblock inode geometry is corrupt.
- Fix btree key helper functions.
- Remove unneeded error returns for things that can't fail.
- Fix buffer logging bugs in repair.
- Clean up iterator return values.
- Speed up directory entry creation.
- Enable allocation of xattr value memory buffer during lookup.
- Fix readahead racing with truncate/punch hole.
- Other minor cleanups.
- Fix one AGI/AGF deadlock with RENAME_WHITEOUT.
- More BUG -> WARN whackamole.
- Fix various problems with the log failing to advance under certain
circumstances, which results in stalls during mount.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl1yhwsACgkQ+H93GTRK
tOtTig//fYLgFVz3l6ffCb8WkJkmi7iWOJp3eLzK55+3W0++ThNsMlRTOWH7xCpZ
f+3LEvvm1ILBgf4XVlwUGt2HlLmNZeKYmiOl/jZxCH25KdfILRIyeyacAYf9vIWf
NQr5HOutsa1IfEDCiDwEnxuuVbgC+rN8j7Rlp/PpweXwRYjssqRWnGRgaZchLbyr
JZ40D9J1HLooY/yftKrgnxtfL4rmAhPoGdX3DnZmobHYRpFHrY31Ks24w6ogShDu
usczNeShXWlg31B4fVHo/rrVQ0xG77U+w/DTNvrAj0uvAlzvWVVibpaZjZtbhadO
NM0zOG41BY/ExBAHhpg0ieVdYI7wNEftF9gjyT7cXO9soD1mRgH6UKQMCm+o1frF
brtcpgQS2aEyGZaXGBIS23ziT/+LLGcav7LUeo7Rf6yiVoEA+FlsGaymC7l+FGCQ
lcgHdeRkeukdj+GJlmpiedb+Xya2g464CXswW7JtCghdNsypRsI4OdQQ2r8Du+w0
PUwfugv1cMAz99xfSZtSoTa7pimFxb6tHRcoqZVfQCefbKQ0VMJDU/AY7gQ2U3UM
PiFKXgPFo0p4tUvA/9ECTPcMDhMKMv200CGCJKXrokWwHtJ6jrAHb+EobjrfoiyX
+hkGEmzzt3vur7Zt2+YesCH3tZj1UfpsemOlorxYQk3hbsA9HEc=
=TZLp
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"For this cycle we have the usual pile of cleanups and bug fixes, some
performance improvements for online metadata scrubbing, massive
speedups in the directory entry creation code, some performance
improvement in the file ACL lookup code, a fix for a logging stall
during mount, and fixes for concurrency problems.
It has survived a couple of weeks of xfstests runs and merges cleanly.
Summary:
- Remove KM_SLEEP/KM_NOSLEEP.
- Ensure that memory buffers for IO are properly sector-aligned to
avoid problems that the block layer doesn't check.
- Make the bmap scrubber more efficient in its record checking.
- Don't crash xfs_db when superblock inode geometry is corrupt.
- Fix btree key helper functions.
- Remove unneeded error returns for things that can't fail.
- Fix buffer logging bugs in repair.
- Clean up iterator return values.
- Speed up directory entry creation.
- Enable allocation of xattr value memory buffer during lookup.
- Fix readahead racing with truncate/punch hole.
- Other minor cleanups.
- Fix one AGI/AGF deadlock with RENAME_WHITEOUT.
- More BUG -> WARN whackamole.
- Fix various problems with the log failing to advance under certain
circumstances, which results in stalls during mount"
* tag 'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits)
xfs: push the grant head when the log head moves forward
xfs: push iclog state cleaning into xlog_state_clean_log
xfs: factor iclog state processing out of xlog_state_do_callback()
xfs: factor callbacks out of xlog_state_do_callback()
xfs: factor debug code out of xlog_state_do_callback()
xfs: prevent CIL push holdoff in log recovery
xfs: fix missed wakeup on l_flush_wait
xfs: push the AIL in xlog_grant_head_wake
xfs: Use WARN_ON_ONCE for bailout mount-operation
xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT
xfs: define a flags field for the AG geometry ioctl structure
xfs: add a xfs_valid_startblock helper
xfs: remove the unused XFS_ALLOC_USERDATA flag
xfs: cleanup xfs_fsb_to_db
xfs: fix the dax supported check in xfs_ioctl_setattr_dax_invalidate
xfs: Fix stale data exposure when readahead races with hole punch
fs: Export generic_fadvise()
mm: Handle MADV_WILLNEED through vfs_fadvise()
xfs: allocate xattr buffer on demand
xfs: consolidate attribute value copying
...
- Prohibit writing to active swap files and swap partitions.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl1cDHQACgkQ+H93GTRK
tOs1qw//aqXAQ7bpLDl7jx9CSuAighzKir0mHYFm9HUsnuRT6gLqIOVSeugoi8hY
tYhPNzcKHL39YDa1QfKo1RKW6uCwsECHT/5TebLxBkTL3vGGAenPchAcjj89SV54
lQ/h8O6hkDU+KCKC0kmDem7ma7DD5YZmWXDxW/HvnygjCnZ9BFaOeLQt/TPBmOmN
lozPHcdrxhIuCuSTMjIZRq27Zl6uzj5tr+FkT+FWiYDrGhgWT7is6o397SEm7yYT
3JqUQ+ZUOY4IwLlrWiVKqi0IqjvWqhaLzmjZaKF+YC8Ni0sdpaDdsE/uPSCyQ7k7
28qbfypnu7bswakjcekdSX2Dj/SZivFb8AzlqaSIMVlw4STFzjMMYMLib8/OlPES
z1pAjXHypLjNO3dNBYp/mRll+/BQ2NM6oCtnVVQGKVnlcx3oLo+n6JSRK8t74DTf
BkYu93aybBpfE49Fb3VQum+9okg9BdShRxvUp023/WTUaa8aUyIbizn3iTrke/sx
0bC+Vvdr33JZnoO8WKVzSd7COTHOTQ920NodTKAJ9bkF3WKyLM135ctavHrtdAg3
FHBXpN7AjbOaLovLpiy3eb//ghKJwgyhqbN6VCGTudC/nkaXq7y7M3DPbXxQYxom
yCA0qMByMg+cL4BtzS52QEda7xK1iiuQ/3jbdQ4lFuBHhwekVKs=
=Ag/B
-----END PGP SIGNATURE-----
Merge tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull swap access updates from Darrick Wong:
"Prohibit writing to active swap files and swap partitions.
There's no non-malicious use case for allowing userspace to scribble
on storage that the kernel thinks it owns"
* tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
vfs: don't allow writes to swap files
mm: set S_SWAPFILE on blockdev swap devices
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1/no0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmo9EACFXMbdNmEEUMyRSdOkVLlr7ZlTyQi1tLpB
YESDPxdBfybzpi0qa8JSaysGIfvSkSjmSAqBqrWPmASOSOL6CK4bbA4fTYbgPplk
XeHUdgGiG34oCQUn8Xil5reYaTm7I6LQWnWTpVa5fIhAyUYaGJL+987ykoGmpQmB
Dvf3YSc+8H0RTp9PCMVd6UCGPkZbVlLImGad3PF5ULvTEaE4RCXC2aiAgh0p1l5A
J2CkRZ+/mio3zN2O4YN7VdPGfr1Wo1iZ834xbIGLegv1miHXagFk7jwTcC7zIt5t
oSnJnqIg3iCe7SpWt4Bkzw/zy/2UqaspifbCMgw8vychlViVRUHFO5h85Yboo7kQ
OMLEQPcwjm6dTHv5h1iXF9LW1O7NoiYmmgvApU9uOo1HUrl1X7PZ3JEfUsVHxkOO
T4D5igf0Krsl1eAbiwEUQzy7vFZ8PlRHqrHgK+fkyotzHu1BJR7OQkYygEfGFOB/
EfMxplGDpmibYGuWCwDX2bPAmLV3SPUQENReHrfPJRDt5TD1UkFpVGv/PLLhbr0p
cLYI78DKpDSigBpVMmwq5nTYpnex33eyDTTA8C0sakcsdzdmU5qv30y3wm4nTiep
f6gZo6IMXwRg/rCgVVrd9SKQAr/8wEzVlsDW3qyi2pVT8sHIgm0tFv7paihXGdDV
xsKgmTrQQQ==
=Qt+h
-----END PGP SIGNATURE-----
Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Two NVMe pull requests:
- ana log parse fix from Anton
- nvme quirks support for Apple devices from Ben
- fix missing bio completion tracing for multipath stack devices
from Hannes and Mikhail
- IP TOS settings for nvme rdma and tcp transports from Israel
- rq_dma_dir cleanups from Israel
- tracing for Get LBA Status command from Minwoo
- Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
- Some consolidation between the fabrics transports for handling
the CAP register
- reset race with ns scanning fix for fabrics (move fabrics
commands to a dedicated request queue with a different lifetime
from the admin request queue)."
- controller reset and namespace scan races fixes
- nvme discovery log change uevent support
- naming improvements from Keith
- multiple discovery controllers reject fix from James
- some regular cleanups from various people
- Series fixing (and re-fixing) null_blk debug printing and nr_devices
checks (André)
- A few pull requests from Song, with fixes from Andy, Guoqing,
Guilherme, Neil, Nigel, and Yufen.
- REQ_OP_ZONE_RESET_ALL support (Chaitanya)
- Bio merge handling unification (Christoph)
- Pick default elevator correctly for devices with special needs
(Damien)
- Block stats fixes (Hou)
- Timeout and support devices nbd fixes (Mike)
- Series fixing races around elevator switching and device add/remove
(Ming)
- sed-opal cleanups (Revanth)
- Per device weight support for BFQ (Fam)
- Support for blk-iocost, a new model that can properly account cost of
IO workloads. (Tejun)
- blk-cgroup writeback fixes (Tejun)
- paride queue init fixes (zhengbin)
- blk_set_runtime_active() cleanup (Stanley)
- Block segment mapping optimizations (Bart)
- lightnvm fixes (Hans/Minwoo/YueHaibing)
- Various little fixes and cleanups
* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
null_blk: format pr_* logs with pr_fmt
null_blk: match the type of parameter nr_devices
null_blk: do not fail the module load with zero devices
block: also check RQF_STATS in blk_mq_need_time_stamp()
block: make rq sector size accessible for block stats
bfq: Fix bfq linkage error
raid5: use bio_end_sector in r5_next_bio
raid5: remove STRIPE_OPS_REQ_PENDING
md: add feature flag MD_FEATURE_RAID0_LAYOUT
md/raid0: avoid RAID0 data corruption due to layout confusion.
raid5: don't set STRIPE_HANDLE to stripe which is in batch list
raid5: don't increment read_errors on EILSEQ return
nvmet: fix a wrong error status returned in error log page
nvme: send discovery log page change events to userspace
nvme: add uevent variables for controller devices
nvme: enable aen regardless of the presence of I/O queues
nvme-fabrics: allow discovery subsystems accept a kato
nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
nvme: Remove redundant assignment of cq vector
nvme: Assign subsys instance from first ctrl
...
Pull percpu updates from Dennis Zhou:
"A couple of updates to clean up the code with no change in behavior"
* 'for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: Use struct_size() helper
percpu: fix typo in pcpu_setup_first_chunk() comment
percpu: Make pcpu_setup_first_chunk() void function
When running on a system with >512MB RAM with a 32-bit kernel built with:
CONFIG_DEBUG_VIRTUAL=y
CONFIG_HIGHMEM=y
CONFIG_HARDENED_USERCOPY=y
all execve()s will fail due to argv copying into kmap()ed pages, and on
usercopy checking the calls ultimately of virt_to_page() will be looking
for "bad" kmap (highmem) pointers due to CONFIG_DEBUG_VIRTUAL=y:
------------[ cut here ]------------
kernel BUG at ../arch/x86/mm/physaddr.c:83!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc8 #6
Hardware name: Dell Inc. Inspiron 1318/0C236D, BIOS A04 01/15/2009
EIP: __phys_addr+0xaf/0x100
...
Call Trace:
__check_object_size+0xaf/0x3c0
? __might_sleep+0x80/0xa0
copy_strings+0x1c2/0x370
copy_strings_kernel+0x2b/0x40
__do_execve_file+0x4ca/0x810
? kmem_cache_alloc+0x1c7/0x370
do_execve+0x1b/0x20
...
The check is from arch/x86/mm/physaddr.c:
VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
Due to the kmap() in fs/exec.c:
kaddr = kmap(kmapped_page);
...
if (copy_from_user(kaddr+offset, str, bytes_to_copy)) ...
Now we can fetch the correct page to avoid the pfn check. In both cases,
hardened usercopy will need to walk the page-span checker (if enabled)
to do sanity checking.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: f5509cc18d ("mm: Hardened usercopy")
Cc: Matthew Wilcox <willy@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/201909171056.7F2FFD17@keescook
Pull scheduler updates from Ingo Molnar:
- MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.
As perf and the scheduler is getting bigger and more complex,
document the status quo of current responsibilities and interests,
and spread the review pain^H^H^H^H fun via an increase in the Cc:
linecount generated by scripts/get_maintainer.pl. :-)
- Add another series of patches that brings the -rt (PREEMPT_RT) tree
closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
into a new CONFIG_PREEMPTION category that will allow the eventual
introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
to go though.
- Extend the CPU cgroup controller with uclamp.min and uclamp.max to
allow the finer shaping of CPU bandwidth usage.
- Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).
- Improve the behavior of high CPU count, high thread count
applications running under cpu.cfs_quota_us constraints.
- Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.
- Improve CPU isolation housekeeping CPU allocation NUMA locality.
- Fix deadline scheduler bandwidth calculations and logic when cpusets
rebuilds the topology, or when it gets deadline-throttled while it's
being offlined.
- Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
setscheduler() system calls without creating global serialization.
Add new synchronization between cpuset topology-changing events and
the deadline acceptance tests in setscheduler(), which were broken
before.
- Rework the active_mm state machine to be less confusing and more
optimal.
- Rework (simplify) the pick_next_task() slowpath.
- Improve load-balancing on AMD EPYC systems.
- ... and misc cleanups, smaller fixes and improvements - please see
the Git log for more details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
sched/psi: Correct overly pessimistic size calculation
sched/fair: Speed-up energy-aware wake-ups
sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
sched/uclamp: Update CPU's refcount on TG's clamp changes
sched/uclamp: Use TG's clamps to restrict TASK's clamps
sched/uclamp: Propagate system defaults to the root group
sched/uclamp: Propagate parent clamps
sched/uclamp: Extend CPU's cgroup controller
sched/topology: Improve load balancing on AMD EPYC systems
arch, ia64: Make NUMA select SMP
sched, perf: MAINTAINERS update, add submaintainers and reviewers
sched/fair: Use rq_lock/unlock in online_fair_sched_group
cpufreq: schedutil: fix equation in comment
sched: Rework pick_next_task() slow-path
sched: Allow put_prev_task() to drop rq->lock
sched/fair: Expose newidle_balance()
sched: Add task_struct pointer to sched_class::set_curr_task
sched: Rework CPU hotplug task selection
sched/{rt,deadline}: Fix set_next_task vs pick_next_task
sched: Fix kerneldoc comment for ia64_set_curr_task
...
Convert the ramfs, shmem, tmpfs, devtmpfs and rootfs filesystems to the new
internal mount API as the old one will be obsoleted and removed. This
allows greater flexibility in communication of mount parameters between
userspace, the VFS and the filesystem.
See Documentation/filesystems/mount_api.txt for more information.
Note that tmpfs is slightly tricky as it can contain embedded commas, so it
can't be trivially split up using strsep() to break on commas in
generic_parse_monolithic(). Instead, tmpfs has to supply its own generic
parser.
However, if tmpfs changes, then devtmpfs and rootfs, which are wrappers
around tmpfs or ramfs, must change too - and thus so must ramfs, so these
had to be converted also.
[AV: rewritten]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hugh Dickins <hughd@google.com>
cc: linux-mm@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This thing will eventually become our ->parse_param(), while
shmem_parse_options() - ->parse_monolithic(). At that point
shmem_parse_options() will start calling vfs_parse_fs_string(),
rather than calling shmem_parse_one() directly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and copy the data from it into sbinfo in the callers.
For use by remount we need to keep track whether there'd
been options setting max_inodes, max_blocks and huge resp.
and do the sanity checks (and copying) only if such options
had been seen. uid/gid/mode is ignored by remount and
NULL mpol is already explicitly treated as "ignore it",
so we don't need to keep track of those.
Note: theoretically, mpol_parse_string() may return NULL
not in case of error (for default policy), so the assumption
that NULL mpol means "change nothing" is incorrect. However,
that's the mainline behaviour and any changes belong in
a separate patch. If we go for that, we'll need to keep
track of having encountered mpol= option too.
[changes in remount logics from Hugh Dickins folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We need to make sure implementations don't cheat and don't have a possible
schedule/blocking point deeply burried where review can't catch it.
I'm not sure whether this is the best way to make sure all the
might_sleep() callsites trigger, and it's a bit ugly in the code flow.
But it gets the job done.
Inspired by an i915 patch series which did exactly that, because the rules
haven't been entirely clear to us.
Link: https://lore.kernel.org/r/20190826201425.17547-5-daniel.vetter@ffwll.ch
Reviewed-by: Christian König <christian.koenig@amd.com> (v1)
Reviewed-by: Jérôme Glisse <jglisse@redhat.com> (v4)
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use lockdep to check for held locks instead of using home grown asserts.
Link: https://lore.kernel.org/r/20190828141955.22210-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The mm_walk structure currently mixed data and code. Split out the
operations vectors into a new mm_walk_ops structure, and while we are
changing the API also declare the mm_walk structure inside the
walk_page_range and walk_page_vma functions.
Based on patch from Linus Torvalds.
Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a new header for the two handful of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.
Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We want to teach lockdep that mmu notifiers can be called from direct
reclaim paths, since on many CI systems load might never reach that
level (e.g. when just running fuzzer or small functional tests).
I've put the annotation into mmu_notifier_register since only when we have
mmu notifiers registered is there any point in teaching lockdep about
them. Also, we already have a kmalloc(, GFP_KERNEL), so this is safe.
Link: https://lore.kernel.org/r/20190826201425.17547-3-daniel.vetter@ffwll.ch
Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly
easy to provoke a specific notifier to be run on a specific range: Just
prep it, and then munmap() it.
A bit harder, but still doable, is to provoke the mmu notifiers for all
the various callchains that might lead to them. But both at the same time
is really hard to reliably hit, especially when you want to exercise paths
like direct reclaim or compaction, where it's not easy to control what
exactly will be unmapped.
By introducing a lockdep map to tie them all together we allow lockdep to
see a lot more dependencies, without having to actually hit them in a
single challchain while testing.
On Jason's suggestion this is is rolled out for both
invalidate_range_start and invalidate_range_end. They both have the same
calling context, hence we can share the same lockdep map. Note that the
annotation for invalidate_ranage_start is outside of the
mm_has_notifiers(), to make sure lockdep is informed about all paths
leading to this context irrespective of whether mmu notifiers are present
for a given context. We don't do that on the invalidate_range_end side to
avoid paying the overhead twice, there the lockdep annotation is pushed
down behind the mm_has_notifiers() check.
Link: https://lore.kernel.org/r/20190826201425.17547-2-daniel.vetter@ffwll.ch
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct pcpu_alloc_info {
...
struct pcpu_group_info groups[];
};
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
So, replace the following form:
sizeof(*ai) + nr_groups * sizeof(ai->groups[0])
with:
struct_size(ai, groups, nr_groups)
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
There is no reason to print warnings when balloon page allocation fails,
as they are expected and can be handled gracefully. Since VMware
balloon now uses balloon-compaction infrastructure, and suppressed these
warnings before, it is also beneficial to suppress these warnings to
keep the same behavior that the balloon had before.
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
The arm architecture had a VM_ARM_DMA_CONSISTENT flag to mark DMA
coherent remapping for a while. Lift this flag to common code so
that we can use it generically. We also check it in the only place
VM_USERMAP is directly check so that we can entirely replace that
flag as well (although I'm not even sure why we'd want to allow
remapping DMA appings, but I'd rather not change behavior).
Signed-off-by: Christoph Hellwig <hch@lst.de>
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.
However, as is rather unfortunately explained in:
commit 32e45ff43e ("mm: increase RECLAIM_DISTANCE to 30")
the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.
Current AMD EPYC machines have the following NUMA node distances:
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10
where 2 hops is 32.
The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.
For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,
$ numactl -C 0-7,32-39 ./spinner 16
causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.
Override node_reclaim_distance for AMD Zen.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee.Suthikulpanit@amd.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas.Lendacky@amd.com
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Filesystems will need to call this function from their fadvise handlers.
CC: stable@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently handling of MADV_WILLNEED hint calls directly into readahead
code. Handle it by calling vfs_fadvise() instead so that filesystem can
use its ->fadvise() callback to acquire necessary locks or otherwise
prepare for the request.
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Boaz Harrosh <boazh@netapp.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of using raw_cpu_read() use per_cpu() to read the actual data of
the corresponding cpu otherwise we will be reading the data of the
current cpu for the number of online CPUs.
Link: http://lkml.kernel.org/r/20190829203110.129263-1-shakeelb@google.com
Fixes: bb65f89b7d ("mm: memcontrol: flush percpu vmevents before releasing memcg")
Fixes: c350a99ea2 ("mm: memcontrol: flush percpu vmstats before releasing memcg")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adric Blake has noticed[1] the following warning:
WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
[...]
Call Trace:
mem_cgroup_shrink_node+0x9b/0x1d0
mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
balance_pgdat+0x276/0x540
kswapd+0x200/0x3f0
? wait_woken+0x80/0x80
kthread+0xfd/0x130
? balance_pgdat+0x540/0x540
? kthread_park+0x80/0x80
ret_from_fork+0x35/0x40
---[ end trace 727343df67b2398a ]---
which tells us that soft limit reclaim is about to overwrite the
reclaim_state configured up in the call chain (kswapd in this case but
the direct reclaim is equally possible). This means that reclaim stats
would get misleading once the soft reclaim returns and another reclaim
is done.
Fix the warning by dropping set_task_reclaim_state from the soft reclaim
which is always called with reclaim_state set up.
[1] http://lkml.kernel.org/r/CAE1jjeePxYPvw1mw2B3v803xHVR_BNnz0hQUY_JDMN8ny29M6w@mail.gmail.com
Link: http://lkml.kernel.org/r/20190828071808.20410-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Adric Blake <promarbler14@gmail.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 766a4c19d8 ("mm/memcontrol.c: keep local VM counters in sync
with the hierarchical ones") effectively decreased the precision of
per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
That's good for displaying in memory.stat, but brings a serious
regression into the reclaim process.
One issue I've discovered and debugged is the following:
lruvec_lru_size() can return 0 instead of the actual number of pages in
the lru list, preventing the kernel to reclaim last remaining pages.
Result is yet another dying memory cgroups flooding. The opposite is
also happening: scanning an empty lru list is the waste of cpu time.
Also, inactive_list_is_low() can return incorrect values, preventing the
active lru from being scanned and freed. It can fail both because the
size of active and inactive lists are inaccurate, and because the number
of workingset refaults isn't precise. In other words, the result is
pretty random.
I'm not sure, if using the approximate number of slab pages in
count_shadow_number() is acceptable, but issues described above are
enough to partially revert the patch.
Let's keep per-memcg vmstat_local batched (they are only used for
displaying stats to the userspace), but keep lruvec stats precise. This
change fixes the dead memcg flooding on my setup.
Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com
Fixes: 766a4c19d8 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I've noticed that the "slab" value in memory.stat is sometimes 0, even
if some children memory cgroups have a non-zero "slab" value. The
following investigation showed that this is the result of the kmem_cache
reparenting in combination with the per-cpu batching of slab vmstats.
At the offlining some vmstat value may leave in the percpu cache, not
being propagated upwards by the cgroup hierarchy. It means that stats
on ancestor levels are lower than actual. Later when slab pages are
released, the precise number of pages is substracted on the parent
level, making the value negative. We don't show negative values, 0 is
printed instead.
To fix this issue, let's flush percpu slab memcg and lruvec stats on
memcg offlining. This guarantees that numbers on all ancestor levels
are accurate and match the actual number of outstanding slab pages.
Link: http://lkml.kernel.org/r/20190819202338.363363-3-guro@fb.com
Fixes: fb2f2b0adb ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroup foreign inode handling has quite a bit of heuristics and
internal states which sometimes makes it difficult to understand
what's going on. Add tracepoints to improve visibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No modular code uses these, which makes a lot of sense given the wrappers
around them are only called by core mm code.
Link: https://lore.kernel.org/r/20190828142109.29012-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Normally, callers to handle_mm_fault() are supposed to check the
vma->vm_flags first. hmm_range_fault() checks for VM_READ but doesn't
check for VM_WRITE if the caller requests a page to be faulted in with
write permission (via the hmm_range.pfns[] value). If the vma is write
protected, this can result in an infinite loop:
hmm_range_fault()
walk_page_range()
...
hmm_vma_walk_hole()
hmm_vma_walk_hole_()
hmm_vma_do_fault()
handle_mm_fault(FAULT_FLAG_WRITE)
/* returns VM_FAULT_WRITE */
/* returns -EBUSY */
/* returns -EBUSY */
/* returns -EBUSY */
/* loops on -EBUSY and range->valid */
Prevent this by checking for vma->vm_flags & VM_WRITE before calling
handle_mm_fault().
Link: https://lore.kernel.org/r/20190823221753.2514-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Although hmm_range_fault() calls find_vma() to make sure that a vma exists
before calling walk_page_range(), hmm_vma_walk_hole() can still be called
with walk->vma == NULL if the start and end address are not contained
within the vma range.
hmm_range_fault() /* calls find_vma() but no range check */
walk_page_range() /* calls find_vma(), sets walk->vma = NULL */
__walk_page_range()
walk_pgd_range()
walk_p4d_range()
walk_pud_range()
hmm_vma_walk_hole()
hmm_vma_walk_hole_()
hmm_vma_do_fault()
handle_mm_fault(vma=0)
Link: https://lore.kernel.org/r/20190823221753.2514-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There's an inherent mismatch between memcg and writeback. The former
trackes ownership per-page while the latter per-inode. This was a
deliberate design decision because honoring per-page ownership in the
writeback path is complicated, may lead to higher CPU and IO overheads
and deemed unnecessary given that write-sharing an inode across
different cgroups isn't a common use-case.
Combined with inode majority-writer ownership switching, this works
well enough in most cases but there are some pathological cases. For
example, let's say there are two cgroups A and B which keep writing to
different but confined parts of the same inode. B owns the inode and
A's memory is limited far below B's. A's dirty ratio can rise enough
to trigger balance_dirty_pages() sleeps but B's can be low enough to
avoid triggering background writeback. A will be slowed down without
a way to make writeback of the dirty pages happen.
This patch implements foreign dirty recording and foreign mechanism so
that when a memcg encounters a condition as above it can trigger
flushes on bdi_writebacks which can clean its pages. Please see the
comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
details.
A reproducer follows.
write-range.c::
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
static const char *usage = "write-range FILE START SIZE\n";
int main(int argc, char **argv)
{
int fd;
unsigned long start, size, end, pos;
char *endp;
char buf[4096];
if (argc < 4) {
fprintf(stderr, usage);
return 1;
}
fd = open(argv[1], O_WRONLY);
if (fd < 0) {
perror("open");
return 1;
}
start = strtoul(argv[2], &endp, 0);
if (*endp != '\0') {
fprintf(stderr, usage);
return 1;
}
size = strtoul(argv[3], &endp, 0);
if (*endp != '\0') {
fprintf(stderr, usage);
return 1;
}
end = start + size;
while (1) {
for (pos = start; pos < end; ) {
long bread, bwritten = 0;
if (lseek(fd, pos, SEEK_SET) < 0) {
perror("lseek");
return 1;
}
bread = read(0, buf, sizeof(buf) < end - pos ?
sizeof(buf) : end - pos);
if (bread < 0) {
perror("read");
return 1;
}
if (bread == 0)
return 0;
while (bwritten < bread) {
long this;
this = write(fd, buf + bwritten,
bread - bwritten);
if (this < 0) {
perror("write");
return 1;
}
bwritten += this;
pos += bwritten;
}
}
}
}
repro.sh::
#!/bin/bash
set -e
set -x
sysctl -w vm.dirty_expire_centisecs=300000
sysctl -w vm.dirty_writeback_centisecs=300000
sysctl -w vm.dirtytime_expire_seconds=300000
echo 3 > /proc/sys/vm/drop_caches
TEST=/sys/fs/cgroup/test
A=$TEST/A
B=$TEST/B
mkdir -p $A $B
echo "+memory +io" > $TEST/cgroup.subtree_control
echo $((1<<30)) > $A/memory.high
echo $((32<<30)) > $B/memory.high
rm -f testfile
touch testfile
fallocate -l 4G testfile
echo "Starting B"
(echo $BASHPID > $B/cgroup.procs
pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
echo "Waiting 10s to ensure B claims the testfile inode"
sleep 5
sync
sleep 5
sync
echo "Starting A"
(echo $BASHPID > $A/cgroup.procs
pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
v2: Added comments explaining why the specific intervals are being used.
v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
flushing while avoding possible livelocks.
v4: Use get_jiffies_64() and time_before/after64() instead of raw
jiffies_64 and arthimetic comparisons as suggested by Jan.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Separate out wb_get_lookup() which doesn't try to create one if there
isn't already one from wb_get_create(). This will be used by later
patches.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There currently is no way to universally identify and lookup a bdi
without holding a reference and pointer to it. This patch adds an
non-recycling bdi->id and implements bdi_get_by_id() which looks up
bdis by their ids. This will be used by memcg foreign inode flushing.
I left bdi_list alone for simplicity and because while rb_tree does
support rcu assignment it doesn't seem to guarantee lossless walk when
walk is racing aginst tree rebalance operations.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code like this:
ptr = kmalloc(size, GFP_KERNEL);
page = virt_to_page(ptr);
offset = offset_in_page(ptr);
kfree(page_address(page) + offset);
may produce false-positive invalid-free reports on the kernel with
CONFIG_KASAN_SW_TAGS=y.
In the example above we lose the original tag assigned to 'ptr', so
kfree() gets the pointer with 0xFF tag. In kfree() we check that 0xFF
tag is different from the tag in shadow hence print false report.
Instead of just comparing tags, do the following:
1) Check that shadow doesn't contain KASAN_TAG_INVALID. Otherwise it's
double-free and it doesn't matter what tag the pointer have.
2) If pointer tag is different from 0xFF, make sure that tag in the
shadow is the same as in the pointer.
Link: http://lkml.kernel.org/r/20190819172540.19581-1-aryabinin@virtuozzo.com
Fixes: 7f94ffbc4c ("kasan: add hooks implementation for tag-based mode")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In zs_destroy_pool() we call flush_work(&pool->free_work). However, we
have no guarantee that migration isn't happening in the background at
that time.
Since migration can't directly free pages, it relies on free_work being
scheduled to free the pages. But there's nothing preventing an
in-progress migrate from queuing the work *after*
zs_unregister_migration() has called flush_work(). Which would mean
pages still pointing at the inode when we free it.
Since we know at destroy time all objects should be free, no new
migrations can come in (since zs_page_isolate() fails for fully-free
zspages). This means it is sufficient to track a "# isolated zspages"
count by class, and have the destroy logic ensure all such pages have
drained before proceeding. Keeping that state under the class spinlock
keeps the logic straightforward.
In this case a memory leak could lead to an eventual crash if compaction
hits the leaked page. This crash would only occur if people are
changing their zswap backend at runtime (which eventually starts
destruction).
Link: http://lkml.kernel.org/r/20190809181751.219326-2-henryburns@google.com
Fixes: 48b4800a1c ("zsmalloc: page migration support")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In zs_page_migrate() we call putback_zspage() after we have finished
migrating all pages in this zspage. However, the return value is
ignored. If a zs_free() races in between zs_page_isolate() and
zs_page_migrate(), freeing the last object in the zspage,
putback_zspage() will leave the page in ZS_EMPTY for potentially an
unbounded amount of time.
To fix this, we need to do the same thing as zs_page_putback() does:
schedule free_work to occur.
To avoid duplicated code, move the sequence to a new
putback_zspage_deferred() function which both zs_page_migrate() and
zs_page_putback() call.
Link: http://lkml.kernel.org/r/20190809181751.219326-1-henryburns@google.com
Fixes: 48b4800a1c ("zsmalloc: page migration support")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
THP splitting path is missing the split_page_owner() call that
split_page() has.
As a result, split THP pages are wrongly reported in the page_owner file
as order-9 pages. Furthermore when the former head page is freed, the
remaining former tail pages are not listed in the page_owner file at
all. This patch fixes that by adding the split_page_owner() call into
__split_huge_page().
Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz
Fixes: a9627bc5e3 ("mm/page_owner: introduce split_page_owner and replace manual handling")
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similar to vmstats, percpu caching of local vmevents leads to an
accumulation of errors on non-leaf levels. This happens because some
leftovers may remain in percpu caches, so that they are never propagated
up by the cgroup tree and just disappear into nonexistence with on
releasing of the memory cgroup.
To fix this issue let's accumulate and propagate percpu vmevents values
before releasing the memory cgroup similar to what we're doing with
vmstats.
Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
only over online cpus.
Link: http://lkml.kernel.org/r/20190819202338.363363-4-guro@fb.com
Fixes: 42a3003535 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Percpu caching of local vmstats with the conditional propagation by the
cgroup tree leads to an accumulation of errors on non-leaf levels.
Let's imagine two nested memory cgroups A and A/B. Say, a process
belonging to A/B allocates 100 pagecache pages on the CPU 0. The percpu
cache will spill 3 times, so that 32*3=96 pages will be accounted to A/B
and A atomic vmstat counters, 4 pages will remain in the percpu cache.
Imagine A/B is nearby memory.max, so that every following allocation
triggers a direct reclaim on the local CPU. Say, each such attempt will
free 16 pages on a new cpu. That means every percpu cache will have -16
pages, except the first one, which will have 4 - 16 = -12. A/B and A
atomic counters will not be touched at all.
Now a user removes A/B. All percpu caches are freed and corresponding
vmstat numbers are forgotten. A has 96 pages more than expected.
As memory cgroups are created and destroyed, errors do accumulate. Even
1-2 pages differences can accumulate into large numbers.
To fix this issue let's accumulate and propagate percpu vmstat values
before releasing the memory cgroup. At this point these numbers are
stable and cannot be changed.
Since on cpu hotplug we do flush percpu vmstats anyway, we can iterate
only over online cpus.
Link: http://lkml.kernel.org/r/20190819202338.363363-2-guro@fb.com
Fixes: 42a3003535 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 907ec5fca3 ("mm: zero remaining unavailable struct
pages"), struct page of reserved memory is zeroed. This causes
page->flags to be 0 and fixes issues related to reading
/proc/kpageflags, for example, of reserved memory.
The VM_BUG_ON() in move_freepages_block(), however, assumes that
page_zone() is meaningful even for reserved memory. That assumption is
no longer true after the aforementioned commit.
There's no reason why move_freepages_block() should be testing the
legitimacy of page_zone() for reserved memory; its scope is limited only
to pages on the zone's freelist.
Note that pfn_valid() can be true for reserved memory: there is a
backing struct page. The check for page_to_nid(page) is also buggy but
reserved memory normally only appears on node 0 so the zeroing doesn't
affect this.
Move the debug checks to after verifying PageBuddy is true. This
isolates the scope of the checks to only be for buddy pages which are on
the zone's freelist which move_freepages_block() is operating on. In
this case, an incorrect node or zone is a bug worthy of being warned
about (and the examination of struct page is acceptable bcause this
memory is not reserved).
Why does move_freepages_block() gets called on reserved memory? It's
simply math after finding a valid free page from the per-zone free area
to use as fallback. We find the beginning and end of the pageblock of
the valid page and that can bring us into memory that was reserved per
the e820. pfn_valid() is still true (it's backed by a struct page), but
since it's zero'd we shouldn't make any inferences here about comparing
its node or zone. The current node check just happens to succeed most
of the time by luck because reserved memory typically appears on node 0.
The fix here is to validate that we actually have buddy pages before
testing if there's any type of zone or node strangeness going on.
We noticed it almost immediately after bringing 907ec5fca3 in on
CONFIG_DEBUG_VM builds. It depends on finding specific free pages in
the per-zone free area where the math in move_freepages() will bring the
start or end pfn into reserved memory and wanting to claim that entire
pageblock as a new migratetype. So the path will be rare, require
CONFIG_DEBUG_VM, and require fallback to a different migratetype.
Some struct pages were already zeroed from reserve pages before
907ec5fca3c so it theoretically could trigger before this commit. I
think it's rare enough under a config option that most people don't run
that others may not have noticed. I wouldn't argue against a stable tag
and the backport should be easy enough, but probably wouldn't single out
a commit that this is fixing.
Mel said:
: The overhead of the debugging check is higher with this patch although
: it'll only affect debug builds and the path is not particularly hot.
: If this was a concern, I think it would be reasonable to simply remove
: the debugging check as the zone boundaries are checked in
: move_freepages_block and we never expect a zone/node to be smaller than
: a pageblock and stuck in the middle of another zone.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In z3fold_destroy_pool() we call destroy_workqueue(&pool->compact_wq).
However, we have no guarantee that migration isn't happening in the
background at that time.
Migration directly calls queue_work_on(pool->compact_wq), if destruction
wins that race we are using a destroyed workqueue.
Link: http://lkml.kernel.org/r/20190809213828.202833-1-henryburns@google.com
Signed-off-by: Henry Burns <henryburns@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hmm_range_fault() may return NULL pages because some of the pfns are equal
to HMM_PFN_NONE. This happens randomly under memory pressure. The reason
is during the swapped out page pte path, hmm_vma_handle_pte() doesn't
update the fault variable from cpu_flags, so it failed to call
hmm_vam_do_fault() to swap the page in.
The fix is to call hmm_pte_need_fault() to update fault variable.
Fixes: 74eee180b9 ("mm/hmm/mirror: device page fault handler")
Link: https://lore.kernel.org/r/20190815205227.7949-1-Philip.Yang@amd.com
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
mmu_notifier_unregister_no_release() and mmu_notifier_call_srcu() no
longer have any users, they have all been converted to use
mmu_notifier_put().
So delete this difficult to use interface.
Link: https://lore.kernel.org/r/20190806231548.25242-12-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
From rdma.git
Jason Gunthorpe says:
====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================
The branch is based on v5.3-rc5 due to dependencies, and is being taken
into hmm.git due to dependencies in the next patches.
* odp_fixes:
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
RDMA/core: Make invalidate_range a device operation
RDMA/odp: Use kvcalloc for the dma_list and page_list
RDMA/odp: Check for overflow when computing the umem_odp end
RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
RDMA/odp: Split creating a umem_odp from ib_umem_get
RDMA/odp: Make the three ways to create a umem_odp clear
RMDA/odp: Consolidate umem_odp initialization
RDMA/odp: Make it clearer when a umem is an implicit ODP umem
RDMA/odp: Iterate over the whole rbtree directly
RDMA/odp: Use the common interval tree library instead of generic
RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Set S_SWAPFILE on block device inodes so that they have the same
protections as a swap flie.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The dev field in struct dev_pagemap is only used to print dev_name in two
places, which are at best nice to have. Just remove the field and thus
the name in those two messages.
Link: https://lore.kernel.org/r/20190818090557.17853-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Just a bit of paranoia, since if we start pushing this deep into
callchains it's hard to spot all places where an mmu notifier
implementation might fail when it's not allowed to.
Inspired by some confusion we had discussing i915 mmu notifiers and
whether we could use the newly-introduced return value to handle some
corner cases. Until we realized that these are only for when a task has
been killed by the oom reaper.
An alternative approach would be to split the callback into two versions,
one with the int return value, and the other with void return value like
in older kernels. But that's a lot more churn for fairly little gain I
think.
Summary from the m-l discussion on why we want something at warning level:
This allows automated tooling in CI to catch bugs without humans having to
look at everything. If we just upgrade the existing pr_info to a pr_warn,
then we'll have false positives. And as-is, no one will ever spot the
problem since it's lost in the massive amounts of overall dmesg noise.
Link: https://lore.kernel.org/r/20190814202027.18735-2-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
CONFIG_MIGRATE_VMA_HELPER guards helpers that are required for proper
devic private memory support. Remove the option and just check for
CONFIG_DEVICE_PRIVATE instead.
Link: https://lore.kernel.org/r/20190814075928.23766-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
No one ever checks this flag, and we could easily get that information
from the page if needed.
Link: https://lore.kernel.org/r/20190814075928.23766-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There isn't any good reason to pass callbacks to migrate_vma. Instead
we can just export the three steps done by this function to drivers and
let them sequence the operation without callbacks. This removes a lot
of boilerplate code as-is, and will allow the drivers to drastically
improve code flow and error handling further on.
Link: https://lore.kernel.org/r/20190814075928.23766-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is a significant simplification, it eliminates all the remaining
'hmm' stuff in mm_struct, eliminates krefing along the critical notifier
paths, and takes away all the ugly locking and abuse of page_table_lock.
mmu_notifier_get() provides the single struct hmm per struct mm which
eliminates mm->hmm.
It also directly guarantees that no mmu_notifier op callback is callable
while concurrent free is possible, this eliminates all the krefs inside
the mmu_notifier callbacks.
The remaining krefs in the range code were overly cautious, drivers are
already not permitted to free the mirror while a range exists.
Link: https://lore.kernel.org/r/20190806231548.25242-6-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Many places in the kernel have a flow where userspace will create some
object and that object will need to connect to the subsystem's
mmu_notifier subscription for the duration of its lifetime.
In this case the subsystem is usually tracking multiple mm_structs and it
is difficult to keep track of what struct mmu_notifier's have been
allocated for what mm's.
Since this has been open coded in a variety of exciting ways, provide core
functionality to do this safely.
This approach uses the struct mmu_notifier_ops * as a key to determine if
the subsystem has a notifier registered on the mm or not. If there is a
registration then the existing notifier struct is returned, otherwise the
ops->alloc_notifiers() is used to create a new per-subsystem notifier for
the mm.
The destroy side incorporates an async call_srcu based destruction which
will avoid bugs in the callers such as commit 6d7c3cde93 ("mm/hmm: fix
use after free with struct hmm in the mmu notifiers").
Since we are inside the mmu notifier core locking is fairly simple, the
allocation uses the same approach as for mmu_notifier_mm, the write side
of the mmap_sem makes everything deterministic and we only need to do
hlist_add_head_rcu() under the mm_take_all_locks(). The new users count
and the discoverability in the hlist is fully serialized by the
mmu_notifier_mm->lock.
Link: https://lore.kernel.org/r/20190806231548.25242-4-jgg@ziepe.ca
Co-developed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A prior commit e0f3c3f78d ("mm/mmu_notifier: init notifier if necessary")
made an attempt at doing this, but had to be reverted as calling
the GFP_KERNEL allocator under the i_mmap_mutex causes deadlock, see
commit 35cfa2b0b4 ("mm/mmu_notifier: allocate mmu_notifier in advance").
However, we can avoid that problem by doing the allocation only under
the mmap_sem, which is already happening.
Since all writers to mm->mmu_notifier_mm hold the write side of the
mmap_sem reading it under that sem is deterministic and we can use that to
decide if the allocation path is required, without speculation.
The actual update to mmu_notifier_mm must still be done under the
mm_take_all_locks() to ensure read-side coherency.
Link: https://lore.kernel.org/r/20190806231548.25242-3-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This simplifies the code to not have so many one line functions and extra
logic. __mmu_notifier_register() simply becomes the entry point to
register the notifier, and the other one calls it under lock.
Also add a lockdep_assert to check that the callers are holding the lock
as expected.
Link: https://lore.kernel.org/r/20190806231548.25242-2-jgg@ziepe.ca
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
the kernel-v5.2.3 testing. This is caused by a race between hugetlb
page migration and page fault.
If a hugetlb page can not be allocated to satisfy a page fault, the task
is sent SIGBUS. This is normal hugetlbfs behavior. A hugetlb fault
mutex exists to prevent two tasks from trying to instantiate the same
page. This protects against the situation where there is only one
hugetlb page, and both tasks would try to allocate. Without the mutex,
one would fail and SIGBUS even though the other fault would be
successful.
There is a similar race between hugetlb page migration and fault.
Migration code will allocate a page for the target of the migration. It
will then unmap the original page from all page tables. It does this
unmap by first clearing the pte and then writing a migration entry. The
page table lock is held for the duration of this clear and write
operation. However, the beginnings of the hugetlb page fault code
optimistically checks the pte without taking the page table lock. If
clear (as it can be during the migration unmap operation), a hugetlb
page allocation is attempted to satisfy the fault. Note that the page
which will eventually satisfy this fault was already allocated by the
migration code. However, the allocation within the fault path could
fail which would result in the task incorrectly being sent SIGBUS.
Ideally, we could take the hugetlb fault mutex in the migration code
when modifying the page tables. However, locks must be taken in the
order of hugetlb fault mutex, page lock, page table lock. This would
require significant rework of the migration code. Instead, the issue is
addressed in the hugetlb fault code. After failing to allocate a huge
page, take the page table lock and check for huge_pte_none before
returning an error. This is the same check that must be made further in
the code even if page allocation is successful.
Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Li Wang <liwang@redhat.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Chinner reported a problem pointing a finger at commit 1c30844d2d
("mm: reclaim small amounts of memory when an external fragmentation
event occurs").
The report is extensive:
https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
and it's worth recording the most relevant parts (colorful language and
typos included).
When running a simple, steady state 4kB file creation test to
simulate extracting tarballs larger than memory full of small
files into the filesystem, I noticed that once memory fills up
the cache balance goes to hell.
The workload is creating one dirty cached inode for every dirty
page, both of which should require a single IO each to clean and
reclaim, and creation of inodes is throttled by the rate at which
dirty writeback runs at (via balance dirty pages). Hence the ingest
rate of new cached inodes and page cache pages is identical and
steady. As a result, memory reclaim should quickly find a steady
balance between page cache and inode caches.
The moment memory fills, the page cache is reclaimed at a much
faster rate than the inode cache, and evidence suggests that
the inode cache shrinker is not being called when large batches
of pages are being reclaimed. In roughly the same time period
that it takes to fill memory with 50% pages and 50% slab caches,
memory reclaim reduces the page cache down to just dirty pages
and slab caches fill the entirety of memory.
The LRU is largely full of dirty pages, and we're getting spikes
of random writeback from memory reclaim so it's all going to shit.
Behaviour never recovers, the page cache remains pinned at just
dirty pages, and nothing I could tune would make any difference.
vfs_cache_pressure makes no difference - I would set it so high
it should trim the entire inode caches in a single pass, yet it
didn't do anything. It was clear from tracing and live telemetry
that the shrinkers were pretty much not running except when
there was absolutely no memory free at all, and then they did
the minimum necessary to free memory to make progress.
So I went looking at the code, trying to find places where pages
got reclaimed and the shrinkers weren't called. There's only one
- kswapd doing boosted reclaim as per commit 1c30844d2d ("mm:
reclaim small amounts of memory when an external fragmentation
event occurs").
The watermark boosting introduced by the commit is triggered in response
to an allocation "fragmentation event". The boosting was not intended
to target THP specifically and triggers even if THP is disabled.
However, with Dave's perfectly reasonable workload, fragmentation events
can be very common given the ratio of slab to page cache allocations so
boosting remains active for long periods of time.
As high-order allocations might use compaction and compaction cannot
move slab pages the decision was made in the commit to special-case
kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
reclaiming slab does not directly help compaction.
As Dave notes, this decision means that slab can be artificially
protected for long periods of time and messes up the balance with slab
and page caches.
Removing the special casing can still indirectly help avoid
fragmentation by avoiding fragmentation-causing events due to slab
allocation as pages from a slab pageblock will have some slab objects
freed. Furthermore, with the special casing, reclaim behaviour is
unpredictable as kswapd sometimes examines slab and sometimes does not
in a manner that is tricky to tune or analyse.
This patch removes the special casing. The downside is that this is not
a universal performance win. Some benchmarks that depend on the
residency of data when rereading metadata may see a regression when slab
reclaim is restored to its original behaviour. Similarly, some
benchmarks that only read-once or write-once may perform better when
page reclaim is too aggressive. The primary upside is that slab
shrinker is less surprising (arguably more sane but that's a matter of
opinion), behaves consistently regardless of the fragmentation state of
the system and properly obeys VM sysctls.
A fsmark benchmark configuration was constructed similar to what Dave
reported and is codified by the mmtest configuration
config-io-fsmark-small-file-stream. It was evaluated on a 1-socket
machine to avoid dealing with NUMA-related issues and the timing of
reclaim. The storage was an SSD Samsung Evo and a fresh trimmed XFS
filesystem was used for the test data.
This is not an exact replication of Dave's setup. The configuration
scales its parameters depending on the memory size of the SUT to behave
similarly across machines. The parameters mean the first sample
reported by fs_mark is using 50% of RAM which will barely be throttled
and look like a big outlier. Dave used fake NUMA to have multiple
kswapd instances which I didn't replicate. Finally, the number of
iterations differ from Dave's test as the target disk was not large
enough. While not identical, it should be representative.
fsmark
5.3.0-rc3 5.3.0-rc3
vanilla shrinker-v1r1
Min 1-files/sec 4444.80 ( 0.00%) 4765.60 ( 7.22%)
1st-qrtle 1-files/sec 5005.10 ( 0.00%) 5091.70 ( 1.73%)
2nd-qrtle 1-files/sec 4917.80 ( 0.00%) 4855.60 ( -1.26%)
3rd-qrtle 1-files/sec 4667.40 ( 0.00%) 4831.20 ( 3.51%)
Max-1 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
Max-5 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
Max-10 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
Max-90 1-files/sec 4649.60 ( 0.00%) 4780.70 ( 2.82%)
Max-95 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%)
Max-99 1-files/sec 4491.00 ( 0.00%) 4768.20 ( 6.17%)
Max 1-files/sec 11421.50 ( 0.00%) 9999.30 ( -12.45%)
Hmean 1-files/sec 5004.75 ( 0.00%) 5075.96 ( 1.42%)
Stddev 1-files/sec 1778.70 ( 0.00%) 1369.66 ( 23.00%)
CoeffVar 1-files/sec 33.70 ( 0.00%) 26.05 ( 22.71%)
BHmean-99 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%)
BHmean-95 1-files/sec 5053.72 ( 0.00%) 5101.52 ( 0.95%)
BHmean-90 1-files/sec 5107.05 ( 0.00%) 5131.41 ( 0.48%)
BHmean-75 1-files/sec 5208.45 ( 0.00%) 5206.68 ( -0.03%)
BHmean-50 1-files/sec 5405.53 ( 0.00%) 5381.62 ( -0.44%)
BHmean-25 1-files/sec 6179.75 ( 0.00%) 6095.14 ( -1.37%)
5.3.0-rc3 5.3.0-rc3
vanillashrinker-v1r1
Duration User 501.82 497.29
Duration System 4401.44 4424.08
Duration Elapsed 8124.76 8358.05
This is showing a slight skew for the max result representing a large
outlier for the 1st, 2nd and 3rd quartile are similar indicating that
the bulk of the results show little difference. Note that an earlier
version of the fsmark configuration showed a regression but that
included more samples taken while memory was still filling.
Note that the elapsed time is higher. Part of this is that the
configuration included time to delete all the test files when the test
completes -- the test automation handles the possibility of testing
fsmark with multiple thread counts. Without the patch, many of these
objects would be memory resident which is part of what the patch is
addressing.
There are other important observations that justify the patch.
1. With the vanilla kernel, the number of dirty pages in the system is
very low for much of the test. With this patch, dirty pages is
generally kept at 10% which matches vm.dirty_background_ratio which
is normal expected historical behaviour.
2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
0.95 for much of the test i.e. Slab is being left alone and
dominating memory consumption. With the patch applied, the ratio
varies between 0.35 and 0.45 with the bulk of the measured ratios
roughly half way between those values. This is a different balance to
what Dave reported but it was at least consistent.
3. Slabs are scanned throughout the entire test with the patch applied.
The vanille kernel has periods with no scan activity and then
relatively massive spikes.
4. Without the patch, kswapd scan rates are very variable. With the
patch, the scan rates remain quite steady.
4. Overall vmstats are closer to normal expectations
5.3.0-rc3 5.3.0-rc3
vanilla shrinker-v1r1
Ops Direct pages scanned 99388.00 328410.00
Ops Kswapd pages scanned 45382917.00 33451026.00
Ops Kswapd pages reclaimed 30869570.00 25239655.00
Ops Direct pages reclaimed 74131.00 5830.00
Ops Kswapd efficiency % 68.02 75.45
Ops Kswapd velocity 5585.75 4002.25
Ops Page reclaim immediate 1179721.00 430927.00
Ops Slabs scanned 62367361.00 73581394.00
Ops Direct inode steals 2103.00 1002.00
Ops Kswapd inode steals 570180.00 5183206.00
o Vanilla kernel is hitting direct reclaim more frequently,
not very much in absolute terms but the fact the patch
reduces it is interesting
o "Page reclaim immediate" in the vanilla kernel indicates
dirty pages are being encountered at the tail of the LRU.
This is generally bad and means in this case that the LRU
is not long enough for dirty pages to be cleaned by the
background flush in time. This is much reduced by the
patch.
o With the patch, kswapd is reclaiming 10 times more slab
pages than with the vanilla kernel. This is indicative
of the watermark boosting over-protecting slab
A more complete set of tests were run that were part of the basis for
introducing boosting and while there are some differences, they are well
within tolerances.
Bottom line, the special casing kswapd to avoid slab behaviour is
unpredictable and can lead to abnormal results for normal workloads.
This patch restores the expected behaviour that slab and page cache is
balanced consistently for a workload with a steady allocation ratio of
slab/pagecache pages. It also means that if there are workloads that
favour the preservation of slab over pagecache that it can be tuned via
vm.vfs_cache_pressure where as the vanilla kernel effectively ignores
the parameter when boosting is active.
Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
Fixes: 1c30844d2d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 2f0799a0ff ("mm, thp: restore node-local
hugepage allocations").
commit 2f0799a0ff was rightfully applied to avoid the risk of a
severe regression that was reported by the kernel test robot at the end
of the merge window. Now we understood the regression was a false
positive and was caused by a significant increase in fairness during a
swap trashing benchmark. So it's safe to re-apply the fix and continue
improving the code from there. The benchmark that reported the
regression is very useful, but it provides a meaningful result only when
there is no significant alteration in fairness during the workload. The
removal of __GFP_THISNODE increased fairness.
__GFP_THISNODE cannot be used in the generic page faults path for new
memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
behavior significantly deviates from what the MPOL_DEFAULT semantics are
supposed to be for THP and 4k allocations alike.
Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
set to "madvise") has never meant to provide an implicit MPOL_BIND on
the "current" node the task is running on, causing swap storms and
providing a much more aggressive behavior than even zone_reclaim_node =
3.
Any workload who could have benefited from __GFP_THISNODE has now to
enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
the zone_reclaim_mode behavior, but it only did so if THP was enabled:
if THP was disabled, there would have been no chance to get any 4k page
from the current node if the current node was full of pagecache, which
further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
must work exactly the same with MADV_HUGEPAGE set or not.
The performance characteristic of memory depends on the hardware
details. The numbers below are obtained on Naples/EPYC architecture and
the N/A projection extends them to show what we should aim for in the
future as a good THP NUMA locality default. The benchmark used
exercises random memory seeks (note: the cost of the page faults is not
part of the measurement).
D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A
D0 means distance zero (i.e. local memory), D1 means distance one (i.e.
intra socket memory), D2 means distance two (i.e. inter socket memory),
etc...
For the guest physical memory allocated by qemu and for guest mode
kernel the performance characteristic of RAM is more complex and an
ideal default could be:
D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A
NOTE: the N/A are projections and haven't been measured yet, the
measurement in this case is done on a 1950x with only two NUMA nodes.
The THP case here means THP was used both in the host and in the guest.
After applying this commit the THP NUMA locality order that we'll get
out of MADV_HUGEPAGE is this:
D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...
Before this commit it was:
D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...
Even if we ignore the breakage of large workloads that can't fit in a
single node that the __GFP_THISNODE implicit "current node" mbind
caused, the THP NUMA locality order provided by __GFP_THISNODE was still
not the one we shall aim for in the long term (i.e. the first one at
the top).
After this commit is applied, we can introduce a new allocator multi
order API and to replace those two alloc_pages_vmas calls in the page
fault path, with a single multi order call:
unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
page = alloc_pages_multi_order(..., &order);
if (!page)
goto out;
if (!(order & (1 << 0))) {
VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
/* THP fault */
} else {
VM_WARN_ON(order != 1 << 0);
/* 4k fallback */
}
The page allocator logic has to be altered so that when it fails on any
zone with order 9, it has to try again with a order 0 before falling
back to the next zone in the zonelist.
After that we need to do more measurements and evaluate if adding an
opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
with "DN+1 THP | DN 4k" at every NUMA distance crossing.
Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".
The fixes for what was originally reported as "pathological THP
behavior" we rightfully reverted to be sure not to introduced
regressions at end of a merge window after a severe regression report
from the kernel bot. We can safely re-apply them now that we had time
to analyze the problem.
The mm process worked fine, because the good fixes were eventually
committed upstream without excessive delay.
The regression reported by the kernel bot however forced us to revert
the good fixes to be sure not to introduce regressions and to give us
the time to analyze the issue further. The silver lining is that this
extra time allowed to think more at this issue and also plan for a
future direction to improve things further in terms of THP NUMA
locality.
This patch (of 2):
This reverts commit 356ff8a9a7 ("Revert "mm, thp: consolidate THP
gfp handling into alloc_hugepage_direct_gfpmask"). So it reapplies
89c83fb539 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask").
Consolidation of the THP allocation flags at the same place was meant to
be a clean up to easier handle otherwise scattered code which is
imposing a maintenance burden. There were no real problems observed
with the gfp mask consolidation but the reversion was rushed through
without a larger consensus regardless.
This patch brings the consolidation back because this should make the
long term maintainability easier as well as it should allow future
changes to be less error prone.
[mhocko@kernel.org: changelog additions]
Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memcg counters for shadow nodes are broken because the memcg pointer is
obtained in a wrong way. The following approach is used:
virt_to_page(xa_node)->mem_cgroup
Since commit 4d96ba3530 ("mm: memcg/slab: stop setting
page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
set for slab pages, so memcg_from_slab_page() should be used instead.
Also I doubt that it ever worked correctly: virt_to_head_page() should
be used instead of virt_to_page(). Otherwise objects residing on tail
pages are not accounted, because only the head page contains a valid
mem_cgroup pointer. That was a case since the introduction of these
counters by the commit 68d48e6a2d ("mm: workingset: add vmstat counter
for shadow nodes").
Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, when checking to see if accessing n bytes starting at address
"ptr" will cause a wraparound in the memory addresses, the check in
check_bogus_address() adds an extra byte, which is incorrect, as the
range of addresses that will be accessed is [ptr, ptr + (n - 1)].
This can lead to incorrectly detecting a wraparound in the memory
address, when trying to read 4 KB from memory that is mapped to the the
last possible page in the virtual address space, when in fact, accessing
that range of memory would not cause a wraparound to occur.
Use the memory range that will actually be accessed when considering if
accessing a certain amount of bytes will cause the memory address to
wrap around.
Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org
Fixes: f5509cc18d ("mm: Hardened usercopy")
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Co-developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Trilok Soni <tsoni@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If an error occurs during kmemleak_init() (e.g. kmem cache cannot be
created), kmemleak is disabled but kmemleak_early_log remains enabled.
Subsequently, when the .init.text section is freed, the log_early()
function no longer exists. To avoid a page fault in such scenario,
ensure that kmemleak_disable() also disables early logging.
Link: http://lkml.kernel.org/r/20190731152302.42073-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recent changes to the vmalloc code by commit 68ad4a3304
("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
cause spurious percpu allocation failures. These, in turn, can result
in panic()s in the slub code. One such possible panic was reported by
Dave Hansen in following link https://lkml.org/lkml/2019/6/19/939.
Another related panic observed is,
RIP: 0033:0x7f46f7441b9b
Call Trace:
dump_stack+0x61/0x80
pcpu_alloc.cold.30+0x22/0x4f
mem_cgroup_css_alloc+0x110/0x650
cgroup_apply_control_enable+0x133/0x330
cgroup_mkdir+0x41b/0x500
kernfs_iop_mkdir+0x5a/0x90
vfs_mkdir+0x102/0x1b0
do_mkdirat+0x7d/0xf0
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
uses two lists (vmap_area_list & free_vmap_area_list) to track the used
and free VM areas in VMALLOC space. And pcpu_get_vm_areas(offsets[],
sizes[], nr_vms, align) function is used for allocating congruent VM
areas for percpu memory allocator. In order to not conflict with
VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
VMALLOC space. So the search for free vm_area for the given requirement
starts near VMALLOC_END and moves upwards towards VMALLOC_START.
Prior to commit 68ad4a3304, the search for free vm_area in
pcpu_get_vm_areas() involves following two main steps.
Step 1:
Find a aligned "base" adress near VMALLOC_END.
va = free vm area near VMALLOC_END
Step 2:
Loop through number of requested vm_areas and check,
Step 2.1:
if (base < VMALLOC_START)
1. fail with error
Step 2.2:
// end is offsets[area] + sizes[area]
if (base + end > va->vm_end)
1. Move the base downwards and repeat Step 2
Step 2.3:
if (base + start < va->vm_start)
1. Move to previous free vm_area node, find aligned
base address and repeat Step 2
But Commit 68ad4a3304 removed Step 2.2 and modified Step 2.3 as below:
Step 2.3:
if (base + start < va->vm_start || base + end > va->vm_end)
1. Move to previous free vm_area node, find aligned
base address and repeat Step 2
Above change is the root cause of spurious percpu memory allocation
failures. For example, consider a case where a relatively large vm_area
(~ 30 TB) was ignored in free vm_area search because it did not pass the
base + end < vm->vm_end boundary check. Ignoring such large free
vm_area's would lead to not finding free vm_area within boundary of
VMALLOC_start to VMALLOC_END which in turn leads to allocation failures.
So modify the search algorithm to include Step 2.2.
Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
Fixes: 68ad4a3304 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The constraint from the zpool use of z3fold_destroy_pool() is there are
no outstanding handles to memory (so no active allocations), but it is
possible for there to be outstanding work on either of the two wqs in
the pool.
Calling z3fold_deregister_migration() before the workqueues are drained
means that there can be allocated pages referencing a freed inode,
causing any thread in compaction to be able to trip over the bad pointer
in PageMovable().
Link: http://lkml.kernel.org/r/20190726224810.79660-2-henryburns@google.com
Fixes: 1f862989b0 ("mm/z3fold.c: support page migration")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jonathan Adams <jwadams@google.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The constraint from the zpool use of z3fold_destroy_pool() is there are
no outstanding handles to memory (so no active allocations), but it is
possible for there to be outstanding work on either of the two wqs in
the pool.
If there is work queued on pool->compact_workqueue when it is called,
z3fold_destroy_pool() will do:
z3fold_destroy_pool()
destroy_workqueue(pool->release_wq)
destroy_workqueue(pool->compact_wq)
drain_workqueue(pool->compact_wq)
do_compact_page(zhdr)
kref_put(&zhdr->refcount)
__release_z3fold_page(zhdr, ...)
queue_work_on(pool->release_wq, &pool->work) *BOOM*
So compact_wq needs to be destroyed before release_wq.
Link: http://lkml.kernel.org/r/20190726224810.79660-1-henryburns@google.com
Fixes: 5d03a66139 ("mm/z3fold.c: use kref to prevent page free/compact race")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jonathan Adams <jwadams@google.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk
Cc: Henry Burns <henrywolfeburns@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running syzkaller internally, we ran into the below bug on 4.9.x
kernel:
kernel BUG at mm/huge_memory.c:2124!
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
task: ffff880067b34900 task.stack: ffff880068998000
RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
Call Trace:
split_huge_page include/linux/huge_mm.h:100 [inline]
queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
walk_pmd_range mm/pagewalk.c:50 [inline]
walk_pud_range mm/pagewalk.c:90 [inline]
walk_pgd_range mm/pagewalk.c:116 [inline]
__walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
walk_page_range+0x154/0x370 mm/pagewalk.c:285
queue_pages_range+0x115/0x150 mm/mempolicy.c:694
do_mbind mm/mempolicy.c:1241 [inline]
SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
entry_SYSCALL_64_after_swapgs+0x5d/0xdb
Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
RIP [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
RSP <ffff88006899f980>
with the below test:
uint64_t r[1] = {0xffffffffffffffff};
int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
intptr_t res = 0;
res = syscall(__NR_socket, 0x11, 3, 0x300);
if (res != -1)
r[0] = res;
*(uint32_t*)0x20000040 = 0x10000;
*(uint32_t*)0x20000044 = 1;
*(uint32_t*)0x20000048 = 0xc520;
*(uint32_t*)0x2000004c = 1;
syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
*(uint64_t*)0x20000340 = 2;
syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
return 0;
}
Actually the test does:
mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
socket(AF_PACKET, SOCK_RAW, 768) = 3
setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0
The setsockopt() would allocate compound pages (16 pages in this test)
for packet tx ring, then the mmap() would call packet_mmap() to map the
pages into the user address space specified by the mmap() call.
When calling mbind(), it would scan the vma to queue the pages for
migration to the new node. It would split any huge page since 4.9
doesn't support THP migration, however, the packet tx ring compound
pages are not THP and even not movable. So, the above bug is triggered.
However, the later kernel is not hit by this issue due to commit
d44d363f65 ("mm: don't assume anonymous pages have SwapBacked flag"),
which just removes the PageSwapBacked check for a different reason.
But, there is a deeper issue. According to the semantic of mbind(), it
should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
MPOL_MF_STRICT was also specified, but the kernel was unable to move all
existing pages in the range. The tx ring of the packet socket is
definitely not movable, however, mbind() returns success for this case.
Although the most socket file associates with non-movable pages, but XDP
may have movable pages from gup. So, it sounds not fine to just check
the underlying file type of vma in vma_migratable().
Change migrate_page_add() to check if the page is movable or not, if it
is unmovable, just return -EIO. But do not abort pte walk immediately,
since there may be pages off LRU temporarily. We should migrate other
pages if MPOL_MF_MOVE* is specified. Set has_unmovable flag if some
paged could not be not moved, then return -EIO for mbind() eventually.
With this change the above test would return -EIO as expected.
[yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When both MPOL_MF_MOVE* and MPOL_MF_STRICT was specified, mbind() should
try best to migrate misplaced pages, if some of the pages could not be
migrated, then return -EIO.
There are three different sub-cases:
1. vma is not migratable
2. vma is migratable, but there are unmovable pages
3. vma is migratable, pages are movable, but migrate_pages() fails
If #1 happens, kernel would just abort immediately, then return -EIO,
after a7f40cfe3b ("mm: mempolicy: make mbind() return -EIO when
MPOL_MF_STRICT is specified").
If #3 happens, kernel would set policy and migrate pages with
best-effort, but won't rollback the migrated pages and reset the policy
back.
Before that commit, they behaves in the same way. It'd better to keep
their behavior consistent. But, rolling back the migrated pages and
resetting the policy back sounds not feasible, so just make #1 behave as
same as #3.
Userspace will know that not everything was successfully migrated (via
-EIO), and can take whatever steps it deems necessary - attempt
rollback, determine which exact page(s) are violating the policy, etc.
Make queue_pages_range() return 1 to indicate there are unmovable pages
or vma is not migratable.
The #2 is not handled correctly in the current kernel, the following
patch will fix it.
[yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When migrating an anonymous private page to a ZONE_DEVICE private page,
the source page->mapping and page->index fields are copied to the
destination ZONE_DEVICE struct page and the page_mapcount() is
increased. This is so rmap_walk() can be used to unmap and migrate the
page back to system memory.
However, try_to_unmap_one() computes the subpage pointer from a swap pte
which computes an invalid page pointer and a kernel panic results such
as:
BUG: unable to handle page fault for address: ffffea1fffffffc8
Currently, only single pages can be migrated to device private memory so
no subpage computation is needed and it can be set to "page".
[rcampbell@nvidia.com: add comment]
Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
Fixes: a5430dda8a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a ZONE_DEVICE private page is freed, the page->mapping field can be
set. If this page is reused as an anonymous page, the previous value
can prevent the page from being inserted into the CPU's anon rmap table.
For example, when migrating a pte_none() page to device memory:
migrate_vma(ops, vma, start, end, src, dst, private)
migrate_vma_collect()
src[] = MIGRATE_PFN_MIGRATE
migrate_vma_prepare()
/* no page to lock or isolate so OK */
migrate_vma_unmap()
/* no page to unmap so OK */
ops->alloc_and_copy()
/* driver allocates ZONE_DEVICE page for dst[] */
migrate_vma_pages()
migrate_vma_insert_page()
page_add_new_anon_rmap()
__page_set_anon_rmap()
/* This check sees the page's stale mapping field */
if (PageAnon(page))
return
/* page->mapping is not updated */
The result is that the migration appears to succeed but a subsequent CPU
fault will be unable to migrate the page back to system memory or worse.
Clear the page->mapping field when freeing the ZONE_DEVICE page so stale
pointer data doesn't affect future page use.
Link: http://lkml.kernel.org/r/20190719192955.30462-3-rcampbell@nvidia.com
Fixes: b7a523109f ("mm: don't clear ->mapping in hmm_devmem_free")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, attempts to shutdown and re-enable a device-dax instance
trigger:
Missing reference count teardown definition
WARNING: CPU: 37 PID: 1608 at mm/memremap.c:211 devm_memremap_pages+0x234/0x850
[..]
RIP: 0010:devm_memremap_pages+0x234/0x850
[..]
Call Trace:
dev_dax_probe+0x66/0x190 [device_dax]
really_probe+0xef/0x390
driver_probe_device+0xb4/0x100
device_driver_attach+0x4f/0x60
Given that the setup path initializes pgmap->ref, arrange for it to be
also torn down so devm_memremap_pages() is ready to be called again and
not be mistaken for the 3rd-party per-cpu-ref case.
Fixes: 24917f6b10 ("memremap: provide an optional internal refcount in struct dev_pagemap")
Reported-by: Fan Du <fan.du@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/156530042781.2068700.8733813683117819799.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Make HMM_MIRROR an option that is selected by drivers wanting to use it
instead of a user visible option as it is just a low-level implementation
detail.
Link: https://lore.kernel.org/r/20190806160554.14046-15-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There isn't really any architecture specific code in this page table walk
implementation, so drop the dependencies.
Link: https://lore.kernel.org/r/20190806160554.14046-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Stub out the whole function and assign NULL to the .hugetlb_entry method
if CONFIG_HUGETLB_PAGE is not set, as the method won't ever be called in
that case.
Link: https://lore.kernel.org/r/20190806160554.14046-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Stub out the whole function when CONFIG_TRANSPARENT_HUGEPAGE is not set to
make the function easier to read.
Link: https://lore.kernel.org/r/20190806160554.14046-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We only need the special pud_entry walker if PUD-sized hugepages and pte
mappings are supported, else the common pagewalk code will take care of
the iteration. Not implementing this callback reduced the amount of code
compiled for non-x86 platforms, and also fixes compile failures with other
architectures when helpers like pud_pfn are not implemented.
Link: https://lore.kernel.org/r/20190806160554.14046-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
pte_index is an internal arch helper in various architectures, without
consistent semantics. Open code that calculation of a PMD index based on
the virtual address instead.
Link: https://lore.kernel.org/r/20190806160554.14046-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The pagewalk code already passes the value as the hmask parameter.
Link: https://lore.kernel.org/r/20190806160554.14046-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All users pass PAGE_SIZE here, and if we wanted to support single entries
for huge pages we should really just add a HMM_FAULT_HUGEPAGE flag instead
that uses the huge page size instead of having the caller calculate that
size once, just for the hmm code to verify it.
Link: https://lore.kernel.org/r/20190806160554.14046-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The start, end and page_shift values are all saved in the range structure,
so we might as well use that for argument passing.
Link: https://lore.kernel.org/r/20190806160554.14046-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
memremap.c implements MM functionality for ZONE_DEVICE, so it really
should be in the mm/ directory, not the kernel/ one.
Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
return is unneeded in void function
Link: http://lkml.kernel.org/r/20190723130814.21826-1-houweitaoo@gmail.com
Signed-off-by: Weitao Hou <houweitaoo@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When CONFIG_MIGRATE_VMA_HELPER is enabled, migrate_vma() calls
migrate_vma_collect() which initializes a struct mm_walk but didn't
initialize mm_walk.pud_entry. (Found by code inspection) Use a C
structure initialization to make sure it is set to NULL.
Link: http://lkml.kernel.org/r/20190719233225.12243-1-rcampbell@nvidia.com
Fixes: 8763cb45ab ("mm/migrate: new memory migration helper for use with device memory")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"howaboutsynergy" reported via kernel buzilla number 204165 that
compact_zone_order was consuming 100% CPU during a stress test for
prolonged periods of time. Specifically the following command, which
should exit in 10 seconds, was taking an excessive time to finish while
the CPU was pegged at 100%.
stress -m 220 --vm-bytes 1000000000 --timeout 10
Tracing indicated a pattern as follows
stress-3923 [007] 519.106208: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106212: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106216: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106219: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106223: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106227: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106231: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106235: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106238: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106242: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
Note that compaction is entered in rapid succession while scanning and
isolating nothing. The problem is that when a task that is compacting
receives a fatal signal, it retries indefinitely instead of exiting
while making no progress as a fatal signal is pending.
It's not easy to trigger this condition although enabling zswap helps on
the basis that the timing is altered. A very small window has to be hit
for the problem to occur (signal delivered while compacting and
isolating a PFN for migration that is not aligned to SWAP_CLUSTER_MAX).
This was reproduced locally -- 16G single socket system, 8G swap, 30%
zswap configured, vm-bytes 22000000000 using Colin Kings stress-ng
implementation from github running in a loop until the problem hits).
Tracing recorded the problem occurring almost 200K times in a short
window. With this patch, the problem hit 4 times but the task existed
normally instead of consuming CPU.
This problem has existed for some time but it was made worse by commit
cf66f0700c ("mm, compaction: do not consider a need to reschedule as
contention"). Before that commit, if the same condition was hit then
locks would be quickly contended and compaction would exit that way.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204165
Link: http://lkml.kernel.org/r/20190718085708.GE24383@techsingularity.net
Fixes: cf66f0700c ("mm, compaction: do not consider a need to reschedule as contention")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [5.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
buffer_migrate_page_norefs() can race with bh users in the following
way:
CPU1 CPU2
buffer_migrate_page_norefs()
buffer_migrate_lock_buffers()
checks bh refs
spin_unlock(&mapping->private_lock)
__find_get_block()
spin_lock(&mapping->private_lock)
grab bh ref
spin_unlock(&mapping->private_lock)
move page do bh work
This can result in various issues like lost updates to buffers (i.e.
metadata corruption) or use after free issues for the old page.
This patch closes the race by holding mapping->private_lock while the
mapping is being moved to a new page. Ordinarily, a reference can be
taken outside of the private_lock using the per-cpu BH LRU but the
references are checked and the LRU invalidated if necessary. The
private_lock is held once the references are known so the buffer lookup
slow path will spin on the private_lock. Between the page lock and
private_lock, it should be impossible for other references to be
acquired and updates to happen during the migration.
A user had reported data corruption issues on a distribution kernel with
a similar page migration implementation as mainline. The data
corruption could not be reproduced with this patch applied. A small
number of migration-intensive tests were run and no performance problems
were noted.
[mgorman@techsingularity.net: Changelog, removed tracing]
Link: http://lkml.kernel.org/r/20190718090238.GF24383@techsingularity.net
Fixes: 89cb0888ca "mm: migrate: provide buffer_migrate_page_norefs()"
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt reported premature oom on kernel with
"cgroup_disable=memory" since mem_cgroup_is_root() returns false even
though memcg is actually NULL. The drop_caches is also broken.
It is because commit aeed1d325d ("mm/vmscan.c: generalize
shrink_slab() calls in shrink_node()") removed the !memcg check before
!mem_cgroup_is_root(). And, surprisingly root memcg is allocated even
though memory cgroup is disabled by kernel boot parameter.
Add mem_cgroup_disabled() check to make reclaimer work as expected.
Link: http://lkml.kernel.org/r/1563385526-20805-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: aeed1d325d ("mm/vmscan.c: generalize shrink_slab() calls in shrink_node()")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Hadrava <had@kam.mff.cuni.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running ltp's oom test with kmemleak enabled, the below warning was
triggerred since kernel detects __GFP_NOFAIL & ~__GFP_DIRECT_RECLAIM is
passed in:
WARNING: CPU: 105 PID: 2138 at mm/page_alloc.c:4608 __alloc_pages_nodemask+0x1c31/0x1d50
Modules linked in: loop dax_pmem dax_pmem_core ip_tables x_tables xfs virtio_net net_failover virtio_blk failover ata_generic virtio_pci virtio_ring virtio libata
CPU: 105 PID: 2138 Comm: oom01 Not tainted 5.2.0-next-20190710+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:__alloc_pages_nodemask+0x1c31/0x1d50
...
kmemleak_alloc+0x4e/0xb0
kmem_cache_alloc+0x2a7/0x3e0
mempool_alloc_slab+0x2d/0x40
mempool_alloc+0x118/0x2b0
bio_alloc_bioset+0x19d/0x350
get_swap_bio+0x80/0x230
__swap_writepage+0x5ff/0xb20
The mempool_alloc_slab() clears __GFP_DIRECT_RECLAIM, however kmemleak
has __GFP_NOFAIL set all the time due to d9570ee3bd ("kmemleak:
allow to coexist with fault injection"). But, it doesn't make any sense
to have __GFP_NOFAIL and ~__GFP_DIRECT_RECLAIM specified at the same
time.
According to the discussion on the mailing list, the commit should be
reverted for short term solution. Catalin Marinas would follow up with
a better solution for longer term.
The failure rate of kmemleak metadata allocation may increase in some
circumstances, but this should be expected side effect.
Link: http://lkml.kernel.org/r/1563299431-111710-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: d9570ee3bd ("kmemleak: allow to coexist with fault injection")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To properly clear the slab on free with slab_want_init_on_free, we walk
the list of free objects using get_freepointer/set_freepointer.
The value we get from get_freepointer may not be valid. This isn't an
issue since an actual value will get written later but this means
there's a chance of triggering a bug if we use this value with
set_freepointer:
kernel BUG at mm/slub.c:306!
invalid opcode: 0000 [#1] PREEMPT PTI
CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-05754-g6471384a #4
RIP: 0010:kfree+0x58a/0x5c0
Code: 48 83 05 78 37 51 02 01 0f 0b 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 7e 37 51 02 01 48 83 05 d6 37 51 02 01 <0f> 0b 48 83 05 d4 37 51 02 01 48 83 05 d4 37 51 02 01 48 83 05 d4
RSP: 0000:ffffffff82603d90 EFLAGS: 00010002
RAX: ffff8c3976c04320 RBX: ffff8c3976c04300 RCX: 0000000000000000
RDX: ffff8c3976c04300 RSI: 0000000000000000 RDI: ffff8c3976c04320
RBP: ffffffff82603db8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8c3976c04320 R11: ffffffff8289e1e0 R12: ffffd52cc8db0100
R13: ffff8c3976c01a00 R14: ffffffff810f10d4 R15: ffff8c3976c04300
FS: 0000000000000000(0000) GS:ffffffff8266b000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8c397ffff000 CR3: 0000000125020000 CR4: 00000000000406b0
Call Trace:
apply_wqattrs_prepare+0x154/0x280
apply_workqueue_attrs_locked+0x4e/0xe0
apply_workqueue_attrs+0x36/0x60
alloc_workqueue+0x25a/0x6d0
workqueue_init_early+0x246/0x348
start_kernel+0x3c7/0x7ec
x86_64_start_reservations+0x40/0x49
x86_64_start_kernel+0xda/0xe4
secondary_startup_64+0xb6/0xc0
Modules linked in:
---[ end trace f67eb9af4d8d492b ]---
Fix this by ensuring the value we set with set_freepointer is either NULL
or another value in the chain.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Laura Abbott <labbott@redhat.com>
Fixes: 6471384af2 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the locking around nouveau's use of the hmm_range_* APIs. It works
correctly in the success case, but many of the the edge cases have missing
unlocks or double unlocks.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl07BG0ACgkQOG33FX4g
mxr/RA//b1t3rTjyYlzEGpCFouDAJrV8mRrmPZtywzSxhiyKgylWiQ9D5HyAZ8ZG
evEF1xFe0PcTKiieqnZCJBPh864t+yt9Mm45MpWamBNoHx7WPSdeOMbSDUNvQR+H
8aWTGBZvdKlqpwD63yvk7C6jkZ6vXDNYROnM395gzlfmaVGBeLygXqcKUkiW1x+D
1CK+KsBldacxH/gE2X966mXxG46/5VL8KDVoo4VVnpLMDRdRs6zbIBRj7l9+hWbh
2HABQyvDJW4tYmUW5iHAoLV2fAIE/nJMprEabXvd6rFAPwbryBroguXffGqkIaa0
Ce1LIhiakCUniK2XgP2W/+KwJQBNp3hQjJr+ip7hgQCtzcD8zRYSxDt5gUtbjpGd
4JfXrRVrfa08/hBe4adPfE5W5mW3oyEyRHldToT0SrywIY8sTLjN7RdCMwOqrxoR
QkgqDISLqJab1OQEPHr7QgsgO2c2k19yPpckSZJ+IIldpNtLa9V+eif85NZ/esOd
2GTWph3UQiACp9fLgEIAvJUnZ0blZpYq9TYshWWYkO34M+KgBdqOn1cQhZH+4rWb
0Ed/jGdIaPZZ7XaLDgz5e7jl+t+kmSBdqSQtunF4bbu7AwR/zt3es0jq2vFoD451
syF2vSVKyoBZMESX8X0O2cv+HHpN5oqH1XLI1ABOO09X9lxAPl4=
=ZdrW
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull HMM fixes from Jason Gunthorpe:
"Fix the locking around nouveau's use of the hmm_range_* APIs. It works
correctly in the success case, but many of the the edge cases have
missing unlocks or double unlocks.
The diffstat is a bit big as Christoph did a comprehensive job to move
the obsolete API from the core header and into the driver before
fixing its flow, but the risk of regression from this code motion is
low"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
nouveau: unlock mmap_sem on all errors from nouveau_range_fault
nouveau: remove the block parameter to nouveau_range_fault
mm/hmm: move hmm_vma_range_done and hmm_vma_fault to nouveau
mm/hmm: always return EBUSY for invalid ranges in hmm_range_{fault,snapshot}
Fixes in the iommu and balloon devices.
Disable the meta-data optimization for now - I hope we can get it fixed
shortly, but there's no point in making users suffer crashes while we
are working on that.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJdPV3yAAoJECgfDbjSjVRp5qAIAIbzdgGkkuill7++e05fo3zJ
Vus5ApnFb+VopuiKFAxHyrRhvFun2dftcpOEFC6qpZ1xMcErRa1JTDp+Z70gLPcf
ZYrT7WoJv202cTQLjlrKwMA4C+hNTGf86KZWls+uzTXngbsrzib99M89wjOTP6UW
fslOtznbaHw/oPqQSiL40vNUEhU6thnvSxWpaIGJTnU9cx508Q7dE8TpLA5UpuNj
0y0+0HJrwlNdO2CSOay+dLEkZ/3M0vbXxwcmMNwoPIOx3N58ScCTLF3w6/Zuudco
XGhUzY6K5UqonVRVoxXMsQru9ZiAhKGMnf3+ugUojm+riPFOrWBbMNkU7mmNIo0=
=nw3y
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost fixes from Michael Tsirkin:
- Fixes in the iommu and balloon devices.
- Disable the meta-data optimization for now - I hope we can get it
fixed shortly, but there's no point in making users suffer crashes
while we are working on that.
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: disable metadata prefetch optimization
iommu/virtio: Update to most recent specification
balloon: fix up comments
mm/balloon_compaction: avoid duplicate page removal
Since hmm_range_fault() doesn't use the struct hmm_range vma field, remove
it.
Link: https://lore.kernel.org/r/20190726005650.2566-8-rcampbell@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
walk_page_range() will only call hmm_vma_walk_hugetlb_entry() for
hugetlbfs pages and doesn't call hmm_vma_walk_pmd() in this case.
Therefore, it is safe to remove the check for vma->vm_flags & VM_HUGETLB
in hmm_vma_walk_pmd().
Link: https://lore.kernel.org/r/20190726005650.2566-7-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a HMM_FAULT_SNAPSHOT flag so that hmm_range_snapshot can be merged
into the almost identical hmm_range_fault function.
Link: https://lore.kernel.org/r/20190726005650.2566-5-rcampbell@nvidia.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This allows easier expansion to other flags, and also makes the callers a
little easier to read.
Link: https://lore.kernel.org/r/20190726005650.2566-4-rcampbell@nvidia.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A few more comments and minor programming style clean ups. There should
be no functional changes.
Link: https://lore.kernel.org/r/20190726005650.2566-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The hmm_mirror_ops callback function sync_cpu_device_pagetables() passes a
struct hmm_update which is a simplified version of struct
mmu_notifier_range. This is unnecessary so replace hmm_update with
mmu_notifier_range directly.
Link: https://lore.kernel.org/r/20190726005650.2566-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
[jgg: white space tuning]
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The magic dropping of mmap_sem when handle_mm_fault returns VM_FAULT_RETRY
is rather subtile. Add a comment explaining it.
Link: https://lore.kernel.org/r/20190724065258.16603-8-hch@lst.de
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
[hch: wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
We should not have two different error codes for the same
condition. EAGAIN must be reserved for the FAULT_FLAG_ALLOW_RETRY retry
case and signals to the caller that the mmap_sem has been unlocked.
Use EBUSY for the !valid case so that callers can get the locking right.
Link: https://lore.kernel.org/r/20190724065258.16603-2-hch@lst.de
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
[jgg: elaborated commit message]
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lots of comments bitrotted. Fix them up.
Fixes: 418a3ab1e7 (mm/balloon_compaction: List interfaces)
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Nadav Amit <namit@vmware.com>
A #GP is reported in the guest when requesting balloon inflation via
virtio-balloon. The reason is that the virtio-balloon driver has
removed the page from its internal page list (via balloon_page_pop),
but balloon_page_enqueue_one also calls "list_del" to do the removal.
This is necessary when it's used from balloon_page_enqueue_list, but
not from balloon_page_enqueue.
Move list_del to balloon_page_enqueue, and update comments accordingly.
Fixes: 418a3ab1e7 (mm/balloon_compaction: List interfaces)
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On x86-32 with PTI enabled, parts of the kernel page-tables are not shared
between processes. This can cause mappings in the vmalloc/ioremap area to
persist in some page-tables after the region is unmapped and released.
When the region is re-used the processes with the old mappings do not fault
in the new mappings but still access the old ones.
This causes undefined behavior, in reality often data corruption, kernel
oopses and panics and even spontaneous reboots.
Fix this problem by activly syncing unmaps in the vmalloc/ioremap area to
all page-tables in the system before the regions can be re-used.
References: https://bugzilla.suse.com/show_bug.cgi?id=1118689
Fixes: 5d72b4fba4 ('x86, mm: support huge I/O mapping capability I/F')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20190719184652.11391-4-joro@8bytes.org
Pull vfs mount updates from Al Viro:
"The first part of mount updates.
Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...
Merge yet more updates from Andrew Morton:
"The rest of MM and a kernel-wide procfs cleanup.
Summary of the more significant patches:
- Patch series "mm/memory_hotplug: Factor out memory block
devicehandling", v3. David Hildenbrand.
Some spring-cleaning of the memory hotplug code, notably in
drivers/base/memory.c
- "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
Shi.
Fix /proc/pid/smaps output for THP pages used in shmem.
- "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.
Bugfix and speedup for kernel/resource.c
- Patch series "mm: Further memory block device cleanups", David
Hildenbrand.
More spring-cleaning of the memory hotplug code.
- Patch series "mm: Sub-section memory hotplug support". Dan
Williams.
Generalise the memory hotplug code so that pmem can use it more
completely. Then remove the hacks from the libnvdimm code which
were there to work around the memory-hotplug code's constraints.
- "proc/sysctl: add shared variables for range check", Matteo Croce.
We have about 250 instances of
int zero;
...
.extra1 = &zero,
in the tree. This is a tree-wide sweep to make all those private
"zero"s and "one"s use global variables.
Alas, it isn't practical to make those two global integers const"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits)
proc/sysctl: add shared variables for range check
mm: migrate: remove unused mode argument
mm/sparsemem: cleanup 'section number' data types
libnvdimm/pfn: stop padding pmem namespaces to section alignment
libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
mm/devm_memremap_pages: enable sub-section remap
mm: document ZONE_DEVICE memory-model implications
mm/sparsemem: support sub-section hotplug
mm/sparsemem: prepare for sub-section ranges
mm: kill is_dev_zone() helper
mm/hotplug: kill is_dev_zone() usage in __remove_pages()
mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
mm/sparsemem: add helpers track active portions of a section at boot
mm/sparsemem: introduce a SECTION_IS_EARLY flag
mm/sparsemem: introduce struct mem_section_usage
drivers/base/memory.c: get rid of find_memory_block_hinted()
mm/memory_hotplug: move and simplify walk_memory_blocks()
mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
mm: make register_mem_sect_under_node() static
...
migrate_page_move_mapping() doesn't use the mode argument. Remove it
and update callers accordingly.
Link: http://lkml.kernel.org/r/20190508210301.8472-1-keith.busch@intel.com
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David points out that there is a mixture of 'int' and 'unsigned long'
usage for section number data types. Update the memory hotplug path to
use 'unsigned long' consistently for section numbers.
[akpm@linux-foundation.org: fix printk format]
Link: http://lkml.kernel.org/r/156107543656.1329419.11505835211949439815.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The libnvdimm sub-system has suffered a series of hacks and broken
workarounds for the memory-hotplug implementation's awkward
section-aligned (128MB) granularity.
For example the following backtrace is emitted when attempting
arch_add_memory() with physical address ranges that intersect 'System
RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:
# cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
100000000-1ffffffff : System RAM
200000000-303ffffff : Persistent Memory (legacy)
304000000-43fffffff : System RAM
440000000-23ffffffff : Persistent Memory
2400000000-43bfffffff : Persistent Memory
2400000000-43bfffffff : namespace2.0
WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
[..]
RIP: 0010:add_pages+0x5c/0x60
[..]
Call Trace:
devm_memremap_pages+0x460/0x6e0
pmem_attach_disk+0x29e/0x680 [nd_pmem]
? nd_dax_probe+0xfc/0x120 [libnvdimm]
nvdimm_bus_probe+0x66/0x160 [libnvdimm]
It was discovered that the problem goes beyond RAM vs PMEM collisions as
some platform produce PMEM vs PMEM collisions within a given section.
The libnvdimm workaround for that case revealed that the libnvdimm
section-alignment-padding implementation has been broken for a long
while.
A fix for that long-standing breakage introduces as many problems as it
solves as it would require a backward-incompatible change to the
namespace metadata interpretation. Instead of that dubious route [1],
address the root problem in the memory-hotplug implementation.
Note that EEXIST is no longer treated as success as that is how
sparse_add_section() reports subsection collisions, it was also obviated
by recent changes to perform the request_region() for 'System RAM'
before arch_add_memory() in the add_memory() sequence.
[1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[osalvador@suse.de: fix deactivate_section for early sections]
Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prepare the memory hot-{add,remove} paths for handling sub-section
ranges by plumbing the starting page frame and number of pages being
handled through arch_{add,remove}_memory() to
sparse_{add,remove}_one_section().
This is simply plumbing, small cleanups, and some identifier renames.
No intended functional changes.
Link: http://lkml.kernel.org/r/156092353780.979959.9713046515562743194.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Given there are no more usages of is_dev_zone() outside of 'ifdef
CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.
Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zone type check was a leftover from the cleanup that plumbed altmap
through the memory hotplug path, i.e. commit da024512a1 "mm: pass the
vmem_altmap to arch_remove_memory and __remove_pages".
Link: http://lkml.kernel.org/r/156092352642.979959.6664333788149363039.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow sub-section sized ranges to be added to the memmap.
populate_section_memmap() takes an explict pfn range rather than
assuming a full section, and those parameters are plumbed all the way
through to vmmemap_populate(). There should be no sub-section usage in
current deployments. New warnings are added to clarify which memmap
allocation paths are sub-section capable.
Link: http://lkml.kernel.org/r/156092352058.979959.6551283472062305149.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sub-section hotplug support reduces the unit of operation of hotplug
from section-sized-units (PAGES_PER_SECTION) to sub-section-sized units
(PAGES_PER_SUBSECTION). Teach shrink_{zone,pgdat}_span() to consider
PAGES_PER_SUBSECTION boundaries as the points where pfn_valid(), not
valid_section(), can toggle.
[osalvador@suse.de: fix shrink_{zone,node}_span]
Link: http://lkml.kernel.org/r/20190717090725.23618-3-osalvador@suse.de
Link: http://lkml.kernel.org/r/156092351496.979959.12703722803097017492.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
sub-section active bitmask, each bit representing a PMD_SIZE span of the
architecture's memory hotplug section size.
The implications of a partially populated section is that pfn_valid()
needs to go beyond a valid_section() check and either determine that the
section is an "early section", or read the sub-section active ranges
from the bitmask. The expectation is that the bitmask (subsection_map)
fits in the same cacheline as the valid_section() / early_section()
data, so the incremental performance overhead to pfn_valid() should be
negligible.
The rationale for using early_section() to short-ciruit the
subsection_map check is that there are legacy code paths that use
pfn_valid() at section granularity before validating the pfn against
pgdat data. So, the early_section() check allows those traditional
assumptions to persist while also permitting subsection_map to tell the
truth for purposes of populating the unused portions of early sections
with PMEM and other ZONE_DEVICE mappings.
Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Jane Chu <jane.chu@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for sub-section hotplug, track whether a given section
was created during early memory initialization, or later via memory
hotplug. This distinction is needed to maintain the coarse expectation
that pfn_valid() returns true for any pfn within a given section even if
that section has pages that are reserved from the page allocator.
For example one of the of goals of subsection hotplug is to support
cases where the system physical memory layout collides System RAM and
PMEM within a section. Several pfn_valid() users expect to just check
if a section is valid, but they are not careful to check if the given
pfn is within a "System RAM" boundary and instead expect pgdat
information to further validate the pfn.
Rather than unwind those paths to make their pfn_valid() queries more
precise a follow on patch uses the SECTION_IS_EARLY flag to maintain the
traditional expectation that pfn_valid() returns true for all early
sections.
Link: https://lore.kernel.org/lkml/1560366952-10660-1-git-send-email-cai@lca.pw/
Link: http://lkml.kernel.org/r/156092350358.979959.5817209875548072819.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: Sub-section memory hotplug support", v10.
The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and can not be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has and continues to wreak havoc with 'device-memory'
use cases, persistent memory (pmem) in particular. Recall that pmem
uses devm_memremap_pages(), and subsequently arch_add_memory(), to
allocate a 'struct page' memmap for pmem. However, it does not use the
'bottom half' of memory hotplug, i.e. never marks pmem pages online and
never exposes the userspace memblock interface for pmem. This leaves an
opening to redress the section-size constraint.
To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes this physical memory alignment of pmem from one boot to the
next. Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.
It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the current design assumptions in the current code and how they are
addressed in the new implementation:
Current design assumptions:
- Sections that describe boot memory (early sections) are never
unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
valid_section() check
- __add_pages() and helper routines assume all operations occur in
PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections
New design assumptions:
- Sections are instrumented with a sub-section bitmask to track (on
x86) individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional
sub-sections, and those sub-sections can be removed with
arch_remove_memory(). With this in place we no longer lose usable
memory capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also
check the active-sub-section mask. This indication is in the same
cacheline as the valid_section() so the performance impact is
expected to be negligible. So far the lkp robot has not reported any
regressions.
- Outside of the core vmemmap population routines which are replaced,
other helper routines like shrink_{zone,pgdat}_span() are updated to
handle the smaller granularity. Core memory hotplug routines that
deal with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are not
touched since this capability is limited to !online
!memblock-sysfs-accessible sections.
Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3]. In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.
These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].
[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
This patch (of 13):
Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.
A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap. The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.
SUBSECTION_SHIFT is defined as global constant instead of per-architecture
value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
subsection users. Specifically a common subsection size allows for the
possibility that persistent memory namespace configurations be made
compatible across architectures.
The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section. The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.
Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to usemap allocation path, there are no
expected behavior changes.
Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's move walk_memory_blocks() to the place where memory block logic
resides and simplify it. While at it, add a type for the callback
function.
Link: http://lkml.kernel.org/r/20190614100114.311-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
walk_memory_range() was once used to iterate over sections. Now, it
iterates over memory blocks. Rename the function, fixup the
documentation.
Also, pass start+size instead of PFNs, which is what most callers
already have at hand. (we'll rework link_mem_sections() most probably
soon)
Follow-up patches will rework, simplify, and move walk_memory_blocks()
to drivers/base/memory.c.
Note: walk_memory_blocks() only works correctly right now if the
start_pfn is aligned to a section start. This is the case right now,
but we'll generalize the function in a follow up patch so the semantics
match the documentation.
[akpm@linux-foundation.org: remove unused variable]
Link: http://lkml.kernel.org/r/20190614100114.311-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: Further memory block device cleanups", v1.
Some further cleanups around memory block devices. Especially, clean up
and simplify walk_memory_range(). Including some other minor cleanups.
This patch (of 6):
We are using a mixture of "int" and "unsigned long". Let's make this
consistent by using "unsigned long" everywhere. We'll do the same with
memory block ids next.
While at it, turn the "unsigned long i" in removable_show() into an int
- sections_per_block is an int.
[akpm@linux-foundation.org: s/unsigned long i/unsigned long nr/]
[david@redhat.com: v3]
Link: http://lkml.kernel.org/r/20190620183139.4352-2-david@redhat.com
Link: http://lkml.kernel.org/r/20190614100114.311-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7635d9cbe8 ("mm, thp, proc: report THP eligibility for each
vma") introduced THPeligible bit for processes' smaps. But, when
checking the eligibility for shmem vma, __transparent_hugepage_enabled()
is called to override the result from shmem_huge_enabled(). It may
result in the anonymous vma's THP flag override shmem's. For example,
running a simple test which create THP for shmem, but with anonymous THP
disabled, when reading the process's smaps, it may show:
7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
Size: 4096 kB
...
[snip]
...
ShmemPmdMapped: 4096 kB
...
[snip]
...
THPeligible: 0
And, /proc/meminfo does show THP allocated and PMD mapped too:
ShmemHugePages: 4096 kB
ShmemPmdMapped: 4096 kB
This doesn't make too much sense. The shmem objects should be treated
separately from anonymous THP. Calling shmem_huge_enabled() with
checking MMF_DISABLE_THP sounds good enough. And, we could skip stack
and dax vma check since we already checked if the vma is shmem already.
Also check if vma is suitable for THP by calling
transhuge_vma_suitable().
And minor fix to smaps output format and documentation.
Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com
Fixes: 7635d9cbe8 ("mm, thp, proc: report THP eligibility for each vma")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
transhuge_vma_suitable() was only available for shmem THP, but anonymous
THP has the same check except pgoff check. And, it will be used for THP
eligible check in the later patch, so make it available for all kind of
THPs. This also helps reduce code duplication slightly.
Since anonymous THP doesn't have to check pgoff, so make pgoff check
shmem vma only.
And regroup some functions in include/linux/mm.h to solve compile issue
since transhuge_vma_suitable() needs call vma_is_anonymous() which was
defined after huge_mm.h is included.
[akpm@linux-foundation.org: fix typo]
[yang.shi@linux.alibaba.com: v4]
Link: http://lkml.kernel.org/r/1563400758-124759-2-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1560401041-32207-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case of NODE_NOT_IN_PAGE_FLAGS is set, we store section's node id in
section_to_node_table[]. While for hot-add memory, this is missed.
Without this information, page_to_nid() may not give the right node id.
BTW, current online_pages works because it leverages nid in
memory_block. But the granularity of node id should be mem_section
wide.
Link: http://lkml.kernel.org/r/20190618005537.18878-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The parameter is unused, so let's drop it. Memory removal paths should
never care about zones. This is the job of memory offlining and will
require more refactorings.
Link: http://lkml.kernel.org/r/20190527111152.16324-12-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's factor out removing of memory block devices, which is only
necessary for memory added via add_memory() and friends that created
memory block devices. Remove the devices before calling
arch_remove_memory().
This finishes factoring out memory block device handling from
arch_add_memory() and arch_remove_memory().
Link: http://lkml.kernel.org/r/20190527111152.16324-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No longer needed, the callers of arch_add_memory() can handle this
manually.
Link: http://lkml.kernel.org/r/20190527111152.16324-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only memory to be added to the buddy and to be onlined/offlined by user
space using /sys/devices/system/memory/... needs (and should have!)
memory block devices.
Factor out creation of memory block devices. Create all devices after
arch_add_memory() succeeded. We can later drop the want_memblock
parameter, because it is now effectively stale.
Only after memory block devices have been added, memory can be onlined
by user space. This implies, that memory is not visible to user space
at all before arch_add_memory() succeeded.
While at it
- use WARN_ON_ONCE instead of BUG_ON in moved unregister_memory()
- introduce find_memory_block_by_id() to search via block id
- Use find_memory_block_by_id() in init_memory_block() to catch
duplicates
Link: http://lkml.kernel.org/r/20190527111152.16324-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want to improve error handling while adding memory by allowing to use
arch_remove_memory() and __remove_pages() even if
CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like:
arch_add_memory()
rc = do_something();
if (rc) {
arch_remove_memory();
}
We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
quite some dependencies for memory offlining.
Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/memory_hotplug: Factor out memory block devicehandling", v3.
We only want memory block devices for memory to be onlined/offlined
(add/remove from the buddy). This is required so user space can
online/offline memory and kdump gets notified about newly onlined
memory.
Let's factor out creation/removal of memory block devices. This helps
to further cleanup arch_add_memory/arch_remove_memory() and to make
implementation of new features easier - especially sub-section memory
hot add from Dan.
Anshuman Khandual is currently working on arch_remove_memory(). I added
a temporary solution via "arm64/mm: Add temporary arch_remove_memory()
implementation", that is sufficient as a firsts tep in the context of
this series. (we don't cleanup page tables in case anything goes wrong
already)
Did a quick sanity test with DIMM plug/unplug, making sure all devices
and sysfs links properly get added/removed. Compile tested on s390x and
x86-64.
This patch (of 11):
By converting start and size to page granularity, we actually ignore
unaligned parts within a page instead of properly bailing out with an
error.
Link: http://lkml.kernel.org/r/20190527111152.16324-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jun Yao <yaojun8558363@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add user space specific memory reading for kprobes
- Allow kprobes to be executed earlier in boot
The rest are mostly just various clean ups and small fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXS88txQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhaPAQDHaAmu6wXtZjZE6GU4ZP61UNgDECmZ
4wlGrNc1AAlqAQD/QC8339p37aDCp9n27VY1wmJwF3nca+jAHfQLqWkkYgw=
=n/tz
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The main changes in this release include:
- Add user space specific memory reading for kprobes
- Allow kprobes to be executed earlier in boot
The rest are mostly just various clean ups and small fixes"
* tag 'trace-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
tracing: Make trace_get_fields() global
tracing: Let filter_assign_type() detect FILTER_PTR_STRING
tracing: Pass type into tracing_generic_entry_update()
ftrace/selftest: Test if set_event/ftrace_pid exists before writing
ftrace/selftests: Return the skip code when tracing directory not configured in kernel
tracing/kprobe: Check registered state using kprobe
tracing/probe: Add trace_event_call accesses APIs
tracing/probe: Add probe event name and group name accesses APIs
tracing/probe: Add trace flag access APIs for trace_probe
tracing/probe: Add trace_event_file access APIs for trace_probe
tracing/probe: Add trace_event_call register API for trace_probe
tracing/probe: Add trace_probe init and free functions
tracing/uprobe: Set print format when parsing command
tracing/kprobe: Set print format right after parsed command
kprobes: Fix to init kprobes in subsys_initcall
tracepoint: Use struct_size() in kmalloc()
ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS
ftrace: Enable trampoline when rec count returns back to one
tracing/kprobe: Do not run kprobe boot tests if kprobe_event is on cmdline
tracing: Make a separate config for trace event self tests
...
Merge more updates from Andrew Morton:
"VM:
- z3fold fixes and enhancements by Henry Burns and Vitaly Wool
- more accurate reclaimed slab caches calculations by Yafang Shao
- fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
Christoph Hellwig
- !CONFIG_MMU fixes by Christoph Hellwig
- new novmcoredd parameter to omit device dumps from vmcore, by
Kairui Song
- new test_meminit module for testing heap and pagealloc
initialization, by Alexander Potapenko
- ioremap improvements for huge mappings, by Anshuman Khandual
- generalize kprobe page fault handling, by Anshuman Khandual
- device-dax hotplug fixes and improvements, by Pavel Tatashin
- enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V
- add pte_devmap() support for arm64, by Robin Murphy
- unify locked_vm accounting with a helper, by Daniel Jordan
- several misc fixes
core/lib:
- new typeof_member() macro including some users, by Alexey Dobriyan
- make BIT() and GENMASK() available in asm, by Masahiro Yamada
- changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
code generation, by Alexey Dobriyan
- rbtree code size optimizations, by Michel Lespinasse
- convert struct pid count to refcount_t, by Joel Fernandes
get_maintainer.pl:
- add --no-moderated switch to skip moderated ML's, by Joe Perches
misc:
- ptrace PTRACE_GET_SYSCALL_INFO interface
- coda updates
- gdb scripts, various"
[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
fs/select.c: use struct_size() in kmalloc()
mm: add account_locked_vm utility function
arm64: mm: implement pte_devmap support
mm: introduce ARCH_HAS_PTE_DEVMAP
mm: clean up is_device_*_page() definitions
mm/mmap: move common defines to mman-common.h
mm: move MAP_SYNC to asm-generic/mman-common.h
device-dax: "Hotremove" persistent memory that is used like normal RAM
mm/hotplug: make remove_memory() interface usable
device-dax: fix memory and resource leak if hotplug fails
include/linux/lz4.h: fix spelling and copy-paste errors in documentation
ipc/mqueue.c: only perform resource calculation if user valid
include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
scripts/gdb: add helpers to find and list devices
scripts/gdb: add lx-genpd-summary command
drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
kernel/pid.c: convert struct pid count to refcount_t
drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
...
locked_vm accounting is done roughly the same way in five places, so
unify them in a helper.
Include the helper's caller in the debug print to distinguish between
callsites.
Error codes stay the same, so user-visible behavior does too. The one
exception is that the -EPERM case in tce_account_locked_vm is removed
because Alexey has never seen it triggered.
[daniel.m.jordan@oracle.com: v3]
Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
[sfr@canb.auug.org.au: fix mm/util.c]
Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Alan Tull <atull@kernel.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Moritz Fischer <mdf@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Wu Hao <hao.wu@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
with the long-out-of-date comment can lead to the impression than an
architecture may just enable it (since __add_pages() now "comprehends
device memory" for itself) and expect things to work.
In practice, however, ZONE_DEVICE users have little chance of
functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
dependency so the real situation is clearer.
Link: http://lkml.kernel.org/r/87554aa78478a02a63f2c4cf60a847279ae3eb3b.1558547956.git.robin.murphy@arm.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently the remove_memory() interface is inherently broken. It tries
to remove memory but panics if some memory is not offline. The problem
is that it is impossible to ensure that all memory blocks are offline as
this function also takes lock_device_hotplug that is required to change
memory state via sysfs.
So, between calling this function and offlining all memory blocks there
is always a window when lock_device_hotplug is released, and therefore,
there is always a chance for a panic during this window.
Make this interface to return an error if memory removal fails. This
way it is safe to call this function without panicking machine, and also
makes it symmetric to add_memory() which already returns an error.
Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can't expose UAPI symbols differently based on CONFIG_ symbols, as
userspace won't have them available. Instead always define the flag,
but only respect it based on the config option.
Link: http://lkml.kernel.org/r/20190703122359.18200-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The description of cma_declare_contiguous() indicates that if the
'fixed' argument is true the reserved contiguous area must be exactly at
the address of the 'base' argument.
However, the function currently allows the 'base', 'size', and 'limit'
arguments to be silently adjusted to meet alignment constraints. This
commit enforces the documented behavior through explicit checks that
return an error if the region does not fit within a specified region.
Link: http://lkml.kernel.org/r/1561422051-16142-1-git-send-email-opendmb@gmail.com
Fixes: 5ea3b1b2f8 ("cma: add placement specifier for "cma=" kernel parameter")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Yue Hu <huyue2@yulong.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Peng Fan <peng.fan@nxp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
z3fold_page_migration() calls memcpy(new_zhdr, zhdr, PAGE_SIZE).
However, zhdr contains fields that can't be directly coppied over (ex:
list_head, a circular linked list). We only need to initialize the
linked lists in new_zhdr, as z3fold_isolate_page() already ensures that
these lists are empty
Additionally it is possible that zhdr->work has been placed in a
workqueue. In this case we shouldn't migrate the page, as zhdr->work
references zhdr as opposed to new_zhdr.
Link: http://lkml.kernel.org/r/20190716000520.230595-1-henryburns@google.com
Fixes: 1f862989b0 ("mm/z3fold.c: support page migration")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Jonathan Adams <jwadams@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
z3fold_page_migrate() will never succeed because it attempts to acquire
a lock that has already been taken by migrate.c in __unmap_and_move().
__unmap_and_move() migrate.c
trylock_page(oldpage)
move_to_new_page(oldpage_newpage)
a_ops->migrate_page(oldpage, newpage)
z3fold_page_migrate(oldpage, newpage)
trylock_page(oldpage)
Link: http://lkml.kernel.org/r/20190710213238.91835-1-henryburns@google.com
Fixes: 1f862989b0 ("mm/z3fold.c: support page migration")
Signed-off-by: Henry Burns <henryburns@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Snild Dolkow <snild@sony.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Six sites are presently altering current->reclaim_state. There is a
risk that one function stomps on a caller's value. Use a helper
function to catch such errors.
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are six different reclaim paths by now:
- kswapd reclaim path
- node reclaim path
- hibernate preallocate memory reclaim path
- direct reclaim path
- memcg reclaim path
- memcg softlimit reclaim path
The slab caches reclaimed in these paths are only calculated in the
above three paths.
There're some drawbacks if we don't calculate the reclaimed slab caches.
- The sc->nr_reclaimed isn't correct if there're some slab caches
relcaimed in this path.
- The slab caches may be reclaimed thoroughly if there're lots of
reclaimable slab caches and few page caches.
Let's take an easy example for this case. If one memcg is full of
slab caches and the limit of it is 512M, in other words there're
approximately 512M slab caches in this memcg. Then the limit of the
memcg is reached and the memcg reclaim begins, and then in this memcg
reclaim path it will continuesly reclaim the slab caches until the
sc->priority drops to 0. After this reclaim stops, you will find
there're few slab caches left, which is less than 20M in my test
case. While after this patch applied the number is greater than 300M
and the sc->priority only drops to 3.
Link: http://lkml.kernel.org/r/1561112086-6169-3-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/vmscan: calculate reclaimed slab in all reclaim paths".
This patchset is to fix the issues in doing shrink slab.
There're six different reclaim paths by now,
- kswapd reclaim path
- node reclaim path
- hibernate preallocate memory reclaim path
- direct reclaim path
- memcg reclaim path
- memcg softlimit reclaim path
The slab caches reclaimed in these paths are only calculated in the
above three paths. The issues are detailed explained in patch #2. We
should calculate the reclaimed slab caches in every reclaim path. In
order to do it, the struct reclaim_state is placed into the struct
shrink_control.
In node reclaim path, there'is another issue about shrinking slab, which
is adressed in "mm/vmscan: shrink slab in node reclaim"
(https://lore.kernel.org/linux-mm/1559874946-22960-1-git-send-email-laoar.shao@gmail.com/).
This patch (of 2):
The struct reclaim_state is used to record how many slab caches are
reclaimed in one reclaim path. The struct shrink_control is used to
control one reclaim path. So we'd better put reclaim_state into
shrink_control.
[laoar.shao@gmail.com: remove reclaim_state assignment from __perform_reclaim()]
Link: http://lkml.kernel.org/r/1561381582-13697-1-git-send-email-laoar.shao@gmail.com
Link: http://lkml.kernel.org/r/1561112086-6169-2-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 815744d751 ("mm: memcontrol: don't batch updates of local
VM stats and events"), the local VM counter are not in sync with the
hierarchical ones.
Below is one example in a leaf memcg on my server (with 8 CPUs):
inactive_file 3567570944
total_inactive_file 3568029696
We find that the deviation is very great because the 'val' in
__mod_memcg_state() is in pages while the effective value in
memcg_stat_show() is in bytes.
So the maximum of this deviation between local VM stats and total VM
stats can be (32 * number_of_cpu * PAGE_SIZE), that may be an
unacceptably great value.
We should keep the local VM stats in sync with the total stats. In
order to keep this behavior the same across counters, this patch updates
__mod_lruvec_state() and __count_memcg_events() as well.
Link: http://lkml.kernel.org/r/1562851979-10610-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of the gfp flags used to show that a page is movable is
__GFP_HIGHMEM. Currently z3fold_alloc() fails when __GFP_HIGHMEM is
passed. Now that z3fold pages are movable, we allow __GFP_HIGHMEM. We
strip the movability related flags from the call to kmem_cache_alloc()
for our slots since it is a kernel allocation.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190712222118.108192-1-henryburns@google.com
Signed-off-by: Henry Burns <henryburns@google.com>
Acked-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A comment referred to a non-existent function alloc_cma(), which should
have been cma_alloc().
Link: http://lkml.kernel.org/r/20190712085549.5920-1-ryh.szk.cmnty@gmail.com
Signed-off-by: Ryohei Suzuki <ryh.szk.cmnty@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clang gets rather confused about two variables in the same special
section when one of them is not initialized, leading to an assembler
warning later:
/tmp/slab_common-18f869.s: Assembler messages:
/tmp/slab_common-18f869.s:7526: Warning: ignoring changed section attributes for .data..ro_after_init
Adding an initialization to kmalloc_caches is rather silly here
but does avoid the issue.
Link: https://bugs.llvm.org/show_bug.cgi?id=42570
Link: http://lkml.kernel.org/r/20190712090455.266021-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When CONFIG_SYSFS is disabled but CONFIG_TMPFS is enabled, we get a
warning about shmem_parse_huge() never being called:
mm/shmem.c:417:12: error: unused function 'shmem_parse_huge' [-Werror,-Wunused-function]
static int shmem_parse_huge(const char *str)
Change the #ifdef so we no longer build this function in that configuration.
Link: http://lkml.kernel.org/r/20190712091141.673355-1-arnd@arndb.de
Fixes: 144df3b288c4 ("vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vineeth Remanan Pillai <vpillai@digitalocean.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As reported by Henry Burns:
Running z3fold stress testing with address sanitization showed zhdr->slots
was being used after it was freed.
z3fold_free(z3fold_pool, handle)
free_handle(handle)
kmem_cache_free(pool->c_handle, zhdr->slots)
release_z3fold_page_locked_list(kref)
__release_z3fold_page(zhdr, true)
zhdr_to_pool(zhdr)
slots_to_pool(zhdr->slots) *BOOM*
To fix this, add pointer to the pool back to z3fold_header and modify
zhdr_to_pool to return zhdr->pool.
Link: http://lkml.kernel.org/r/20190708134808.e89f3bfadd9f6ffd7eff9ba9@gmail.com
Fixes: 7c2b8baa61 ("mm/z3fold.c: add structure for buddy handles")
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reported-by: Henry Burns <henryburns@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Adams <jwadams@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The stuff under sysctl describes /sys interface from userspace
point of view. So, add it to the admin-guide and remove the
:orphan: from its index file.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Rename the /proc/sys/ documentation files to ReST, using the
README file as a template for an index.rst, adding the other
files there via TOC markup.
Despite being written on different times with different
styles, try to make them somewhat coherent with a similar
look and feel, ensuring that they'll look nice as both
raw text file and as via the html output produced by the
Sphinx build system.
At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror' feature
merged last cycle. In linux-next we now see AMDGPU and nouveau to be
using this API.
- Remove old or transitional hmm APIs. These are hold overs from the past
with no users, or APIs that existed only to manage cross tree conflicts.
There are still a few more of these cleanups that didn't make the merge
window cut off.
- Improve some core mm APIs:
* export alloc_pages_vma() for driver use
* refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
* refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers use
the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
=wKvp
-----END PGP SIGNATURE-----
Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull HMM updates from Jason Gunthorpe:
"Improvements and bug fixes for the hmm interface in the kernel:
- Improve clarity, locking and APIs related to the 'hmm mirror'
feature merged last cycle. In linux-next we now see AMDGPU and
nouveau to be using this API.
- Remove old or transitional hmm APIs. These are hold overs from the
past with no users, or APIs that existed only to manage cross tree
conflicts. There are still a few more of these cleanups that didn't
make the merge window cut off.
- Improve some core mm APIs:
- export alloc_pages_vma() for driver use
- refactor into devm_request_free_mem_region() to manage
DEVICE_PRIVATE resource reservations
- refactor duplicative driver code into the core dev_pagemap
struct
- Remove hmm wrappers of improved core mm APIs, instead have drivers
use the simplified API directly
- Remove DEVICE_PUBLIC
- Simplify the kconfig flow for the hmm users and core code"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
mm: remove the HMM config option
mm: sort out the DEVICE_PRIVATE Kconfig mess
mm: simplify ZONE_DEVICE page private data
mm: remove hmm_devmem_add
mm: remove hmm_vma_alloc_locked_page
nouveau: use devm_memremap_pages directly
nouveau: use alloc_page_vma directly
PCI/P2PDMA: use the dev_pagemap internal refcount
device-dax: use the dev_pagemap internal refcount
memremap: provide an optional internal refcount in struct dev_pagemap
memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
memremap: remove the data field in struct dev_pagemap
memremap: add a migrate_to_ram method to struct dev_pagemap_ops
memremap: lift the devmap_enable manipulation into devm_memremap_pages
memremap: pass a struct dev_pagemap to ->kill and ->cleanup
memremap: move dev_pagemap callbacks into a separate structure
memremap: validate the pagemap type passed to devm_memremap_pages
mm: factor out a devm_request_free_mem_region helper
mm: export alloc_pages_vma
...
Here is the "big" driver core and debugfs changes for 5.3-rc1
It's a lot of different patches, all across the tree due to some api
changes and lots of debugfs cleanups. Because of this, there is going
to be some merge issues with your tree at the moment, I'll follow up
with the expected resolutions to make it easier for you.
Other than the debugfs cleanups, in this set of changes we have:
- bus iteration function cleanups (will cause build warnings
with s390 and coresight drivers in your tree)
- scripts/get_abi.pl tool to display and parse Documentation/ABI
entries in a simple way
- cleanups to Documenatation/ABI/ entries to make them parse
easier due to typos and other minor things
- default_attrs use for some ktype users
- driver model documentation file conversions to .rst
- compressed firmware file loading
- deferred probe fixes
All of these have been in linux-next for a while, with a bunch of merge
issues that Stephen has been patient with me for. Other than the merge
issues, functionality is working properly in linux-next :)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSgpnQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykcwgCfS30OR4JmwZydWGJ7zK/cHqk+KjsAnjOxjC1K
LpRyb3zX29oChFaZkc5a
=XrEZ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs updates from Greg KH:
"Here is the "big" driver core and debugfs changes for 5.3-rc1
It's a lot of different patches, all across the tree due to some api
changes and lots of debugfs cleanups.
Other than the debugfs cleanups, in this set of changes we have:
- bus iteration function cleanups
- scripts/get_abi.pl tool to display and parse Documentation/ABI
entries in a simple way
- cleanups to Documenatation/ABI/ entries to make them parse easier
due to typos and other minor things
- default_attrs use for some ktype users
- driver model documentation file conversions to .rst
- compressed firmware file loading
- deferred probe fixes
All of these have been in linux-next for a while, with a bunch of
merge issues that Stephen has been patient with me for"
* tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
debugfs: make error message a bit more verbose
orangefs: fix build warning from debugfs cleanup patch
ubifs: fix build warning after debugfs cleanup patch
driver: core: Allow subsystems to continue deferring probe
drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
arch_topology: Remove error messages on out-of-memory conditions
lib: notifier-error-inject: no need to check return value of debugfs_create functions
swiotlb: no need to check return value of debugfs_create functions
ceph: no need to check return value of debugfs_create functions
sunrpc: no need to check return value of debugfs_create functions
ubifs: no need to check return value of debugfs_create functions
orangefs: no need to check return value of debugfs_create functions
nfsd: no need to check return value of debugfs_create functions
lib: 842: no need to check return value of debugfs_create functions
debugfs: provide pr_fmt() macro
debugfs: log errors when something goes wrong
drivers: s390/cio: Fix compilation warning about const qualifiers
drivers: Add generic helper to match by of_node
driver_find_device: Unify the match function with class_find_device()
bus_find_device: Unify the match callback with class_find_device
...
Merge updates from Andrew Morton:
"Am experimenting with splitting MM up into identifiable subsystems
perhaps with a view to gitifying it in complex ways. Also with more
verbose "incoming" emails.
Most of MM is here and a few other trees.
Subsystems affected by this patch series:
- hotfixes
- iommu
- scripts
- arch/sh
- ocfs2
- mm:slab-generic
- mm:slub
- mm:kmemleak
- mm:kasan
- mm:cleanups
- mm:debug
- mm:pagecache
- mm:swap
- mm:memcg
- mm:gup
- mm:pagemap
- mm:infrastructure
- mm:vmalloc
- mm:initialization
- mm:pagealloc
- mm:vmscan
- mm:tools
- mm:proc
- mm:ras
- mm:oom-kill
hotfixes:
mm: vmscan: scan anonymous pages on file refaults
mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
mm/memcontrol: fix wrong statistics in memory.stat
mm/z3fold.c: lock z3fold page before __SetPageMovable()
nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi header
MAINTAINERS: nilfs2: update email address
iommu:
include/linux/dmar.h: replace single-char identifiers in macros
scripts:
scripts/decode_stacktrace: match basepath using shell prefix operator, not regex
scripts/decode_stacktrace: look for modules with .ko.debug extension
scripts/spelling.txt: drop "sepc" from the misspelling list
scripts/spelling.txt: add spelling fix for prohibited
scripts/decode_stacktrace: Accept dash/underscore in modules
scripts/spelling.txt: add more spellings to spelling.txt
arch/sh:
arch/sh/configs/sdk7786_defconfig: remove CONFIG_LOGFS
sh: config: remove left-over BACKLIGHT_LCD_SUPPORT
sh: prevent warnings when using iounmap
ocfs2:
fs: ocfs: fix spelling mistake "hearbeating" -> "heartbeat"
ocfs2/dlm: use struct_size() helper
ocfs2: add last unlock times in locking_state
ocfs2: add locking filter debugfs file
ocfs2: add first lock wait time in locking_state
ocfs: no need to check return value of debugfs_create functions
fs/ocfs2/dlmglue.c: unneeded variable: "status"
ocfs2: use kmemdup rather than duplicating its implementation
mm:slab-generic:
Patch series "mm/slab: Improved sanity checking":
mm/slab: validate cache membership under freelist hardening
mm/slab: sanity-check page type when looking up cache
lkdtm/heap: add tests for freelist hardening
mm:slub:
mm/slub.c: avoid double string traverse in kmem_cache_flags()
slub: don't panic for memcg kmem cache creation failure
mm:kmemleak:
mm/kmemleak.c: fix check for softirq context
mm/kmemleak.c: change error at _write when kmemleak is disabled
docs: kmemleak: add more documentation details
mm:kasan:
mm/kasan: print frame description for stack bugs
Patch series "Bitops instrumentation for KASAN", v5:
lib/test_kasan: add bitops tests
x86: use static_cpu_has in uaccess region to avoid instrumentation
asm-generic, x86: add bitops instrumentation for KASAN
Patch series "mm/kasan: Add object validation in ksize()", v3:
mm/kasan: introduce __kasan_check_{read,write}
mm/kasan: change kasan_check_{read,write} to return boolean
lib/test_kasan: Add test for double-kzfree detection
mm/slab: refactor common ksize KASAN logic into slab_common.c
mm/kasan: add object validation in ksize()
mm:cleanups:
include/linux/pfn_t.h: remove pfn_t_to_virt()
Patch series "remove ARCH_SELECT_MEMORY_MODEL where it has no effect":
arm: remove ARCH_SELECT_MEMORY_MODEL
s390: remove ARCH_SELECT_MEMORY_MODEL
sparc: remove ARCH_SELECT_MEMORY_MODEL
mm/gup.c: make follow_page_mask() static
mm/memory.c: trivial clean up in insert_page()
mm: make !CONFIG_HUGE_PAGE wrappers into static inlines
include/linux/mm_types.h: ifdef struct vm_area_struct::swap_readahead_info
mm: remove the account_page_dirtied export
mm/page_isolation.c: change the prototype of undo_isolate_page_range()
include/linux/vmpressure.h: use spinlock_t instead of struct spinlock
mm: remove the exporting of totalram_pages
include/linux/pagemap.h: document trylock_page() return value
mm:debug:
mm/failslab.c: by default, do not fail allocations with direct reclaim only
Patch series "debug_pagealloc improvements":
mm, debug_pagelloc: use static keys to enable debugging
mm, page_alloc: more extensive free page checking with debug_pagealloc
mm, debug_pagealloc: use a page type instead of page_ext flag
mm:pagecache:
Patch series "fix filler_t callback type mismatches", v2:
mm/filemap.c: fix an overly long line in read_cache_page
mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page
jffs2: pass the correct prototype to read_cache_page
9p: pass the correct prototype to read_cache_page
mm/filemap.c: correct the comment about VM_FAULT_RETRY
mm:swap:
mm, swap: fix race between swapoff and some swap operations
mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device()
mm, swap: use rbtree for swap_extent
mm/mincore.c: fix race between swapoff and mincore
mm:memcg:
memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL
memcg, fsnotify: no oom-kill for remote memcg charging
mm, memcg: introduce memory.events.local
mm: memcontrol: dump memory.stat during cgroup OOM
Patch series "mm: reparent slab memory on cgroup removal", v7:
mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
mm: memcg/slab: rename slab delayed deactivation functions and fields
mm: memcg/slab: generalize postponed non-root kmem_cache deactivation
mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()
mm: memcg/slab: unify SLAB and SLUB page accounting
mm: memcg/slab: don't check the dying flag on kmem_cache creation
mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock
mm: memcg/slab: rework non-root kmem_cache lifecycle management
mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages
mm: memcg/slab: reparent memcg kmem_caches on cgroup removal
mm, memcg: add a memcg_slabinfo debugfs file
mm:gup:
Patch series "switch the remaining architectures to use generic GUP", v4:
mm: use untagged_addr() for get_user_pages_fast addresses
mm: simplify gup_fast_permitted
mm: lift the x86_32 PAE version of gup_get_pte to common code
MIPS: use the generic get_user_pages_fast code
sh: add the missing pud_page definition
sh: use the generic get_user_pages_fast code
sparc64: add the missing pgd_page definition
sparc64: define untagged_addr()
sparc64: use the generic get_user_pages_fast code
mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP
mm: reorder code blocks in gup.c
mm: consolidate the get_user_pages* implementations
mm: validate get_user_pages_fast flags
mm: move the powerpc hugepd code to mm/gup.c
mm: switch gup_hugepte to use try_get_compound_head
mm: mark the page referenced in gup_hugepte
mm/gup: speed up check_and_migrate_cma_pages() on huge page
mm/gup.c: remove some BUG_ONs from get_gate_page()
mm/gup.c: mark undo_dev_pagemap as __maybe_unused
mm:pagemap:
asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel]
alpha: switch to generic version of pte allocation
arm: switch to generic version of pte allocation
arm64: switch to generic version of pte allocation
csky: switch to generic version of pte allocation
m68k: sun3: switch to generic version of pte allocation
mips: switch to generic version of pte allocation
nds32: switch to generic version of pte allocation
nios2: switch to generic version of pte allocation
parisc: switch to generic version of pte allocation
riscv: switch to generic version of pte allocation
um: switch to generic version of pte allocation
unicore32: switch to generic version of pte allocation
mm/pgtable: drop pgtable_t variable from pte_fn_t functions
mm/memory.c: fail when offset == num in first check of __vm_map_pages()
mm:infrastructure:
mm/mmu_notifier: use hlist_add_head_rcu()
mm:vmalloc:
Patch series "Some cleanups for the KVA/vmalloc", v5:
mm/vmalloc.c: remove "node" argument
mm/vmalloc.c: preload a CPU with one object for split purpose
mm/vmalloc.c: get rid of one single unlink_va() when merge
mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
mm/vmalloc.c: spelling> s/informaion/information/
mm:initialization:
mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
mm/large system hash: clear hashdist when only one node with memory is booted
mm:pagealloc:
arm64: move jump_label_init() before parse_early_param()
Patch series "add init_on_alloc/init_on_free boot options", v10:
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
mm: init: report memory auto-initialization features at boot time
mm:vmscan:
mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
mm: vmscan: correct some vmscan counters for THP swapout
mm:tools:
tools/vm/slabinfo: order command line options
tools/vm/slabinfo: add partial slab listing to -X
tools/vm/slabinfo: add option to sort by partial slabs
tools/vm/slabinfo: add sorting info to help menu
mm:proc:
proc: use down_read_killable mmap_sem for /proc/pid/maps
proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
proc: use down_read_killable mmap_sem for /proc/pid/pagemap
proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
proc: use down_read_killable mmap_sem for /proc/pid/map_files
mm: use down_read_killable for locking mmap_sem in access_remote_vm
mm: smaps: split PSS into components
mm: vmalloc: show number of vmalloc pages in /proc/meminfo
mm:ras:
mm/memory-failure.c: clarify error message
mm:oom-kill:
mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
mm, oom: refactor dump_tasks for memcg OOMs
mm, oom: remove redundant task_in_mem_cgroup() check
oom: decouple mems_allowed from oom_unkillable_task
mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()"
* akpm: (147 commits)
mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()
oom: decouple mems_allowed from oom_unkillable_task
mm, oom: remove redundant task_in_mem_cgroup() check
mm, oom: refactor dump_tasks for memcg OOMs
mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
mm/memory-failure.c: clarify error message
mm: vmalloc: show number of vmalloc pages in /proc/meminfo
mm: smaps: split PSS into components
mm: use down_read_killable for locking mmap_sem in access_remote_vm
proc: use down_read_killable mmap_sem for /proc/pid/map_files
proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
proc: use down_read_killable mmap_sem for /proc/pid/pagemap
proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
proc: use down_read_killable mmap_sem for /proc/pid/maps
tools/vm/slabinfo: add sorting info to help menu
tools/vm/slabinfo: add option to sort by partial slabs
tools/vm/slabinfo: add partial slab listing to -X
tools/vm/slabinfo: order command line options
mm: vmscan: correct some vmscan counters for THP swapout
mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
...
Since commit bbbe480297 ("mm, oom: remove 'prefer children over
parent' heuristic") removed the
"%s: Kill process %d (%s) score %u or sacrifice child\n"
line, oc->chosen_points is no longer used after select_bad_process().
Link: http://lkml.kernel.org/r/1560853435-15575-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom_unkillable_task() can be called from three different contexts i.e.
global OOM, memcg OOM and oom_score procfs interface. At the moment
oom_unkillable_task() does a task_in_mem_cgroup() check on the given
process. Since there is no reason to perform task_in_mem_cgroup()
check for global OOM and oom_score procfs interface, those contexts
provide NULL memcg and skips the task_in_mem_cgroup() check. However
for memcg OOM context, the oom_unkillable_task() is always called from
mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
redundant and effectively dead code. So, just remove the
task_in_mem_cgroup() check altogether.
Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_tasks() traverses all the existing processes even for the memcg OOM
context which is not only unnecessary but also wasteful. This imposes a
long RCU critical section even from a contained context which can be quite
disruptive.
Change dump_tasks() to be aligned with select_bad_process and use
mem_cgroup_scan_tasks to selectively traverse only processes of the target
memcg hierarchy during memcg OOM.
Link: http://lkml.kernel.org/r/20190617231207.160865-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit c03cd7738a ("cgroup: Include dying leaders with live
threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
only one thread from each thread group.
[penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some user who install SIGBUS handler that does longjmp out therefore
keeping the process alive is confused by the error message
"[188988.765862] Memory failure: 0x1840200: Killing cellsrv:33395 due to hardware memory corruption"
Slightly modify the error message to improve clarity.
Link: http://lkml.kernel.org/r/1558403523-22079-1-git-send-email-jane.chu@oracle.com
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vmalloc() is getting more and more used these days (kernel stacks, bpf and
percpu allocator are new top users), and the total % of memory consumed by
vmalloc() can be pretty significant and changes dynamically.
/proc/meminfo is the best place to display this information: its top goal
is to show top consumers of the memory.
Since the VmallocUsed field in /proc/meminfo is not in use for quite a
long time (it has been defined to 0 by a5ad88ce8c ("mm: get rid of
'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the
actual physical memory consumption of vmalloc().
Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function is used by ptrace and proc files like /proc/pid/cmdline and
/proc/pid/environ.
Access_remote_vm never returns error codes, all errors are ignored and
only size of successfully read data is returned. So, if current task was
killed we'll simply return 0 (bytes read).
Mmap_sem could be locked for a long time or forever if something goes
wrong. Using a killable lock permits cleanup of stuck tasks and
simplifies investigation.
Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit bd4c82c22c ("mm, THP, swap: delay splitting THP after swapped
out"), THP can be swapped out in a whole. But, nr_reclaimed and some
other vm counters still get inc'ed by one even though a whole THP (512
pages) gets swapped out.
This doesn't make too much sense to memory reclaim.
For example, direct reclaim may just need reclaim SWAP_CLUSTER_MAX
pages, reclaiming one THP could fulfill it. But, if nr_reclaimed is not
increased correctly, direct reclaim may just waste time to reclaim more
pages, SWAP_CLUSTER_MAX * 512 pages in worst case.
And, it may cause pgsteal_{kswapd|direct} is greater than
pgscan_{kswapd|direct}, like the below:
pgsteal_kswapd 122933
pgsteal_direct 26600225
pgscan_kswapd 174153
pgscan_direct 14678312
nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would
break some page reclaim logic, e.g.
vmpressure: this looks at the scanned/reclaimed ratio so it won't change
semantics as long as scanned & reclaimed are fixed in parallel.
compaction/reclaim: compaction wants a certain number of physical pages
freed up before going back to compacting.
kswapd priority raising: kswapd raises priority if we scan fewer pages
than the reclaim target (which itself is obviously expressed in order-0
pages). As a result, kswapd can falsely raise its aggressiveness even
when it's making great progress.
Other than nr_scanned and nr_reclaimed, some other counters, e.g.
pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed too
since they are user visible via cgroup, /proc/vmstat or trace points,
otherwise they would be underreported.
When isolating pages from LRUs, nr_taken has been accounted in base page,
but nr_scanned and nr_skipped are still accounted in THP. It doesn't make
too much sense too since this may cause trace point underreport the
numbers as well.
So accounting those counters in base page instead of accounting THP as one
page.
nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by
file cache, so they are not impacted by THP swap.
This change may result in lower steal/scan ratio in some cases since THP
may get split during page reclaim, then a part of tail pages get reclaimed
instead of the whole 512 pages, but nr_scanned is accounted by 512,
particularly for direct reclaim. But, this should be not a significant
issue.
Link: http://lkml.kernel.org/r/1559025859-72759-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9092c71bb7 ("mm: use sc->priority for slab shrink targets") has
broken up the relationship between sc->nr_scanned and slab pressure.
The sc->nr_scanned can't double slab pressure anymore. So, it sounds no
sense to still keep sc->nr_scanned inc'ed. Actually, it would prevent
from adding pressure on slab shrink since excessive sc->nr_scanned would
prevent from scan->priority raise.
The bonnie test doesn't show this would change the behavior of slab
shrinkers.
w/ w/o
/sec %CP /sec %CP
Sequential delete: 3960.6 94.6 3997.6 96.2
Random delete: 2518 63.8 2561.6 64.6
The slight increase of "/sec" without the patch would be caused by the
slight increase of CPU usage.
Link: http://lkml.kernel.org/r/1559025859-72759-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "add init_on_alloc/init_on_free boot options", v10.
Provide init_on_alloc and init_on_free boot options.
These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.
Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.
Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.
As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.
This patch (of 2):
The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.
This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)
init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.
init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.
Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.
Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.
If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.
Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.
The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.
Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.
[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even
when booting on single node machines. This causes the large system hashes
to be allocated with vmalloc, and mapped with small pages.
This change clears hashdist if only one node has come up with memory.
This results in the important large inode and dentry hashes using memblock
allocations. All others are within 4MB size up to about 128GB of RAM,
which allows them to be allocated from the linear map on most non-NUMA
images.
Other big hashes like futex and TCP should eventually be moved over to the
same style of allocation as those vfs caches that use HASH_EARLY if
!hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.
This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to
~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
(performance is in the noise, under 1% difference, page tables are likely
to be well cached for this workload).
Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel currently clamps large system hashes to MAX_ORDER when hashdist
is not set, which is rather arbitrary.
vmalloc space is limited on 32-bit machines, but this shouldn't result in
much more used because of small physical memory limiting system hash
sizes.
Include "vmalloc" or "linear" in the kernel log message.
Link: http://lkml.kernel.org/r/20190605144814.29319-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trigger a warning if an object that is about to be freed is detached. We
used to have a BUG_ON(), but even though it is considered as faulty
behaviour that is not a good reason to break a system.
Link: http://lkml.kernel.org/r/20190606120411.8298-5-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It does not make sense to try to "unlink" the node that is definitely not
linked with a list nor tree. On the first merge step VA just points to
the previously disconnected busy area.
On the second step, check if the node has been merged and do "unlink" if
so, because now it points to an object that must be linked.
Link: http://lkml.kernel.org/r/20190606120411.8298-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Hillf Danton <hdanton@sina.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make mmu_notifier_register() safer by issuing a memory barrier before
registering a new notifier. This fixes a theoretical bug on weakly
ordered CPUs. For example, take this simplified use of notifiers by a
driver:
my_struct->mn.ops = &my_ops; /* (1) */
mmu_notifier_register(&my_struct->mn, mm)
...
hlist_add_head(&mn->hlist, &mm->mmu_notifiers); /* (2) */
...
Once mmu_notifier_register() releases the mm locks, another thread can
invalidate a range:
mmu_notifier_invalidate_range()
...
hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
if (mn->ops->invalidate_range)
The read side relies on the data dependency between mn and ops to ensure
that the pointer is properly initialized. But the write side doesn't have
any dependency between (1) and (2), so they could be reordered and the
readers could dereference an invalid mn->ops. mmu_notifier_register()
does take all the mm locks before adding to the hlist, but those have
acquire semantics which isn't sufficient.
By calling hlist_add_head_rcu() instead of hlist_add_head() we update the
hlist using a store-release, ensuring that readers see prior
initialization of my_struct. This situation is better illustated by
litmus test MP+onceassign+derefonce.
Link: http://lkml.kernel.org/r/20190502133532.24981-1-jean-philippe.brucker@arm.com
Fixes: cddb8a5c14 ("mmu-notifiers: core")
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the caller asks us for offset == num, we should already fail in the
first check, i.e. the one testing for offsets beyond the object.
At the moment, we are failing on the second test anyway, since count
cannot be 0. Still, to agree with the comment of the first test, we
should first test it there.
Link: http://lkml.kernel.org/r/20190528193004.GA7744@gmail.com
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drop the pgtable_t variable from all implementation for pte_fn_t as none
of them use it. apply_to_pte_range() should stop computing it as well.
Should help us save some cycles.
Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several mips builds generate the following build warning.
mm/gup.c:1788:13: warning: 'undo_dev_pagemap' defined but not used
The function is declared unconditionally but only called from behind
various ifdefs. Mark it __maybe_unused.
Link: http://lkml.kernel.org/r/1562072523-22311-1-git-send-email-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we end up without a PGD or PUD entry backing the gate area, don't BUG
-- just fail gracefully.
It's not entirely implausible that this could happen some day on x86. It
doesn't right now even with an execute-only emulated vsyscall page because
the fixmap shares the PUD, but the core mm code shouldn't rely on that
particular detail to avoid OOPSing.
Link: http://lkml.kernel.org/r/a1d9f4efb75b9d464e59fd6af00104b21c58f6f7.1561610798.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both hugetlb and thp locate on the same migration type of pageblock, since
they are allocated from a free_list[]. Based on this fact, it is enough
to check on a single subpage to decide the migration type of the whole
huge page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on
x86, similar on other archs.
Furthermore, when executing isolate_huge_page(), it avoid taking global
hugetlb_lock many times, and meanless remove/add to the local link list
cma_page_list.
[akpm@linux-foundation.org: make `i' and `step' unsigned]
Link: http://lkml.kernel.org/r/1561612545-28997-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All other get_user_page_fast cases mark the page referenced, so do this
here as well.
Link: http://lkml.kernel.org/r/20190625143715.1689-17-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This applies the overflow fixes from 8fde12ca79 ("mm: prevent
get_user_pages() from overflowing page refcount") to the powerpc hugepd
code and brings it back in sync with the other GUP cases.
Link: http://lkml.kernel.org/r/20190625143715.1689-16-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While only powerpc supports the hugepd case, the code is pretty generic
and I'd like to keep all GUP internals in one place.
Link: http://lkml.kernel.org/r/20190625143715.1689-15-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can only deal with FOLL_WRITE and/or FOLL_LONGTERM in
get_user_pages_fast, so reject all other flags.
Link: http://lkml.kernel.org/r/20190625143715.1689-14-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always build mm/gup.c so that we don't have to provide separate nommu
stubs. Also merge the get_user_pages_fast and __get_user_pages_fast stubs
when HAVE_FAST_GUP into the main implementations, which will never call
the fast path if HAVE_FAST_GUP is not set.
This also ensures the new put_user_pages* helpers are available for nommu,
as those are currently missing, which would create a problem as soon as we
actually grew users for it.
Link: http://lkml.kernel.org/r/20190625143715.1689-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the actually exported functions towards the end of the file,
and reorders some functions to be in more logical blocks as a preparation
for moving various stubs inline into the main functionality using
IS_ENABLED().
Link: http://lkml.kernel.org/r/20190625143715.1689-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We only support the generic GUP now, so rename the config option to
be more clear, and always use the mm/Kconfig definition of the
symbol and select it from the arch Kconfigs.
Link: http://lkml.kernel.org/r/20190625143715.1689-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The split low/high access is the only non-READ_ONCE version of gup_get_pte
that did show up in the various arch implemenations. Lift it to common
code and drop the ifdef based arch override.
Link: http://lkml.kernel.org/r/20190625143715.1689-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass in the already calculated end value instead of recomputing it, and
leave the end > start check in the callers instead of duplicating them in
the arch code.
Link: http://lkml.kernel.org/r/20190625143715.1689-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: James Hogan <jhogan@kernel.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "switch the remaining architectures to use generic GUP", v4.
A series to switch mips, sh and sparc64 to use the generic GUP code so
that we only have one codebase to touch for further improvements to this
code.
This patch (of 16):
This will allow sparc64, or any future architecture with memory tagging to
override its tags for get_user_pages and get_user_pages_fast.
Link: http://lkml.kernel.org/r/20190625143715.1689-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: David Miller <davem@davemloft.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are concerns about memory leaks from extensive use of memory cgroups
as each memory cgroup creates its own set of kmem caches. There is a
possiblity that the memcg kmem caches may remain even after the memory
cgroups have been offlined. Therefore, it will be useful to show the
status of each of memcg kmem caches.
This patch introduces a new <debugfs>/memcg_slabinfo file which is
somewhat similar to /proc/slabinfo in format, but lists only information
about kmem caches that have child memcg kmem caches. Information
available in /proc/slabinfo are not repeated in memcg_slabinfo.
A portion of a sample output of the file was:
# <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
rpc_inode_cache root 13 51 1 1
rpc_inode_cache 48 0 0 0 0
fat_inode_cache root 1 45 1 1
fat_inode_cache 41 2 45 1 1
xfs_inode root 770 816 24 24
xfs_inode 92 22 34 1 1
xfs_inode 88:dead 1 34 1 1
xfs_inode 89:dead 23 34 1 1
xfs_inode 85 4 34 1 1
xfs_inode 84 9 34 1 1
The css id of the memcg is also listed. If a memcg is not online,
the tag ":dead" will be attached as shown above.
[longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo]
Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com
[longman@redhat.com: set the flag in the common code as suggested by Roman]
Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com
Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's reparent non-root kmem_caches on memcg offlining. This allows us to
release the memory cgroup without waiting for the last outstanding kernel
object (e.g. dentry used by another application).
Since the parent cgroup is already charged, everything we need to do is to
splice the list of kmem_caches to the parent's kmem_caches list, swap the
memcg pointer, drop the css refcounter for each kmem_cache and adjust the
parent's css refcounter.
Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
anymore. It's safe to read it under rcu_read_lock(), cgroup_mutex held,
or any other way that protects the memory cgroup from being released.
We can race with the slab allocation and deallocation paths. It's not a
big problem: parent's charge and slab global stats are always correct, and
we don't care anymore about the child usage and global stats. The child
cgroup is already offline, so we don't use or show it anywhere.
Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
used anywhere except count_shadow_nodes(). But even there it won't break
anything: after reparenting "nodes" will be 0 on child level (because
we're already reparenting shrinker lists), and on parent level page stats
always were 0, and this patch won't change anything.
[guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every slab page charged to a non-root memory cgroup has a pointer to the
memory cgroup and holds a reference to it, which protects a non-empty
memory cgroup from being released. At the same time the page has a
pointer to the corresponding kmem_cache, and also hold a reference to the
kmem_cache. And kmem_cache by itself holds a reference to the cgroup.
So there is clearly some redundancy, which allows to stop setting the
page->mem_cgroup pointer and rely on getting memcg pointer indirectly via
kmem_cache. Further it will allow to change this pointer easier, without
a need to go over all charged pages.
So let's stop setting page->mem_cgroup pointer for slab pages, and stop
using the css refcounter directly for protecting the memory cgroup from
going away. Instead rely on kmem_cache as an intermediate object.
Make sure that vmstats and shrinker lists are working as previously, as
well as /proc/kpagecgroup interface.
Link: http://lkml.kernel.org/r/20190611231813.3148843-10-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently each charged slab page holds a reference to the cgroup to which
it's charged. Kmem_caches are held by the memcg and are released all
together with the memory cgroup. It means that none of kmem_caches are
released unless at least one reference to the memcg exists, which is very
far from optimal.
Let's rework it in a way that allows releasing individual kmem_caches as
soon as the cgroup is offline, the kmem_cache is empty and there are no
pending allocations.
To make it possible, let's introduce a new percpu refcounter for non-root
kmem caches. The counter is initialized to the percpu mode, and is
switched to the atomic mode during kmem_cache deactivation. The counter
is bumped for every charged page and also for every running allocation.
So the kmem_cache can't be released unless all allocations complete.
To shutdown non-active empty kmem_caches, let's reuse the work queue,
previously used for the kmem_cache deactivation. Once the reference
counter reaches 0, let's schedule an asynchronous kmem_cache release.
* I used the following simple approach to test the performance
(stolen from another patchset by T. Harding):
time find / -name fname-no-exist
echo 2 > /proc/sys/vm/drop_caches
repeat 10 times
Results:
orig patched
real 0m1.455s real 0m1.355s
user 0m0.206s user 0m0.219s
sys 0m0.855s sys 0m0.807s
real 0m1.487s real 0m1.699s
user 0m0.221s user 0m0.256s
sys 0m0.806s sys 0m0.948s
real 0m1.515s real 0m1.505s
user 0m0.183s user 0m0.215s
sys 0m0.876s sys 0m0.858s
real 0m1.291s real 0m1.380s
user 0m0.193s user 0m0.198s
sys 0m0.843s sys 0m0.786s
real 0m1.364s real 0m1.374s
user 0m0.180s user 0m0.182s
sys 0m0.868s sys 0m0.806s
real 0m1.352s real 0m1.312s
user 0m0.201s user 0m0.212s
sys 0m0.820s sys 0m0.761s
real 0m1.302s real 0m1.349s
user 0m0.205s user 0m0.203s
sys 0m0.803s sys 0m0.792s
real 0m1.334s real 0m1.301s
user 0m0.194s user 0m0.201s
sys 0m0.806s sys 0m0.779s
real 0m1.426s real 0m1.434s
user 0m0.216s user 0m0.181s
sys 0m0.824s sys 0m0.864s
real 0m1.350s real 0m1.295s
user 0m0.200s user 0m0.190s
sys 0m0.842s sys 0m0.811s
So it looks like the difference is not noticeable in this test.
[cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the memcg_params.dying flag and the corresponding workqueue used
for the asynchronous deactivation of kmem_caches is synchronized using the
slab_mutex.
It makes impossible to check this flag from the irq context, which will be
required in order to implement asynchronous release of kmem_caches.
So let's switch over to the irq-save flavor of the spinlock-based
synchronization.
Link: http://lkml.kernel.org/r/20190611231813.3148843-8-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no point in checking the root_cache->memcg_params.dying flag on
kmem_cache creation path. New allocations shouldn't be performed using a
dead root kmem_cache, so no new memcg kmem_cache creation can be scheduled
after the flag is set. And if it was scheduled before,
flush_memcg_workqueue() will wait for it anyway.
So let's drop this check to simplify the code.
Link: http://lkml.kernel.org/r/20190611231813.3148843-7-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the page accounting code is duplicated in SLAB and SLUB
internals. Let's move it into new (un)charge_slab_page helpers in the
slab_common.c file. These helpers will be responsible for statistics
(global and memcg-aware) and memcg charging. So they are replacing direct
memcg_(un)charge_slab() calls.
Link: http://lkml.kernel.org/r/20190611231813.3148843-6-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Let's separate the page counter modification code out of
__memcg_kmem_uncharge() in a way similar to what
__memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.
This will allow to reuse this code later using a new
memcg_kmem_uncharge_memcg() wrapper, which calls
__memcg_kmem_uncharge_memcg() if memcg_kmem_enabled()
check is passed.
Link: http://lkml.kernel.org/r/20190611231813.3148843-5-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently SLUB uses a work scheduled after an RCU grace period to
deactivate a non-root kmem_cache. This mechanism can be reused for
kmem_caches release, but requires generalization for SLAB case.
Introduce kmemcg_cache_deactivate() function, which calls
allocator-specific __kmem_cache_deactivate() and schedules execution of
__kmem_cache_deactivate_after_rcu() with all necessary locks in a worker
context after an rcu grace period.
Here is the new calling scheme:
kmemcg_cache_deactivate()
__kmemcg_cache_deactivate() SLAB/SLUB-specific
kmemcg_rcufn() rcu
kmemcg_workfn() work
__kmemcg_cache_deactivate_after_rcu() SLAB/SLUB-specific
instead of:
__kmemcg_cache_deactivate() SLAB/SLUB-specific
slab_deactivate_memcg_cache_rcu_sched() SLUB-only
kmemcg_rcufn() rcu
kmemcg_workfn() work
kmemcg_cache_deact_after_rcu() SLUB-only
For consistency, all allocator-specific functions start with "__".
Link: http://lkml.kernel.org/r/20190611231813.3148843-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The delayed work/rcu deactivation infrastructure of non-root kmem_caches
can be also used for asynchronous release of these objects. Let's get rid
of the word "deactivation" in corresponding names to make the code look
better after generalization.
It's easier to make the renaming first, so that the generalized code will
look consistent from scratch.
Let's rename struct memcg_cache_params fields:
deact_fn -> work_fn
deact_rcu_head -> rcu_head
deact_work -> work
And RCU/delayed work callbacks in slab common code:
kmemcg_deactivate_rcufn -> kmemcg_rcufn
kmemcg_deactivate_workfn -> kmemcg_workfn
This patch contains no functional changes, only renamings.
Link: http://lkml.kernel.org/r/20190611231813.3148843-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Waiman Long <longman@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: reparent slab memory on cgroup removal", v7.
# Why do we need this?
We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production. The following investigation revealed an issue
in the userspace memory reclaim code [1], accounting of kernel stacks [2],
and also the main reason: slab objects.
The underlying problem is quite simple: any page charged to a cgroup holds
a reference to it, so the cgroup can't be reclaimed unless all charged
pages are gone. If a slab object is actively used by other cgroups, it
won't be reclaimed, and will prevent the origin cgroup from being
reclaimed.
Slab objects, and first of all vfs cache, is shared between cgroups, which
are using the same underlying fs, and what's even more important, it's
shared between multiple generations of the same workload. So if something
is running periodically every time in a new cgroup (like how systemd
works), we do accumulate multiple dying cgroups.
Strictly speaking pagecache isn't different here, but there is a key
difference: we disable protection and apply some extra pressure on LRUs of
dying cgroups, and these LRUs contain all charged pages. My experiments
show that with the disabled kernel memory accounting the number of dying
cgroups stabilizes at a relatively small number (~100, depends on memory
pressure and cgroup creation rate), and with kernel memory accounting it
grows pretty steadily up to several thousands.
Memory cgroups are quite complex and big objects (mostly due to percpu
stats), so it leads to noticeable memory losses. Memory occupied by dying
cgroups is measured in hundreds of megabytes. I've even seen a host with
more than 100Gb of memory wasted for dying cgroups. It leads to a
degradation of performance with the uptime, and generally limits the usage
of cgroups.
My previous attempt [3] to fix the problem by applying extra pressure on
slab shrinker lists caused a regressions with xfs and ext4, and has been
reverted [4]. The following attempts to find the right balance [5, 6]
were not successful.
So instead of trying to find a maybe non-existing balance, let's do
reparent accounted slab caches to the parent cgroup on cgroup removal.
# Implementation approach
There is however a significant problem with reparenting of slab memory:
there is no list of charged pages. Some of them are in shrinker lists,
but not all. Introducing of a new list is really not an option.
But fortunately there is a way forward: every slab page has a stable
pointer to the corresponding kmem_cache. So the idea is to reparent
kmem_caches instead of slab pages.
It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches to hold a single reference to the memory cgroup,
instead of a separate reference per every slab page.
2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
page->kmem_cache->memcg indirection instead. It's used only on
slab page release, so performance overhead shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
be able to destroy kmem_caches when they become empty and release
the associated memory cgroup.
There is a bonus: currently we release all memcg kmem_caches all together
with the memory cgroup itself. This patchset allows individual
kmem_caches to be released as soon as they become inactive and free.
Some additional implementation details are provided in corresponding
commit messages.
# Results
Below is the average number of dying cgroups on two groups of our
production hosts. They do run some sort of web frontend workload, the
memory pressure is moderate. As we can see, with the kernel memory
reparenting the number stabilizes in 60s range; however with the original
version it grows almost linearly and doesn't show any signs of plateauing.
The difference in slab and percpu usage between patched and unpatched
versions also grows linearly. In 7 days it exceeded 200Mb.
day 0 1 2 3 4 5 6 7
original 56 362 628 752 1070 1250 1490 1560
patched 23 46 51 55 60 57 67 69
mem diff(Mb) 22 74 123 152 164 182 214 241
# Links
[1]: commit 68600f623d ("mm: don't miss the last page because of round-off error")
[2]: commit 9b6f7e163c ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b ("mm: slowly shrink slabs with a relatively small number of objects")
[4]: commit a9a238e83f ("Revert "mm: slowly shrink slabs with a relatively small number of objects")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
This patch (of 10):
Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache()
rather than in init_memcg_params().
Once kmem_cache will hold a reference to the memory cgroup, it will
simplify the refcounting.
For non-root kmem_caches memcg_link_cache() is always called before the
kmem_cache becomes visible to a user, so it's safe.
Link: http://lkml.kernel.org/r/20190611231813.3148843-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory controller in cgroup v2 exposes memory.events file for each
memcg which shows the number of times events like low, high, max, oom
and oom_kill have happened for the whole tree rooted at that memcg.
Users can also poll or register notification to monitor the changes in
that file. Any event at any level of the tree rooted at memcg will
notify all the listeners along the path till root_mem_cgroup. There are
existing users which depend on this behavior.
However there are users which are only interested in the events
happening at a specific level of the memcg tree and not in the events in
the underlying tree rooted at that memcg. One such use-case is a
centralized resource monitor which can dynamically adjust the limits of
the jobs running on a system. The jobs can create their sub-hierarchy
for their own sub-tasks. The centralized monitor is only interested in
the events at the top level memcgs of the jobs as it can then act and
adjust the limits of the jobs. Using the current memory.events for such
centralized monitor is very inconvenient. The monitor will keep
receiving events which it is not interested and to find if the received
event is interesting, it has to read memory.event files of the next
level and compare it with the top level one. So, let's introduce
memory.events.local to the memcg which shows and notify for the events
at the memcg level.
Now, does memory.stat and memory.pressure need their local versions. IMHO
no due to the no internal process contraint of the cgroup v2. The
memory.stat file of the top level memcg of a job shows the stats and
vmevents of the whole tree. The local stats or vmevents of the top level
memcg will only change if there is a process running in that memcg but v2
does not allow that. Similarly for memory.pressure there will not be any
process in the internal nodes and thus no chance of local pressure.
Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The documentation of __GFP_RETRY_MAYFAIL clearly mentioned that the OOM
killer will not be triggered and indeed the page alloc does not invoke OOM
killer for such allocations. However we do trigger memcg OOM killer for
__GFP_RETRY_MAYFAIL. Fix that. This flag will used later to not trigger
oom-killer in the charging path for fanotify and inotify event
allocations.
Link: http://lkml.kernel.org/r/20190514212259.156585-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Via commit 4b3ef9daa4 ("mm/swap: split swap cache into 64MB trunks"),
after swapoff, the address_space associated with the swap device will be
freed. So swap_address_space() users which touch the address_space need
some kind of mechanism to prevent the address_space from being freed
during accessing.
When mincore processes an unmapped range for swapped shmem pages, it
doesn't hold the lock to prevent swap device from being swapped off. So
the following race is possible:
CPU1 CPU2
do_mincore() swapoff()
walk_page_range()
mincore_unmapped_range()
__mincore_unmapped_range
mincore_page
as = swap_address_space()
... exit_swap_address_space()
... kvfree(spaces)
find_get_page(as)
The address space may be accessed after being freed.
To fix the race, get_swap_device()/put_swap_device() is used to enclose
find_get_page() to check whether the swap entry is valid and prevent the
swap device from being swapoff during accessing.
Link: http://lkml.kernel.org/r/20190611020510.28251-1-ying.huang@intel.com
Fixes: 4b3ef9daa4 ("mm/swap: split swap cache into 64MB trunks")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swap_extent is used to map swap page offset to backing device's block
offset. For a continuous block range, one swap_extent is used and all
these swap_extents are managed in a linked list.
These swap_extents are used by map_swap_entry() during swap's read and
write path. To find out the backing device's block offset for a page
offset, the swap_extent list will be traversed linearly, with
curr_swap_extent being used as a cache to speed up the search.
This works well as long as swap_extents are not huge or when the number
of processes that access swap device are few, but when the swap device
has many extents and there are a number of processes accessing the swap
device concurrently, it can be a problem. On one of our servers, the
disk's remaining size is tight:
$df -h
Filesystem Size Used Avail Use% Mounted on
... ...
/dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4
When creating a 80G swapfile there, there are as many as 84656 swap
extents. The end result is, kernel spends abou 30% time in
map_swap_entry() and swap throughput is only 70MB/s.
As a comparison, when I used smaller sized swapfile, like 4G whose
swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and
map_swap_entry() is about 3%.
One downside of using rbtree for swap_extent is, 'struct rbtree' takes
24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more
for each swap_extent. For a swapfile that has 80k swap_extents, that
means 625KiB more memory consumed.
Test:
Since it's not possible to reboot that server, I can not test this patch
diretly there. Instead, I tested it on another server with NVMe disk.
I created a 20G swapfile on an NVMe backed XFS fs. By default, the
filesystem is quite clean and the created swapfile has only 2 extents.
Testing vanilla and this patch shows no obvious performance difference
when swapfile is not fragmented.
To see the patch's effects, I used some tweaks to manually fragment the
swapfile by breaking the extent at 1M boundary. This made the swapfile
have 20K extents.
nr_task=4
kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf)
vanilla 165191 90.77% 171798 90.21%
patched 858993 +420% 2.16% 715827 +317% 0.77%
nr_task=8
kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf)
vanilla 306783 92.19% 318145 87.76%
patched 954437 +211% 2.35% 1073741 +237% 1.57%
swapout: the throughput of swap out, in KB/s, higher is better 1st
map_swap_entry: cpu cycles percent sampled by perf swapin: the
throughput of swap in, in KB/s, higher is better. 2nd map_swap_entry:
cpu cycles percent sampled by perf
nr_task=1 doesn't show any difference, this is due to the curr_swap_extent
can be effectively used to cache the correct swap extent for single task
workload.
[akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
total_swapcache_pages() may race with swapper_spaces[] allocation and
freeing. Previously, this is protected with a swapper_spaces[] specific
RCU mechanism. To simplify the logic/code complexity, it is replaced with
get/put_swap_device(). The code line number is reduced too. Although not
so important, the swapoff() performance improves too because one
synchronize_rcu() call during swapoff() is deleted.
[ying.huang@intel.com: fix bad swap file entry warning]
Link: http://lkml.kernel.org/r/20190531024102.21723-1-ying.huang@intel.com
Link: http://lkml.kernel.org/r/20190527082714.12151-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When swapin is performed, after getting the swap entry information from
the page table, system will swap in the swap entry, without any lock held
to prevent the swap device from being swapoff. This may cause the race
like below,
CPU 1 CPU 2
----- -----
do_swap_page
swapin_readahead
__read_swap_cache_async
swapoff swapcache_prepare
p->swap_map = NULL __swap_duplicate
p->swap_map[?] /* !!! NULL pointer access */
Because swapoff is usually done when system shutdown only, the race may
not hit many people in practice. But it is still a race need to be fixed.
To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device. If so, it will keep the swap
entry valid via preventing the swap device from being swapoff, until
put_swap_device() is called.
Because swapoff() is very rare code path, to make the normal path runs as
fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of
reference count is used to implement get/put_swap_device(). >From
get_swap_device() to put_swap_device(), RCU reader side is locked, so
synchronize_rcu() in swapoff() will wait until put_swap_device() is
called.
In addition to swap_map, cluster_info, etc. data structure in the struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch fixes the race between swap cache looking up and swapoff
too.
Races between some other swap cache usages and swapoff are fixed too via
calling synchronize_rcu() between clearing PageSwapCache() and freeing
swap cache data structure.
Another possible method to fix this is to use preempt_off() +
stop_machine() to prevent the swap device from being swapoff when its data
structure is being accessed. The overhead in hot-path of both methods is
similar. The advantages of RCU based method are,
1. stop_machine() may disturb the normal execution code path on other
CPUs.
2. File cache uses RCU to protect its radix tree. If the similar
mechanism is used for swap cache too, it is easier to share code
between them.
3. RCU is used to protect swap cache in total_swapcache_pages() and
exit_swap_address_space() already. The two mechanisms can be
merged to simplify the logic.
Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
Fixes: 235b621767 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Not-nacked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 6b4c9f4469 ("filemap: drop the mmap_sem for all blocking
operations") changed when mmap_sem is dropped during filemap page fault
and when returning VM_FAULT_RETRY.
Correct the comment to reflect the change.
Link: http://lkml.kernel.org/r/1556234531-108228-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can just pass a NULL filler and do the right thing inside of
do_read_cache_page based on the NULL parameter.
Link: http://lkml.kernel.org/r/20190520055731.24538-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "fix filler_t callback type mismatches", v2.
Casting mapping->a_ops->readpage to filler_t causes an indirect call
type mismatch with Control-Flow Integrity checking. This change fixes
the mismatch in read_cache_page_gfp and read_mapping_page by adding
using a NULL filler argument as an indication to call ->readpage
directly, and by passing the right parameter callbacks in nfs and jffs2.
This patch (of 4):
Code cleanup.
Link: http://lkml.kernel.org/r/20190520055731.24538-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When debug_pagealloc is enabled, we currently allocate the page_ext
array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag. Now that
we have the page_type field in struct page, we can use that instead, as
guard pages are neither PageSlab nor mapped to userspace. This reduces
memory overhead when debug_pagealloc is enabled and there are no other
features requiring the page_ext array.
Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator checks struct pages for expected state (mapcount,
flags etc) as pages are being allocated (check_new_page()) and freed
(free_pages_check()) to provide some defense against errors in page
allocator users.
Prior commits 479f854a20 ("mm, page_alloc: defer debugging checks of
pages allocated from the PCP") and 4db7548ccb ("mm, page_alloc: defer
debugging checks of freed pages until a PCP drain") this has happened
for order-0 pages as they were allocated from or freed to the per-cpu
caches (pcplists). Since those are fast paths, the checks are now
performed only when pages are moved between pcplists and global free
lists. This however lowers the chances of catching errors soon enough.
In order to increase the chances of the checks to catch errors, the
kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables
multiple other internal debug checks (VM_BUG_ON() etc), which is
suboptimal when the goal is to catch errors in mm users, not in mm code
itself.
To catch some wrong users of the page allocator we have
CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead
unless enabled at boot time. Memory corruptions when writing to freed
pages have often the same underlying errors (use-after-free, double free)
as corrupting the corresponding struct pages, so this existing debugging
functionality is a good fit to extend by also perform struct page checks
at least as often as if CONFIG_DEBUG_VM was enabled.
Specifically, after this patch, when debug_pagealloc is enabled on boot,
and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or
freed to the pcplists *in addition* to being moved between pcplists and
free lists. When both debug_pagealloc and CONFIG_DEBUG_VM are enabled,
pages are checked when being moved between pcplists and free lists *in
addition* to when allocated from or freed to the pcplists.
When debug_pagealloc is not enabled on boot, the overhead in fast paths
should be virtually none thanks to the use of static key.
Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "debug_pagealloc improvements".
I have been recently debugging some pcplist corruptions, where it would be
useful to perform struct page checks immediately as pages are allocated
from and freed to pcplists, which is now only possible by rebuilding the
kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).
To make this kind of debugging simpler in future on a distro kernel, I
have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
and extended it to perform the struct page checks more often when enabled
(Patch 2). Now it can be configured in when building a distro kernel
without extra overhead, and debugging page use after free or double free
can be enabled simply by rebooting with debug_pagealloc=on.
This patch (of 3):
CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc5743f
("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
allow being always enabled in a distro kernel, but only perform its
expensive functionality when booted with debug_pagelloc=on. We can
further reduce the overhead when not boot-enabled (including page
allocator fast paths) using static keys. This patch introduces one for
debug_pagealloc core functionality, and another for the optional guard
page functionality (enabled by booting with debug_guardpage_minorder=X).
Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When failslab was originally written, the intention of the
"ignore-gfp-wait" flag default value ("N") was to fail GFP_ATOMIC
allocations. Those were defined as (__GFP_HIGH), and the code would test
for __GFP_WAIT (0x10u).
However, since then, __GFP_WAIT was replaced by __GFP_RECLAIM
(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM), and GFP_ATOMIC is now
defined as (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM).
This means that when the flag is false, almost no allocation ever fails
(as even GFP_ATOMIC allocations contain ___GFP_KSWAPD_RECLAIM).
Restore the original intent of the code, by ignoring calls that directly
reclaim only (__GFP_DIRECT_RECLAIM), and thus, failing GFP_ATOMIC calls
again by default.
Link: http://lkml.kernel.org/r/20190520214514.81360-1-drinkcat@chromium.org
Fixes: 71baba4b92 ("mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM")
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously totalram_pages was the global variable. Currently,
totalram_pages is the static inline function from the include/linux/mm.h
However, the function is also marked as EXPORT_SYMBOL, which is at best an
odd combination. Because there is no point for the static inline function
from a public header to be exported, this commit removes the
EXPORT_SYMBOL() marking. It will be still possible to use the function in
modules because all the symbols it depends on are exported.
Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com
Fixes: ca79b0c211 ("mm: convert totalram_pages and totalhigh_pages variables to atomic")
Signed-off-by: Denis Efremov <efremov@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
undo_isolate_page_range() never fails, so no need to return value.
Link: http://lkml.kernel.org/r/1562075604-8979-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
account_page_dirtied() is only used by our set_page_dirty() helpers and
should not be used anywhere else.
Link: http://lkml.kernel.org/r/20190605183702.30572-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the success case use the same cleanup path as the failure case.
Link: http://lkml.kernel.org/r/20190523134024.GC24093@localhost.localdomain
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
follow_page_mask() is only used in gup.c, make it static.
Link: http://lkml.kernel.org/r/20190510190831.GA4061@bharath12345-Inspiron-5559
Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ksize() has been unconditionally unpoisoning the whole shadow memory
region associated with an allocation. This can lead to various undetected
bugs, for example, double-kzfree().
Specifically, kzfree() uses ksize() to determine the actual allocation
size, and subsequently zeroes the memory. Since ksize() used to just
unpoison the whole shadow memory region, no invalid free was detected.
This patch addresses this as follows:
1. Add a check in ksize(), and only then unpoison the memory region.
2. Preserve kasan_unpoison_slab() semantics by explicitly unpoisoning
the shadow memory region using the size obtained from __ksize().
Tested:
1. With SLAB allocator: a) normal boot without warnings; b) verified the
added double-kzfree() is detected.
2. With SLUB allocator: a) normal boot without warnings; b) verified the
added double-kzfree() is detected.
[elver@google.com: s/BUG_ON/WARN_ON_ONCE/, per Kees]
Link: http://lkml.kernel.org/r/20190627094445.216365-6-elver@google.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199359
Link: http://lkml.kernel.org/r/20190626142014.141844-6-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This refactors common code of ksize() between the various allocators into
slab_common.c: __ksize() is the allocator-specific implementation without
instrumentation, whereas ksize() includes the required KASAN logic.
Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes {,__}kasan_check_{read,write} functions to return a boolean
denoting if the access was valid or not.
[sfr@canb.auug.org.au: include types.h for "bool"]
Link: http://lkml.kernel.org/r/20190705184949.13cdd021@canb.auug.org.au
Link: http://lkml.kernel.org/r/20190626142014.141844-3-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/kasan: Add object validation in ksize()", v3.
This patch (of 5):
This introduces __kasan_check_{read,write}. __kasan_check functions may
be used from anywhere, even compilation units that disable instrumentation
selectively.
This change eliminates the need for the __KASAN_INTERNAL definition.
[elver@google.com: v5]
Link: http://lkml.kernel.org/r/20190708170706.174189-2-elver@google.com
Link: http://lkml.kernel.org/r/20190626142014.141844-2-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support for printing stack frame description on invalid stack
accesses. The frame description is embedded by the compiler, which is
parsed and then pretty-printed.
Currently, we can only print the stack frame info for accesses to the
task's own stack, but not accesses to other tasks' stacks.
Example of what it looks like:
page dumped because: kasan: bad access detected
addr ffff8880673ef98a is located in stack of task insmod/2008 at offset 106 in frame:
kasan_stack_oob+0x0/0xf5 [test_kasan]
this frame has 2 objects:
[32, 36) 'i'
[96, 106) 'stack_array'
Memory state around the buggy address:
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198435
Link: http://lkml.kernel.org/r/20190522100048.146841-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to POSIX, EBUSY means that the "device or resource is busy", and
this can lead to people thinking that the file
`/sys/kernel/debug/kmemleak/` is somehow locked or being used by other
process. Change this error code to a more appropriate one.
Link: http://lkml.kernel.org/r/20190612155231.19448-1-andrealmeid@collabora.com
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently for CONFIG_SLUB, if a memcg kmem cache creation is failed and
the corresponding root kmem cache has SLAB_PANIC flag, the kernel will
be crashed. This is unnecessary as the kernel can handle the creation
failures of memcg kmem caches. Additionally CONFIG_SLAB does not
implement this behavior. So, to keep the behavior consistent between
SLAB and SLUB, removing the panic for memcg kmem cache creation
failures. The root kmem cache creation failure for SLAB_PANIC correctly
panics for both SLAB and SLUB.
Link: http://lkml.kernel.org/r/20190619232514.58994-1-shakeelb@google.com
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If ',' is not found, kmem_cache_flags() calls strlen() to find the end of
line. We can do it in a single pass using strchrnul().
Link: http://lkml.kernel.org/r/20190501053111.7950-1-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Acked-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This avoids any possible type confusion when looking up an object. For
example, if a non-slab were to be passed to kfree(), the invalid
slab_cache pointer (i.e. overlapped with some other value from the
struct page union) would be used for subsequent slab manipulations that
could lead to further memory corruption.
Since the page is already in cache, adding the PageSlab() check will
have nearly zero cost, so add a check and WARN() to virt_to_cache().
Additionally replaces an open-coded virt_to_cache(). To support the
failure mode this also updates all callers of virt_to_cache() and
cache_from_obj() to handle a NULL cache pointer return value (though
note that several already handle this case gracefully).
[dan.carpenter@oracle.com: restore IRQs in kfree()]
Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda
Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/slab: Improved sanity checking".
This adds defenses against slab cache confusion (as seen in real-world
exploits[1]) and gracefully handles type confusions when trying to look
up slab caches from an arbitrary page. (Also is patch 3: new LKDTM
tests for these defenses as well as for the existing double-free
detection.
This patch (of 3):
When building under CONFIG_SLAB_FREELIST_HARDENING, it makes sense to
perform sanity-checking on the assumed slab cache during
kmem_cache_free() to make sure the kernel doesn't mix freelists across
slab caches and corrupt memory (as seen in the exploitation of flaws
like CVE-2018-9568[1]). Note that the prior code might WARN() but still
corrupt memory (i.e. return the assumed cache instead of the owned
cache).
There is no noticeable performance impact (changes are within noise).
Measuring parallel kernel builds, I saw the following with
CONFIG_SLAB_FREELIST_HARDENED, before and after this patch:
before:
Run times: 288.85 286.53 287.09 287.07 287.21
Min: 286.53 Max: 288.85 Mean: 287.35 Std Dev: 0.79
after:
Run times: 289.58 287.40 286.97 287.20 287.01
Min: 286.97 Max: 289.58 Mean: 287.63 Std Dev: 0.99
Delta: 0.1% which is well below the standard deviation
[1] https://github.com/ThomasKing2014/slides/raw/master/Building%20universal%20Android%20rooting%20with%20a%20type%20confusion%20vulnerability.pdf
Link: http://lkml.kernel.org/r/20190530045017.15252-2-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following zsmalloc.c's example we call trylock_page() and unlock_page().
Also make z3fold_page_migrate() assert that newpage is passed in locked,
as per the documentation.
[akpm@linux-foundation.org: fix trylock_page return value test, per Shakeel]
Link: http://lkml.kernel.org/r/20190702005122.41036-1-henryburns@google.com
Link: http://lkml.kernel.org/r/20190702233538.52793-1-henryburns@google.com
Signed-off-by: Henry Burns <henryburns@google.com>
Suggested-by: Vitaly Wool <vitalywool@gmail.com>
Acked-by: Vitaly Wool <vitalywool@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Xidong Wang <wangxidong_97@163.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we calculate total statistics for memcg1_stats and memcg1_events,
we use the the index 'i' in the for loop as the events index. Actually
we should use memcg1_stats[i] and memcg1_events[i] as the events index.
Link: http://lkml.kernel.org/r/1562116978-19539-1-git-send-email-laoar.shao@gmail.com
Fixes: 42a3003535 ("mm: memcontrol: fix recursive statistics correctness & scalabilty").
Signed-off-by: Yafang Shao <laoar.shao@gmail.com
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When file refaults are detected and there are many inactive file pages,
the system never reclaim anonymous pages, the file pages are dropped
aggressively when there are still a lot of cold anonymous pages and
system thrashes. This issue impacts the performance of applications
with large executable, e.g. chrome.
With this patch, when file refault is detected, inactive_list_is_low()
always returns true for file pages in get_scan_count() to enable
scanning anonymous pages.
The problem can be reproduced by the following test program.
---8<---
void fallocate_file(const char *filename, off_t size)
{
struct stat st;
int fd;
if (!stat(filename, &st) && st.st_size >= size)
return;
fd = open(filename, O_WRONLY | O_CREAT, 0600);
if (fd < 0) {
perror("create file");
exit(1);
}
if (posix_fallocate(fd, 0, size)) {
perror("fallocate");
exit(1);
}
close(fd);
}
long *alloc_anon(long size)
{
long *start = malloc(size);
memset(start, 1, size);
return start;
}
long access_file(const char *filename, long size, long rounds)
{
int fd, i;
volatile char *start1, *end1, *start2;
const int page_size = getpagesize();
long sum = 0;
fd = open(filename, O_RDONLY);
if (fd == -1) {
perror("open");
exit(1);
}
/*
* Some applications, e.g. chrome, use a lot of executable file
* pages, map some of the pages with PROT_EXEC flag to simulate
* the behavior.
*/
start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
fd, 0);
if (start1 == MAP_FAILED) {
perror("mmap");
exit(1);
}
end1 = start1 + size / 2;
start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
if (start2 == MAP_FAILED) {
perror("mmap");
exit(1);
}
for (i = 0; i < rounds; ++i) {
struct timeval before, after;
volatile char *ptr1 = start1, *ptr2 = start2;
gettimeofday(&before, NULL);
for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
sum += *ptr1 + *ptr2;
gettimeofday(&after, NULL);
printf("File access time, round %d: %f (sec)
", i,
(after.tv_sec - before.tv_sec) +
(after.tv_usec - before.tv_usec) / 1000000.0);
}
return sum;
}
int main(int argc, char *argv[])
{
const long MB = 1024 * 1024;
long anon_mb, file_mb, file_rounds;
const char filename[] = "large";
long *ret1;
long ret2;
if (argc != 4) {
printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS
");
exit(0);
}
anon_mb = atoi(argv[1]);
file_mb = atoi(argv[2]);
file_rounds = atoi(argv[3]);
fallocate_file(filename, file_mb * MB);
printf("Allocate %ld MB anonymous pages
", anon_mb);
ret1 = alloc_anon(anon_mb * MB);
printf("Access %ld MB file pages
", file_mb);
ret2 = access_file(filename, file_mb * MB, file_rounds);
printf("Print result to prevent optimization: %ld
",
*ret1 + ret2);
return 0;
}
---8<---
Running the test program on 2GB RAM VM with kernel 5.2.0-rc5, the program
fills ram with 2048 MB memory, access a 200 MB file for 10 times. Without
this patch, the file cache is dropped aggresively and every access to the
file is from disk.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.489316 (sec)
File access time, round 1: 2.581277 (sec)
File access time, round 2: 2.487624 (sec)
File access time, round 3: 2.449100 (sec)
File access time, round 4: 2.420423 (sec)
File access time, round 5: 2.343411 (sec)
File access time, round 6: 2.454833 (sec)
File access time, round 7: 2.483398 (sec)
File access time, round 8: 2.572701 (sec)
File access time, round 9: 2.493014 (sec)
With this patch, these file pages can be cached.
$ ./thrash 2048 200 10
Allocate 2048 MB anonymous pages
Access 200 MB file pages
File access time, round 0: 2.475189 (sec)
File access time, round 1: 2.440777 (sec)
File access time, round 2: 2.411671 (sec)
File access time, round 3: 1.955267 (sec)
File access time, round 4: 0.029924 (sec)
File access time, round 5: 0.000808 (sec)
File access time, round 6: 0.000771 (sec)
File access time, round 7: 0.000746 (sec)
File access time, round 8: 0.000738 (sec)
File access time, round 9: 0.000747 (sec)
Checked the swap out stats during the test [1], 19006 pages swapped out
with this patch, 3418 pages swapped out without this patch. There are
more swap out, but I think it's within reasonable range when file backed
data set doesn't fit into the memory.
$ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS
PROCESSES Allocate 2000 MB anonymous pages active_anon: 1613644,
inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB)
pswpout: 7972443, pgpgin: 478615246 Access 100 MB executable file pages
Access 2100 MB regular file pages File access time, round 0: 12.165,
(sec) active_anon: 1433788, inactive_anon: 478116, active_file: 17896,
inactive_file: 24328 (kB) File access time, round 1: 11.493, (sec)
active_anon: 1430576, inactive_anon: 477144, active_file: 25440,
inactive_file: 26172 (kB) File access time, round 2: 11.455, (sec)
active_anon: 1427436, inactive_anon: 476060, active_file: 21112,
inactive_file: 28808 (kB) File access time, round 3: 11.454, (sec)
active_anon: 1420444, inactive_anon: 473632, active_file: 23216,
inactive_file: 35036 (kB) File access time, round 4: 11.479, (sec)
active_anon: 1413964, inactive_anon: 471460, active_file: 31728,
inactive_file: 32224 (kB) pswpout: 7991449 (+ 19006), pgpgin: 489924366
(+ 11309120)
With 4 processes accessing non-overlapping parts of a large file, 30316
pages swapped out with this patch, 5152 pages swapped out without this
patch. The swapout number is small comparing to pgpgin.
[1]: https://github.com/vovo/testing/blob/master/mem_thrash.c
Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com
Fixes: e986850598 ("mm,vmscan: only evict file pages when we have plenty")
Fixes: 7c5bd705d8 ("mm: memcg: only evict file pages when we have plenty")
Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the "large" pull request for char and misc and other assorted
smaller driver subsystems for 5.3-rc1.
It seems that this tree is becoming the funnel point of lots of smaller
driver subsystems, which is fine for me, but that's why it is getting
larger over time and does not just contain stuff under drivers/char/ and
drivers/misc.
Lots of small updates all over the place here from different driver
subsystems:
- habana driver updates
- coresight driver updates
- documentation file movements and updates
- Android binder fixes and updates
- extcon driver updates
- google firmware driver updates
- fsi driver updates
- smaller misc and char driver updates
- soundwire driver updates
- nvmem driver updates
- w1 driver fixes
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSXmoQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylV9wCgyJGbpPch8v/ecrZGFHYS4sIMexIAoMco3zf6
wnqFmXiz1O0tyo1sgV9R
=7sqO
-----END PGP SIGNATURE-----
Merge tag 'char-misc-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc driver updates from Greg KH:
"Here is the "large" pull request for char and misc and other assorted
smaller driver subsystems for 5.3-rc1.
It seems that this tree is becoming the funnel point of lots of
smaller driver subsystems, which is fine for me, but that's why it is
getting larger over time and does not just contain stuff under
drivers/char/ and drivers/misc.
Lots of small updates all over the place here from different driver
subsystems:
- habana driver updates
- coresight driver updates
- documentation file movements and updates
- Android binder fixes and updates
- extcon driver updates
- google firmware driver updates
- fsi driver updates
- smaller misc and char driver updates
- soundwire driver updates
- nvmem driver updates
- w1 driver fixes
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (188 commits)
coresight: Do not default to CPU0 for missing CPU phandle
dt-bindings: coresight: Change CPU phandle to required property
ocxl: Allow contexts to be attached with a NULL mm
fsi: sbefifo: Don't fail operations when in SBE IPL state
coresight: tmc: Smatch: Fix potential NULL pointer dereference
coresight: etm3x: Smatch: Fix potential NULL pointer dereference
coresight: Potential uninitialized variable in probe()
coresight: etb10: Do not call smp_processor_id from preemptible
coresight: tmc-etf: Do not call smp_processor_id from preemptible
coresight: tmc-etr: alloc_perf_buf: Do not call smp_processor_id from preemptible
coresight: tmc-etr: Do not call smp_processor_id() from preemptible
docs: misc-devices: convert files without extension to ReST
fpga: dfl: fme: align PR buffer size per PR datawidth
fpga: dfl: fme: remove copy_to_user() in ioctl for PR
fpga: dfl-fme-mgr: fix FME_PR_INTFC_ID register address.
intel_th: msu: Start read iterator from a non-empty window
intel_th: msu: Split sgt array and pointer in multiwindow mode
intel_th: msu: Support multipage blocks
intel_th: pci: Add Ice Lake NNPI support
intel_th: msu: Fix single mode with disabled IOMMU
...
lookups.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl0lFIoACgkQ8vlZVpUN
gaOwNQf/aJxFxHVf4t3lga8kfoMhlbwINQknsGUVwg32HporMa1NxQXjbEMMhs6V
A31gBJ44nYVz1enz7nvbE4kx4quF4E8rDVprEetphv4i8GSdUAihwJwY5/H0oSd8
rxzTZzNKddoyN/j7H4LgAh7bo6IFk54kUuaAWuZDJnJtfLNQ6RBaIwg6u6Z8Fael
9H3u/RtFHqWPQp5j50PMUG06abr26GKi1gLL+yeoFD1tuzC54B5i6uy34amrXlon
5agIQ7YuB9bigK4VaLoF4df7o+7+Oa6ENaQ9O/TQc9Uy9ngdVlPpNb2bVDizRLNn
e369sBFTf3C8sMycJy6x9TCqg2B7Hw==
=EpCF
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Many bug fixes and cleanups, and an optimization for case-insensitive
lookups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix coverity warning on error path of filename setup
ext4: replace ktype default_attrs with default_groups
ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()
ext4: refactor initialize_dirent_tail()
ext4: rename "dirent_csum" functions to use "dirblock"
ext4: allow directory holes
jbd2: drop declaration of journal_sync_buffer()
ext4: use jbd2_inode dirty range scoping
jbd2: introduce jbd2_inode dirty range scoping
mm: add filemap_fdatawait_range_keep_errors()
ext4: remove redundant assignment to node
ext4: optimize case-insensitive lookups
ext4: make __ext4_get_inode_loc plug
ext4: clean up kerneldoc warnigns when building with W=1
ext4: only set project inherit bit for directory
ext4: enforce the immutable flag on open files
ext4: don't allow any modifications to an immutable file
jbd2: fix typo in comment of journal_submit_inode_data_buffers
jbd2: fix some print format mistakes
ext4: gracefully handle ext4_break_layouts() failure during truncate
- Create a generic copy_file_range handler and make individual
filesystems responsible for calling it (i.e. no more assuming that
do_splice_direct will work or is appropriate)
- Refactor copy_file_range and remap_range parameter checking where they
are the same
- Install missing copy_file_range parameter checking(!)
- Remove suid/sgid and update mtime like any other file write
- Change the behavior so that a copy range crossing the source file's
eof will result in a short copy to the source file's eof instead of
EINVAL
- Permit filesystems to decide if they want to handle cross-superblock
copy_file_range in their local handlers.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0BGvAACgkQ+H93GTRK
tOu2aw/+KGG7PiXm9ED3ZXUppKVddrZMOgqM7mSfHo6TBgW3pJUJcRIhawK0Wz/P
stgTsOkurHSl3iT3vQyX4GTZvLoGN/rfsRLPxogJptBUqVv3BOrXsrI53f7V/kbm
rtjlYsgExji7VBUiMTe5kOWWqxyR7B4nXyvY/8rier57rW/8C1I58B0OrxAmTK0k
rz1e5BtE1dg91xA7cSdEc38FInz8MW8cvsrEzW9vyYY4IVE0PBuhhA1EvryxTrAZ
hfthHFfzwxhJkI0mdha8uqNufNWrHLSqiwyjYC7pwAwSQzQPiQz9U17flu+URnfF
kXaR5LdXbBP3pl46RdthrfuonWsv612cC1Qwfjs8PBG9lG7b9PGJ40MGVTiw7LlQ
924/03ho0zAnV0E8Qn5O9nPshQNDJhwhzMS39EmMyFKb1D5XGzdMV0gDdIfx6hdO
HDbw6VQ33S59gvk7v/gxsFB5Bs4PKfamHx/QmwQwpqWM5XExcr0yJ90OTBtAuY4r
S+9gwG6uED3aPh8HbQ5UgnA8bZmMmi8AkcBvqJ9GgNw5SbZl0oyv9Sj6JNpoOejV
8y9JkhoZUxqiihnKTw/vtMrj5RCOfifNBjMSwrShfLdLKtK0AZl1mXC0/1Q3VnEQ
TUcyRHEzrtHgJ9/AK9xIyDNvNYzvHSLZj7maoZZumgQa2FOFrmw=
=qM44
-----END PGP SIGNATURE-----
Merge tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull copy_file_range updates from Darrick Wong:
"This fixes numerous parameter checking problems and inconsistent
behaviors in the new(ish) copy_file_range system call.
Now the system call will actually check its range parameters
correctly; refuse to copy into files for which the caller does not
have sufficient privileges; update mtime and strip setuid like file
writes are supposed to do; and allows copying up to the EOF of the
source file instead of failing the call like we used to.
Summary:
- Create a generic copy_file_range handler and make individual
filesystems responsible for calling it (i.e. no more assuming that
do_splice_direct will work or is appropriate)
- Refactor copy_file_range and remap_range parameter checking where
they are the same
- Install missing copy_file_range parameter checking(!)
- Remove suid/sgid and update mtime like any other file write
- Change the behavior so that a copy range crossing the source file's
eof will result in a short copy to the source file's eof instead of
EINVAL
- Permit filesystems to decide if they want to handle
cross-superblock copy_file_range in their local handlers"
* tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fuse: copy_file_range needs to strip setuid bits and update timestamps
vfs: allow copy_file_range to copy across devices
xfs: use file_modified() helper
vfs: introduce file_modified() helper
vfs: add missing checks to copy_file_range
vfs: remove redundant checks from generic_remap_checks()
vfs: introduce generic_file_rw_checks()
vfs: no fallback for ->copy_file_range
vfs: introduce generic_copy_file_range()
- A fair pile of RST conversions, many from Mauro. These create more
than the usual number of simple but annoying merge conflicts with other
trees, unfortunately. He has a lot more of these waiting on the wings
that, I think, will go to you directly later on.
- A new document on how to use merges and rebases in kernel repos, and one
on Spectre vulnerabilities.
- Various improvements to the build system, including automatic markup of
function() references because some people, for reasons I will never
understand, were of the opinion that :c:func:``function()`` is
unattractive and not fun to type.
- We now recommend using sphinx 1.7, but still support back to 1.4.
- Lots of smaller improvements, warning fixes, typo fixes, etc.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl0krAEPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Yg98H/AuLqO9LpOgUjF4LhyjxGPdzJkY9RExSJ7km
gznyreLCZgFaJR+AY6YDsd4Jw6OJlPbu1YM/Qo3C3WrZVFVhgL/s2ebvBgCo50A8
raAFd8jTf4/mGCHnAqRotAPQ3mETJUk315B66lBJ6Oc+YdpRhwXWq8ZW2bJxInFF
3HDvoFgMf0KhLuMHUkkL0u3fxH1iA+KvDu8diPbJYFjOdOWENz/CV8wqdVkXRSEW
DJxIq89h/7d+hIG3d1I7Nw+gibGsAdjSjKv4eRKauZs4Aoxd1Gpl62z0JNk6aT3m
dtq4joLdwScydonXROD/Twn2jsu4xYTrPwVzChomElMowW/ZBBY=
=D0eO
-----END PGP SIGNATURE-----
Merge tag 'docs-5.3' of git://git.lwn.net/linux
Pull Documentation updates from Jonathan Corbet:
"It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro. These create more
than the usual number of simple but annoying merge conflicts with
other trees, unfortunately. He has a lot more of these waiting on
the wings that, I think, will go to you directly later on.
- A new document on how to use merges and rebases in kernel repos,
and one on Spectre vulnerabilities.
- Various improvements to the build system, including automatic
markup of function() references because some people, for reasons I
will never understand, were of the opinion that
:c:func:``function()`` is unattractive and not fun to type.
- We now recommend using sphinx 1.7, but still support back to 1.4.
- Lots of smaller improvements, warning fixes, typo fixes, etc"
* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
docs: automarkup.py: ignore exceptions when seeking for xrefs
docs: Move binderfs to admin-guide
Disable Sphinx SmartyPants in HTML output
doc: RCU callback locks need only _bh, not necessarily _irq
docs: format kernel-parameters -- as code
Doc : doc-guide : Fix a typo
platform: x86: get rid of a non-existent document
Add the RCU docs to the core-api manual
Documentation: RCU: Add TOC tree hooks
Documentation: RCU: Rename txt files to rst
Documentation: RCU: Convert RCU UP systems to reST
Documentation: RCU: Convert RCU linked list to reST
Documentation: RCU: Convert RCU basic concepts to reST
docs: filesystems: Remove uneeded .rst extension on toctables
scripts/sphinx-pre-install: fix out-of-tree build
docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
Documentation: PGP: update for newer HW devices
Documentation: Add section about CPU vulnerabilities for Spectre
Documentation: platform: Delete x86-laptop-drivers.txt
docs: Note that :c:func: should no longer be used
...
Pull force_sig() argument change from Eric Biederman:
"A source of error over the years has been that force_sig has taken a
task parameter when it is only safe to use force_sig with the current
task.
The force_sig function is built for delivering synchronous signals
such as SIGSEGV where the userspace application caused a synchronous
fault (such as a page fault) and the kernel responded with a signal.
Because the name force_sig does not make this clear, and because the
force_sig takes a task parameter the function force_sig has been
abused for sending other kinds of signals over the years. Slowly those
have been fixed when the oopses have been tracked down.
This set of changes fixes the remaining abusers of force_sig and
carefully rips out the task parameter from force_sig and friends
making this kind of error almost impossible in the future"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
signal: Remove the signal number and task parameters from force_sig_info
signal: Factor force_sig_info_to_task out of force_sig_info
signal: Generate the siginfo in force_sig
signal: Move the computation of force into send_signal and correct it.
signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
signal: Remove the task parameter from force_sig_fault
signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
signal: Explicitly call force_sig_fault on current
signal/unicore32: Remove tsk parameter from __do_user_fault
signal/arm: Remove tsk parameter from __do_user_fault
signal/arm: Remove tsk parameter from ptrace_break
signal/nds32: Remove tsk parameter from send_sigtrap
signal/riscv: Remove tsk parameter from do_trap
signal/sh: Remove tsk parameter from force_sig_info_fault
signal/um: Remove task parameter from send_sigtrap
signal/x86: Remove task parameter from send_sigtrap
signal: Remove task parameter from force_sig_mceerr
signal: Remove task parameter from force_sig
signal: Remove task parameter from force_sigsegv
...
- arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}
- Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
manage the permissions of executable vmalloc regions more strictly
- Slight performance improvement by keeping softirqs enabled while
touching the FPSIMD/SVE state (kernel_neon_begin/end)
- Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new XAFLAG
and AXFLAG instructions for floating point comparison flags
manipulation) and FRINT (rounding floating point numbers to integers)
- Re-instate ARM64_PSEUDO_NMI support which was previously marked as
BROKEN due to some bugs (now fixed)
- Improve parking of stopped CPUs and implement an arm64-specific
panic_smp_self_stop() to avoid warning on not being able to stop
secondary CPUs during panic
- perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
platforms
- perf: DDR performance monitor support for iMX8QXP
- cache_line_size() can now be set from DT or ACPI/PPTT if provided to
cope with a system cache info not exposed via the CPUID registers
- Avoid warning on hardware cache line size greater than
ARCH_DMA_MINALIGN if the system is fully coherent
- arm64 do_page_fault() and hugetlb cleanups
- Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)
- Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the 'arm_boot_flags'
introduced in 5.1)
- CONFIG_RANDOMIZE_BASE now enabled in defconfig
- Allow the selection of ARM64_MODULE_PLTS, currently only done via
RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
over into the vmalloc area
- Make ZONE_DMA32 configurable
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl0eHqcACgkQa9axLQDI
XvFyNA/+L+bnkz8m3ncydlqqfXomQn4eJJVQ8Uksb0knJz+1+3CUxxbO4ry4jXZN
fMkbggYrDPRKpDbsUl0lsRipj7jW9bqan+N37c3SWqCkgb6HqDaHViwxdx6Ec/Uk
gHudozDSPh/8c7hxGcSyt/CFyuW6b+8eYIQU5rtIgz8aVY2BypBvS/7YtYCbIkx0
w4CFleRTK1zXD5mJQhrc6jyDx659sVkrAvdhf6YIymOY8nBTv40vwdNo3beJMYp8
Po/+0Ixu+VkHUNtmYYZQgP/AGH96xiTcRnUqd172JdtRPpCLqnLqwFokXeVIlUKT
KZFMDPzK+756Ayn4z4huEePPAOGlHbJje8JVNnFyreKhVVcCotW7YPY/oJR10bnc
eo7yD+DxABTn+93G2yP436bNVa8qO1UqjOBfInWBtnNFJfANIkZweij/MQ6MjaTA
o7KtviHnZFClefMPoiI7HDzwL8XSmsBDbeQ04s2Wxku1Y2xUHLx4iLmadwLQ1ZPb
lZMTZP3N/T1554MoURVA1afCjAwiqU3bt1xDUGjbBVjLfSPBAn/25IacsG9Li9AF
7Rp1M9VhrfLftjFFkB2HwpbhRASOxaOSx+EI3kzEfCtM2O9I1WHgP3rvCdc3l0HU
tbK0/IggQicNgz7GSZ8xDlWPwwSadXYGLys+xlMZEYd3pDIOiFc=
=0TDT
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}
- Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
manage the permissions of executable vmalloc regions more strictly
- Slight performance improvement by keeping softirqs enabled while
touching the FPSIMD/SVE state (kernel_neon_begin/end)
- Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
XAFLAG and AXFLAG instructions for floating point comparison flags
manipulation) and FRINT (rounding floating point numbers to integers)
- Re-instate ARM64_PSEUDO_NMI support which was previously marked as
BROKEN due to some bugs (now fixed)
- Improve parking of stopped CPUs and implement an arm64-specific
panic_smp_self_stop() to avoid warning on not being able to stop
secondary CPUs during panic
- perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
platforms
- perf: DDR performance monitor support for iMX8QXP
- cache_line_size() can now be set from DT or ACPI/PPTT if provided to
cope with a system cache info not exposed via the CPUID registers
- Avoid warning on hardware cache line size greater than
ARCH_DMA_MINALIGN if the system is fully coherent
- arm64 do_page_fault() and hugetlb cleanups
- Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)
- Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
'arm_boot_flags' introduced in 5.1)
- CONFIG_RANDOMIZE_BASE now enabled in defconfig
- Allow the selection of ARM64_MODULE_PLTS, currently only done via
RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
over into the vmalloc area
- Make ZONE_DMA32 configurable
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
perf: arm_spe: Enable ACPI/Platform automatic module loading
arm_pmu: acpi: spe: Add initial MADT/SPE probing
ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
ACPI/PPTT: Modify node flag detection to find last IDENTICAL
x86/entry: Simplify _TIF_SYSCALL_EMU handling
arm64: rename dump_instr as dump_kernel_instr
arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
arm64: Implement panic_smp_self_stop()
arm64: Improve parking of stopped CPUs
arm64: Expose FRINT capabilities to userspace
arm64: Expose ARMv8.5 CondM capability to userspace
arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
arm64: ARM64_MODULES_PLTS must depend on MODULES
arm64: bpf: do not allocate executable memory
arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
arm64: module: create module allocations without exec permissions
arm64: Allow user selection of ARM64_MODULE_PLTS
acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
arm64: Allow selecting Pseudo-NMI again
...
swap_readpage() sets waiter = bio->bi_private even if synchronous = F,
this means that the caller can get the spurious wakeup after return.
This can be fatal if blk_wake_io_task() does
set_current_state(TASK_RUNNING) after the caller does
set_special_state(), in the worst case the kernel can crash in
do_task_dead().
Link: http://lkml.kernel.org/r/20190704160301.GA5956@redhat.com
Fixes: 0619317ff8 ("block: add polled wakeup task helper")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In production we have noticed hard lockups on large machines running
large jobs due to kswaps hoarding lru lock within isolate_lru_pages when
sc->reclaim_idx is 0 which is a small zone. The lru was couple hundred
GiBs and the condition (page_zonenum(page) > sc->reclaim_idx) in
isolate_lru_pages() was basically skipping GiBs of pages while holding
the LRU spinlock with interrupt disabled.
On further inspection, it seems like there are two issues:
(1) If kswapd on the return from balance_pgdat() could not sleep (i.e.
node is still unbalanced), the classzone_idx is unintentionally set
to 0 and the whole reclaim cycle of kswapd will try to reclaim only
the lowest and smallest zone while traversing the whole memory.
(2) Fundamentally isolate_lru_pages() is really bad when the
allocation has woken kswapd for a smaller zone on a very large machine
running very large jobs. It can hoard the LRU spinlock while skipping
over 100s of GiBs of pages.
This patch only fixes (1). (2) needs a more fundamental solution. To
fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
invalid use the classzone_idx of the previous kswapd loop otherwise use
the one the waker has requested.
Link: http://lkml.kernel.org/r/20190701201847.251028-1-shakeelb@google.com
Fixes: e716f2eb24 ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0e56acae4b ("mm: initialize MAX_ORDER_NR_PAGES at a time
instead of doing larger sections") is causing a regression on some
systems when the kernel is booted as Xen dom0.
The system will just hang in early boot.
Reason is an endless loop in get_page_from_freelist() in case the first
zone looked at has no free memory. deferred_grow_zone() is always
returning true due to the following code snipplet:
/* If the zone is empty somebody else may have cleared out the zone */
if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
first_deferred_pfn)) {
pgdat->first_deferred_pfn = ULONG_MAX;
pgdat_resize_unlock(pgdat, &flags);
return true;
}
This in turn results in the loop as get_page_from_freelist() is assuming
forward progress can be made by doing some more struct page
initialization.
Link: http://lkml.kernel.org/r/20190620160821.4210-1-jgross@suse.com
Fixes: 0e56acae4b ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections")
Signed-off-by: Juergen Gross <jgross@suse.com>
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No point having two call sites (earlier in init_rootfs() from
mnt_init() in case we are going to use shmem-style rootfs,
later from do_basic_setup() unconditionally), along with the
logics in shmem_init() itself to make the second call a no-op...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pcpu_setup_first_chunk() will panic or BUG_ON if the are some
error and doesn't return any error, hence it can be defined to
return void.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
[Dennis: fixed kbuild warning for pcpu_page_first_chunk()]
Christoph Hellwig says:
====================
Below is a series that cleans up the dev_pagemap interface so that it is
more easily usable, which removes the need to wrap it in hmm and thus
allowing to kill a lot of code
Changes since v3:
- pull in "mm/swap: Fix release_pages() when releasing devmap pages" and
rebase the other patches on top of that
- fold the hmm_devmem_add_resource into the DEVICE_PUBLIC memory removal
patch
- remove _vm_normal_page as it isn't needed without DEVICE_PUBLIC memory
- pick up various ACKs
Changes since v2:
- fix nvdimm kunit build
- add a new memory type for device dax
- fix a few issues in intermediate patches that didn't show up in the end
result
- incorporate feedback from Michal Hocko, including killing of
the DEVICE_PUBLIC memory type entirely
Changes since v1:
- rebase
- also switch p2pdma to the internal refcount
- add type checking for pgmap->type
- rename the migrate method to migrate_to_ram
- cleanup the altmap_valid flag
- various tidbits from the reviews
====================
Conflicts resolved by:
- Keeping Ira's version of the code in swap.c
- Using the delete for the section in hmm.rst
- Using the delete for the devmap code in hmm.c and .h
* branch 'hmm-devmem-cleanup.4': (24 commits)
mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
mm: remove the HMM config option
mm: sort out the DEVICE_PRIVATE Kconfig mess
mm: simplify ZONE_DEVICE page private data
mm: remove hmm_devmem_add
mm: remove hmm_vma_alloc_locked_page
nouveau: use devm_memremap_pages directly
nouveau: use alloc_page_vma directly
PCI/P2PDMA: use the dev_pagemap internal refcount
device-dax: use the dev_pagemap internal refcount
memremap: provide an optional internal refcount in struct dev_pagemap
memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
memremap: remove the data field in struct dev_pagemap
memremap: add a migrate_to_ram method to struct dev_pagemap_ops
memremap: lift the devmap_enable manipulation into devm_memremap_pages
memremap: pass a struct dev_pagemap to ->kill and ->cleanup
memremap: move dev_pagemap callbacks into a separate structure
memremap: validate the pagemap type passed to devm_memremap_pages
mm: factor out a devm_request_free_mem_region helper
mm: export alloc_pages_vma
...
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0YK7ceHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWfcH/36ep8GZHY9H1ARV
RJJoGoMnwENoq2o4eKhH3iZgUIGPq2uonazequhePwnIsrOdFGT7AeMHWSW7W0o4
wNlNFdUrTe0bvU00m+YtDwNIqgNCnFEoUbqn9H+VhAAWpSydKvhh2mlebTFO50KN
hb9+jh59Q8tbxrQdCuNF6yJATdf4hcj1V/ZZMGgF34kx+dFY4wOooSfu/eaIxXIl
fBDKN9K4Mmw8HWJvebV+ocOMZ7Zqknt1lbjx69OxpJmgxhb2Ks7heqSZanLTBPBB
oZxOlEdNPSyOjBQUlsDC2S8VJ7g5gINZk1JcFjByzE7cIPOQ2UXE72R++wwANngm
SR054NQ=
=WlA8
-----END PGP SIGNATURE-----
Merge tag 'v5.2-rc7' into rdma.git hmm
Required for dependencies in the next patches.
The migrate_vma helper is only used by noveau to migrate device private
pages around. Other HMM_MIRROR users like amdgpu or infiniband don't
need it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
All the mm/hmm.c code is better keyed off HMM_MIRROR. Also let nouveau
depend on it instead of the mix of a dummy dependency symbol plus the
actually selected one. Drop various odd dependencies, as the code is
pretty portable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ZONE_DEVICE support doesn't depend on anything HMM related, just on
various bits of arch support as indicated by the architecture. Also
don't select the option from nouveau as it isn't present in many setups,
and depend on it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove the clumsy hmm_devmem_page_{get,set}_drvdata helpers, and
instead just access the page directly. Also make the page data
a void pointer, and thus much easier to use.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There isn't really much value add in the hmm_devmem_add wrapper and
more, as using devm_memremap_pages directly now is just as simple.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The only user of it has just been removed, and there wasn't really any need
to wrap a basic memory allocator to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add a flags field to struct dev_pagemap to replace the altmap_valid
boolean to be a little more extensible. Also add a pgmap_altmap() helper
to find the optional altmap and clean up the code using the altmap using
it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
struct dev_pagemap is always embedded into a containing structure, so
there is no need to an additional private data field.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This replaces the hacky ->fault callback, which is currently directly
called from common code through a hmm specific data structure as an
exercise in layering violations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Just check if there is a ->page_free operation set and take care of the
static key enable, as well as the put using device managed resources.
Also check that a ->page_free is provided for the pgmaps types that
require it, and check for a valid type as well while we are at it.
Note that this also fixes the fact that hmm never called
dev_pagemap_put_ops and thus would leave the slow path enabled forever,
even after a device driver unload or disable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Passing the actual typed structure leads to more understandable code
vs just passing the ref member.
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The dev_pagemap is a growing too many callbacks. Move them into a
separate ops structure so that they are not duplicated for multiple
instances, and an attacker can't easily overwrite them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Keep the physical address allocation that hmm_add_device does with the
rest of the resource code, and allow future reuse of it without the hmm
wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
nouveau is currently using this through an odd hmm wrapper, and I plan
to switch it to the real thing later in this series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
->mapping isn't even used by HMM users, and the field at the same offset
in the zone_device part of the union is declared as pad. (Which btw is
rather confusing, as DAX uses ->pgmap and ->mapping from two different
sides of the union, but DAX doesn't use hmm_devmem_free).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This code is a trivial wrapper around device model helpers, which
should have been integrated into the driver device model usage from
the start. Assuming it actually had users, which it never had since
the code was added more than 1 1/2 years ago.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
release_pages() is an optimized version of a loop around put_page().
Unfortunately for devmap pages the logic is not entirely correct in
release_pages(). This is because device pages can be more than type
MEMORY_DEVICE_PUBLIC. There are in fact 4 types, private, public, FS DAX,
and PCI P2PDMA. Some of these have specific needs to "put" the page while
others do not.
This logic to handle any special needs is contained in
put_devmap_managed_page(). Therefore all devmap pages should be processed
by this function where we can contain the correct logic for a page put.
Handle all device type pages within release_pages() by calling
put_devmap_managed_page() on all devmap pages. If
put_devmap_managed_page() returns true the page has been put and we
continue with the next page. A false return of put_devmap_managed_page()
means the page did not require special processing and should fall to
"normal" processing.
This was found via code inspection while determining if release_pages()
and the new put_user_pages() could be interchangeable.[1]
[1] https://lkml.kernel.org/r/20190523172852.GA27175@iweiny-DESK2.sc.intel.com
Link: https://lkml.kernel.org/r/20190605214922.17684-1-ira.weiny@intel.com
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
0-Day test system reported some OOM regressions for several THP
(Transparent Huge Page) swap test cases. These regressions are bisected
to 6861428921 ("block: always define BIO_MAX_PAGES as 256"). In the
commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the
bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out
THP. That causes the OOM.
As in the patch description of 6861428921 ("block: always define
BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP
to swap space. So the issue is fixed via doing that in get_swap_bio().
BTW: I remember I have checked the THP swap code when 6861428921
("block: always define BIO_MAX_PAGES as 256") was merged, and thought the
THP swap code needn't to be changed. But apparently, I was wrong. I
should have done this at that time.
Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
Fixes: 6861428921 ("block: always define BIO_MAX_PAGES as 256")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc gets confused in pcpu_get_vm_areas() because there are too many
branches that affect whether 'lva' was initialized before it gets used:
mm/vmalloc.c: In function 'pcpu_get_vm_areas':
mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
insert_vmap_area_augment(lva, &va->rb_node,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
&free_vmap_area_root, &free_vmap_area_list);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/vmalloc.c:916:20: note: 'lva' was declared here
struct vmap_area *lva;
^~~
Add an intialization to NULL, and check whether this has changed before
the first use.
[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
Fixes: 68ad4a3304 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dump_oom_summary() oc->constraint is used to show oom_constraint_text,
but it hasn't been set before. So the value of it is always the default
value 0. We should inititialize it before.
Bellow is the output when memcg oom occurs,
before this patch:
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0
after this patch:
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0
Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com
Fixes: ef8444ea01 ("mm, oom: reorganize the oom report in dump_header")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wind Yu <yuzhoujian@didichuxing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
for hugepages with overcommitting enabled. That was caused by the
suboptimal code in current soft-offline code. See the following part:
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
...
} else {
/*
* We set PG_hwpoison only when the migration source hugepage
* was successfully dissolved, because otherwise hwpoisoned
* hugepage remains on free hugepage list, then userspace will
* find it as SIGBUS by allocation failure. That's not expected
* in soft-offlining.
*/
ret = dissolve_free_huge_page(page);
if (!ret) {
if (set_hwpoison_free_buddy_page(page))
num_poisoned_pages_inc();
}
}
return ret;
Here dissolve_free_huge_page() returns -EBUSY if the migration source page
was freed into buddy in migrate_pages(), but even in that case we actually
has a chance that set_hwpoison_free_buddy_page() succeeds. So that means
current code gives up offlining too early now.
dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving, where we should return success for !PageHuge() case because
the given hugepage is considered as already dissolved.
This change also affects other callers of dissolve_free_huge_page(), which
are cleaned up together.
[n-horiguchi@ah.jp.nec.com: v3]
Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
Fixes: 6bc9b56433 ("mm: fix race on soft-offlining")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pass/fail of soft offline should be judged by checking whether the
raw error page was finally contained or not (i.e. the result of
set_hwpoison_free_buddy_page()), but current code do not work like
that. It might lead us to misjudge the test result when
set_hwpoison_free_buddy_page() fails.
Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
not offline the original page and will not return an error.
Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Fixes: 6bc9b56433 ("mm: fix race on soft-offlining")
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE
mempoclicies when the tasks's cpuset's mems_allowed changes. For
policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES,
it works by remapping the policy's allowed nodes (stored in v.nodes)
using the previous value of mems_allowed (stored in
w.cpuset_mems_allowed) as the domain of map and the new mems_allowed
(passed as nodes) as the range of the map (see the comment of
bitmap_remap() for details).
The result of remapping is stored back as policy's nodemask in v.nodes,
and the new value of mems_allowed should be stored in
w.cpuset_mems_allowed to facilitate the next rebind, if it happens.
However, 213980c0f2 ("mm, mempolicy: simplify rebinding mempolicies
when updating cpusets") introduced a bug where the result of remapping
is stored in w.cpuset_mems_allowed instead. Thus, a mempolicy's
allowed nodes can evolve in an unexpected way after a series of
rebinding due to cpuset mems_allowed changes, possibly binding to a
wrong node or a smaller number of nodes which may e.g. overload them.
This patch fixes the bug so rebinding again works as intended.
[vbabka@suse.cz: new changlog]
Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz
Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com
Fixes: 213980c0f2 ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the trylock on the hmm->mirrors_sem fails the function will return
without decrementing the notifiers that were previously incremented. Since
the caller will not call invalidate_range_end() on EAGAIN this will result
in notifiers becoming permanently incremented and deadlock.
If the sync_cpu_device_pagetables() required blocking the function will
not return EAGAIN even though the device continues to touch the
pages. This is a violation of the mmu notifier contract.
Switch, and rename, the ranges_lock to a spin lock so we can reliably
obtain it without blocking during error unwind.
The error unwind is necessary since the notifiers count must be held
incremented across the call to sync_cpu_device_pagetables() as we cannot
allow the range to become marked valid by a parallel
invalidate_start/end() pair while doing sync_cpu_device_pagetables().
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
hmm_release() is called exactly once per hmm. ops->release() cannot
accidentally trigger any action that would recurse back onto
hmm->mirrors_sem.
This fixes a use after-free race of the form:
CPU0 CPU1
hmm_release()
up_write(&hmm->mirrors_sem);
hmm_mirror_unregister(mirror)
down_write(&hmm->mirrors_sem);
up_write(&hmm->mirrors_sem);
kfree(mirror)
mirror->ops->release(mirror)
The only user we have today for ops->release is an empty function, so this
is unambiguously safe.
As a consequence of plugging this race drivers are not allowed to
register/unregister mirrors from within a release op.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Trying to misuse a range outside its lifetime is a kernel bug. Use poison
bytes to help detect this condition. Double unregister will reliably crash.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
No other register/unregister kernel API attempts to provide this kind of
protection as it is inherently racy, so just drop it.
Callers should provide their own protection, and it appears nouveau
already does.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Wire up the special helper functions to manipulate aliases of vmalloc
regions in the linear map.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.
Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this work is licensed under the terms of the gnu gpl version 2 see
the copying file in the top level directory
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 35 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this work is licensed under the terms of the gnu gpl version 2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 48 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this file is released under the gpl v2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 3 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204655.103854853@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
So we can check locking at runtime.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Range functions like hmm_range_snapshot() and hmm_range_fault() call
find_vma, which requires hodling the mmget() and the mmap_sem for the mm.
Make this simpler for the callers by holding the mmget() inside the range
for the lifetime of the range. Other functions that accept a range should
only be called if the range is registered.
This has the side effect of directly preventing hmm_release() from
happening while a range is registered. That means range->dead cannot be
false during the lifetime of the range, so remove dead and
hmm_mirror_mm_is_alive() entirely.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
This list is always read and written while holding hmm->lock so there is
no need for the confusing _rcu annotations.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <iweiny@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
As coded this function can false-fail in various racy situations. Make it
reliable and simpler by running under the write side of the mmap_sem and
avoiding the false-failing compare/exchange pattern. Due to the mmap_sem
this no longer has to avoid racing with a 2nd parallel
hmm_get_or_create().
Unfortunately this still has to use the page_table_lock as the
non-sleeping lock protecting mm->hmm, since the contexts where we free the
hmm are incompatible with mmap_sem.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Pull x86 fixes from Thomas Gleixner:
"The accumulated fixes from this and last week:
- Fix vmalloc TLB flush and map range calculations which lead to
stale TLBs, spurious faults and other hard to diagnose issues.
- Use fault_in_pages_writable() for prefaulting the user stack in the
FPU code as it's less fragile than the current solution
- Use the PF_KTHREAD flag when checking for a kernel thread instead
of current->mm as the latter can give the wrong answer due to
use_mm()
- Compute the vmemmap size correctly for KASLR and 5-Level paging.
Otherwise this can end up with a way too small vmemmap area.
- Make KASAN and 5-level paging work again by making sure that all
invalid bits are masked out when computing the P4D offset. This
worked before but got broken recently when the LDT remap area was
moved.
- Prevent a NULL pointer dereference in the resource control code
which can be triggered with certain mount options when the
requested resource is not available.
- Enforce ordering of microcode loading vs. perf initialization on
secondary CPUs. Otherwise perf tries to access a non-existing MSR
as the boot CPU marked it as available.
- Don't stop the resource control group walk early otherwise the
control bitmaps are not updated correctly and become inconsistent.
- Unbreak kgdb by returning 0 on success from
kgdb_arch_set_breakpoint() instead of an error code.
- Add more Icelake CPU model defines so depending changes can be
queued in other trees"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
x86/kasan: Fix boot with 5-level paging and KASAN
x86/fpu: Don't use current->mm to check for a kthread
x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
x86/resctrl: Don't stop walking closids when a locksetup group is found
x86/fpu: Update kernel's FPU state before using for the fsave header
x86/mm/KASLR: Compute the size of the vmemmap section properly
x86/fpu: Use fault_in_pages_writeable() for pre-faulting
x86/CPU: Add more Icelake model numbers
mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
mm/vmalloc: Fix calculation of direct map addr range