To activate a page, mark_page_accessed() always holds a reference on it.
It either gets a new reference when adding a page to
lru_pvecs.activate_page or reuses an existing one it previously got when
it added a page to lru_pvecs.lru_add. So it doesn't call SetPageActive()
on a page that doesn't have any reference left. Therefore, the race is
impossible these days (I didn't brother to dig into its history).
For other paths, namely reclaim and migration, a reference count is always
held while calling SetPageActive() on a page.
SetPageSlabPfmemalloc() also uses SetPageActive(), but it's irrelevant to
LRU pages.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200818184704.3625199-2-yuzhao@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't initially add anon pages to active lruvec after commit
b518154e59 ("mm/vmscan: protect the workingset on anonymous LRU").
Remove activate_page() from unuse_pte(), which seems to be missed by the
commit. And make the function static while we are at it.
Before the commit, we called lru_cache_add_active_or_unevictable() to add
new ksm pages to active lruvec. Therefore, activate_page() wasn't
necessary for them in the first place.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SWP_FS is used to make swap_{read,write}page() go through the filesystem,
and it's only used for swap files over NFS for now. Otherwise it will
directly submit IO to blockdev according to swapfile extents reported by
filesystems in advance.
As Matthew pointed out [1], SWP_FS naming is somewhat confusing, so let's
rename to SWP_FS_OPS.
[1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As suggested by Dan Carpenter, fortify unpin_user_pages() just a bit,
against a typical caller mistake: check if the npages arg is really a
-ERRNO value, which would blow up the unpinning loop: WARN and return.
If this new WARN_ON() fires, then the system *might* be leaking pages (by
leaving them pinned), but probably not. More likely, gup/pup returned a
hard -ERRNO error to the caller, who erroneously passed it here.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Link: https://lkml.kernel.org/r/20200917065706.409079-1-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gup prohibits users from calling get_user_pages() with FOLL_PIN. But it
allows users to call get_user_pages() with FOLL_LONGTERM only. It seems
insensible.
Since FOLL_LONGTERM is a stricter case of FOLL_PIN, we should prohibit
users from calling get_user_pages() with FOLL_LONGTERM while not with
FOLL_PIN.
mm/gup_benchmark.c used to be the only user who did this improperly.
But it has been fixed by moving to use pin_user_pages().
[akpm@linux-foundation.org: fix CONFIG_MMU=n build]
Link: https://lkml.kernel.org/r/CA+G9fYuNS3k0DVT62twfV746pfNhCSrk5sVMcOcQ1PGGnEseyw@mail.gmail.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: http://lkml.kernel.org/r/20200819110100.23504-1-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to Documentation/core-api/pin_user_pages.rst, FOLL_PIN is a
prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
Almost all kernel modules are using pin_user_pages() with FOLL_LONGTERM,
mm/gup_benchmark.c seems to the only exception in which FOLL_PIN is not a
prerequisite to FOLL_LONGTERM.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200815122056.29508-1-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the beginning, mm/gup_benchmark.c supported get_user_pages_fast() only,
but right now, it supports the benchmarking of a couple of
get_user_pages() related calls like:
* get_user_pages_fast()
* get_user_pages()
* pin_user_pages_fast()
* pin_user_pages()
The documentation is confusing and needs update.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20200821032546.19992-1-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Our users reported that there're some random latency spikes when their RT
process is running. Finally we found that latency spike is caused by
FADV_DONTNEED. Which may call lru_add_drain_all() to drain LRU cache on
remote CPUs, and then waits the per-cpu work to complete. The wait time
is uncertain, which may be tens millisecond.
That behavior is unreasonable, because this process is bound to a specific
CPU and the file is only accessed by itself, IOW, there should be no
pagecache pages on a per-cpu pagevec of a remote CPU. That unreasonable
behavior is partially caused by the wrong comparation of the number of
invalidated pages and the number of the target. For example,
if (count < (end_index - start_index + 1))
The count above is how many pages were invalidated in the local CPU, and
(end_index - start_index + 1) is how many pages should be invalidated.
The usage of (end_index - start_index + 1) is incorrect, because they are
virtual addresses, which may not mapped to pages. Besides that, there may
be holes between start and end. So we'd better check whether there are
still pages on per-cpu pagevec after drain the local cpu, and then decide
whether or not to call lru_add_drain_all().
After I applied it with a hotfix to our production environment, most of
the lru_add_drain_all() can be avoided.
Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20200923133318.14373-1-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We dereference page->mapping and page->index directly after calling
find_subpage() and these fields are not valid for tail pages. While
commit 4101196b19 ("mm: page cache: store only head pages in i_pages")
introduced the call to find_subpage(), the problem existed prior to this;
I'm going to suggest all the way back to when THPs first existed.
The user-visible effects of this are almost negligible. To hit it, you
have to mmap a tmpfs file at an unaligned address and then it's only a
disabled optimisation causing page faults to happen more frequently than
they otherwise would.
Fix this by keeping both head and page pointers and checking the
appropriate one. We could use page_mapping() and page_to_index(), but
that's higher overhead.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: https://lkml.kernel.org/r/20200911012532.24761-1-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new FGP_HEAD flag which avoids calling find_subpage() and add a
convenience wrapper for it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: https://lkml.kernel.org/r/20200910183318.20139-9-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert shmem_getpage_gfp() (the only remaining caller of
find_lock_entry()) to cope with a head page being returned instead of
the subpage for the index.
[willy@infradead.org: fix BUG()s]
Link https://lore.kernel.org/linux-mm/20200912032042.GA6583@casper.infradead.org/
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: https://lkml.kernel.org/r/20200910183318.20139-8-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are only four callers remaining of find_get_entry().
get_shadow_from_swap_cache() only wants to see shadow entries and doesn't
care about which page is returned. Push the find_subpage() call into
find_lock_entry(), find_get_incore_page() and pagecache_get_page().
[willy@infradead.org: fix oops]
Link: https://lkml.kernel.org/r/20200914112738.GM6583@casper.infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: https://lkml.kernel.org/r/20200910183318.20139-7-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
i915 does not want to see value entries. Switch it to use
find_lock_page() instead, and remove the export of find_lock_entry().
Move find_lock_entry() and find_get_entry() to mm/internal.h to discourage
any future use.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: https://lkml.kernel.org/r/20200910183318.20139-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of calling find_get_entry() for every page index, use an XArray
iterator to skip over NULL entries, and avoid calling get_page(),
because we only want the swap entries.
[willy@infradead.org: fix LTP soft lockups]
Link: https://lkml.kernel.org/r/20200914165032.GS6583@casper.infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Qian Cai <cai@redhat.com>
Link: https://lkml.kernel.org/r/20200910183318.20139-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code does not protect against swapoff of the underlying
swap device, so this is a bug fix as well as a worthwhile reduction in
code complexity.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: https://lkml.kernel.org/r/20200910183318.20139-3-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Return head pages from find_*_entry", v2.
This patch series started out as part of the THP patch set, but it has
some nice effects along the way and it seems worth splitting it out and
submitting separately.
Currently find_get_entry() and find_lock_entry() return the page
corresponding to the requested index, but the first thing most callers do
is find the head page, which we just threw away. As part of auditing all
the callers, I found some misuses of the APIs and some plain
inefficiencies that I've fixed.
The diffstat is unflattering, but I added more kernel-doc and a new wrapper.
This patch (of 8);
Provide this functionality from the swap cache. It's useful for
more than just mincore().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Link: https://lkml.kernel.org/r/20200910183318.20139-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20200910183318.20139-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename head_pincount() --> head_compound_pincount(). These names are more
accurate (or less misleading) than the original ones.
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: https://lkml.kernel.org/r/20200807183358.105097-1-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__dump_page() checks i_dentry is fetchable and i_ino is earlier in the
struct than i_ino, so it ought to work fine, but it's possible that struct
randomisation has reordered i_ino after i_dentry and the pointer is just
wild enough that i_dentry is fetchable and i_ino isn't.
Also print the inode number if the dentry is invalid.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lkml.kernel.org/r/20200819185710.28180-1-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In support of device-dax growing the ability to front physically
dis-contiguous ranges of memory, update devm_memremap_pages() to track
multiple ranges with a single reference counter and devm instance.
Convert all [devm_]memremap_pages() users to specify the number of ranges
they are mapping in their 'struct dev_pagemap' instance.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Jérôme Glisse" <jglisse@redhat.co
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/159643103789.4062302.18426128170217903785.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106116293.30709.13350662794915396198.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'struct resource' in 'struct dev_pagemap' is only used for holding
resource span information. The other fields, 'name', 'flags', 'desc',
'parent', 'sibling', and 'child' are all unused wasted space.
This is in preparation for introducing a multi-range extension of
devm_memremap_pages().
The bulk of this change is unwinding all the places internal to libnvdimm
that used 'struct resource' unnecessarily, and replacing instances of
'struct dev_pagemap'.res with 'struct dev_pagemap'.range.
P2PDMA had a minor usage of the resource flags field, but only to report
failures with "%pR". That is replaced with an open coded print of the
range.
[dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()]
Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> [xen]
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation to set a fallback value for dev_dax->target_node, introduce
generic fallback helpers for phys_to_target_node()
A generic implementation based on node-data or memblock was proposed, but
as noted by Mike:
"Here again, I would prefer to add a weak default for
phys_to_target_node() because the "generic" implementation is not really
generic.
The fallback to reserved ranges is x86 specfic because on x86 most of
the reserved areas is not in memblock.memory. AFAIK, no other
architecture does this."
The info message in the generic memory_add_physaddr_to_nid()
implementation is fixed up to properly reflect that
memory_add_physaddr_to_nid() communicates "online" node info and
phys_to_target_node() indicates "target / to-be-onlined" node info.
[akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jia He <justin.he@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmemleak-test.c is just a kmemleak test module, which also can not be used
as a built-in kernel module. Thus, i think it may should not be in mm
dir, and move the kmemleak-test.c to samples/kmemleak/kmemleak-test.c.
Fix the spelling of built-in by the way.
Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rob Herring <robh@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Divya Indi <divya.indi@oracle.com>
Cc: Tomas Winkler <tomas.winkler@intel.com>
Cc: David Howells <dhowells@redhat.com>
Link: https://lkml.kernel.org/r/20200925183729.GA172837@rlk
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmemleak_scan() currently relies on the big tasklist_lock hammer to
stabilize iterating through the tasklist. Instead, this patch proposes
simply using rcu along with the rcu-safe for_each_process_thread flavor
(without changing scan semantics), which doesn't make use of
next_thread/p->thread_group and thus cannot race with exit. Furthermore,
any races with fork() and not seeing the new child should be benign as
it's not running yet and can also be detected by the next scan.
Avoiding the tasklist_lock could prove beneficial for performance
considering the scan operation is done periodically. I have seen
improvements of 30%-ish when doing similar replacements on very
pathological microbenchmarks (ie stressing get/setpriority(2)).
However my main motivation is that it's one less user of the global
lock, something that Linus has long time wanted to see gone eventually
(if ever) even if the traditional fairness issues has been dealt with
now with qrwlocks. Of course this is a very long ways ahead. This
patch also kills another user of the deprecated tsk->thread_group.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Qian Cai <cai@lca.pw>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20200820203902.11308-1-dave@stgolabs.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit below is incomplete, as it didn't handle the add_full() part.
commit a4d3f8916c ("slub: remove useless kmem_cache_debug() before
remove_full()")
This patch checks for SLAB_STORE_USER instead of kmem_cache_debug(), since
that should be the only context in which we need the list_lock for
add_full().
Signed-off-by: Abel Wu <wuyun.wu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Liu Xiang <liu.xiang6@zte.com.cn>
Link: https://lkml.kernel.org/r/20200811020240.1231-1-wuyun.wu@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ALLOC_SLOWPATH statistics is missing in bulk allocation now. Fix it
by doing statistics in alloc slow path.
Signed-off-by: Abel Wu <wuyun.wu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hewenliang <hewenliang4@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Link: http://lkml.kernel.org/r/20200811022427.1363-1-wuyun.wu@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The two conditions are mutually exclusive and gcc compiler will optimise
this into if-else-like pattern. Given that the majority of free_slowpath
is free_frozen, let's provide some hint to the compilers.
Tests (perf bench sched messaging -g 20 -l 400000, executed 10x
after reboot) are done and the summarized result:
un-patched patched
max. 192.316 189.851
min. 187.267 186.252
avg. 189.154 188.086
stdev. 1.37 0.99
Signed-off-by: Abel Wu <wuyun.wu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hewenliang <hewenliang4@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Link: http://lkml.kernel.org/r/20200813101812.1617-1-wuyun.wu@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The removed code was unnecessary and changed nothing in the flow, since in
case of returning NULL by 'kmem_cache_alloc_node' returning 'freelist'
from the function in question is the same as returning NULL.
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: https://lkml.kernel.org/r/20200915230329.13002-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
ZwfpDh2+Tg==
=LzyE
-----END PGP SIGNATURE-----
Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Series of merge handling cleanups (Baolin, Christoph)
- Series of blk-throttle fixes and cleanups (Baolin)
- Series cleaning up BDI, seperating the block device from the
backing_dev_info (Christoph)
- Removal of bdget() as a generic API (Christoph)
- Removal of blkdev_get() as a generic API (Christoph)
- Cleanup of is-partition checks (Christoph)
- Series reworking disk revalidation (Christoph)
- Series cleaning up bio flags (Christoph)
- bio crypt fixes (Eric)
- IO stats inflight tweak (Gabriel)
- blk-mq tags fixes (Hannes)
- Buffer invalidation fixes (Jan)
- Allow soft limits for zone append (Johannes)
- Shared tag set improvements (John, Kashyap)
- Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)
- DM no-wait support (Mike, Konstantin)
- Request allocation improvements (Ming)
- Allow md/dm/bcache to use IO stat helpers (Song)
- Series improving blk-iocost (Tejun)
- Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
Xianting, Yang, Yufen, yangerkun)
* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
block: fix uapi blkzoned.h comments
blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
blk-mq: get rid of the dead flush handle code path
block: get rid of unnecessary local variable
block: fix comment and add lockdep assert
blk-mq: use helper function to test hw stopped
block: use helper function to test queue register
block: remove redundant mq check
block: invoke blk_mq_exit_sched no matter whether have .exit_sched
percpu_ref: don't refer to ref->data if it isn't allocated
block: ratelimit handle_bad_sector() message
blk-throttle: Re-use the throtl_set_slice_end()
blk-throttle: Open code __throtl_de/enqueue_tg()
blk-throttle: Move service tree validation out of the throtl_rb_first()
blk-throttle: Move the list operation after list validation
blk-throttle: Fix IO hang for a corner case
blk-throttle: Avoid tracking latency if low limit is invalid
blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
block: Remove redundant 'return' statement
...
Pull compat iovec cleanups from Al Viro:
"Christoph's series around import_iovec() and compat variant thereof"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
security/keys: remove compat_keyctl_instantiate_key_iov
mm: remove compat_process_vm_{readv,writev}
fs: remove compat_sys_vmsplice
fs: remove the compat readv/writev syscalls
fs: remove various compat readv/writev helpers
iov_iter: transparently handle compat iovecs in import_iovec
iov_iter: refactor rw_copy_check_uvector and import_iovec
iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
compat.h: fix a spelling error in <linux/compat.h>
no conflicts at all. This pull includes:
- A reworked and expanded user-mode Linux document
- Some simplifications and improvements for submitting-patches.rst
- An emergency fix for (some) problems with Sphinx 3.x
- Some welcome automarkup improvements to automatically generate
cross-references to struct definitions and other documents
- The usual collection of translation updates, typo fixes, etc.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl+ErNYPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Y284H/3bv9fahbg16AJcKYqJXFHGpDs3CsASPnJqQ
9HQoV5tg6Qd4kI3oFb+30l8SK73Wr2t685/DhOPDRR/vN3B5M1vOQvPRL/dEqiwi
aUEhtMbnC/trSbteXsjGDWT+1EnI/+R3NFV++WiRp1XxE4DRXL3xySTeviR0IW+V
rQxU7VCcVp0bklVH+gqjrsvqU5iZeckyZB6evc8X92ThhzjNprR5KVxxgl1wxcu/
dPYizHoKYVoLVNw50rwPGt2hmq9RpyDM6Xh9UhLHcA57ENyzr8NNTJBOT0tVMTWV
smU01X/ECoy54kj1w8AKP+f7F0G7DUU+Jz68X0X/kYPq520dUs4=
=Ovox
-----END PGP SIGNATURE-----
Merge tag 'docs-5.10' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"As hoped, things calmed down for docs this cycle; fewer changes and
almost no conflicts at all. This includes:
- A reworked and expanded user-mode Linux document
- Some simplifications and improvements for submitting-patches.rst
- An emergency fix for (some) problems with Sphinx 3.x
- Some welcome automarkup improvements to automatically generate
cross-references to struct definitions and other documents
- The usual collection of translation updates, typo fixes, etc"
* tag 'docs-5.10' of git://git.lwn.net/linux: (81 commits)
gpiolib: Update indentation in driver.rst for code excerpts
Documentation/admin-guide: tainted-kernels: Fix typo occured
Documentation: better locations for sysfs-pci, sysfs-tagging
docs: programming-languages: refresh blurb on clang support
Documentation: kvm: fix a typo
Documentation: Chinese translation of Documentation/arm64/amu.rst
doc: zh_CN: index files in arm64 subdirectory
mailmap: add entry for <mstarovoitov@marvell.com>
doc: seq_file: clarify role of *pos in ->next()
docs: trace: ring-buffer-design.rst: use the new SPDX tag
Documentation: kernel-parameters: clarify "module." parameters
Fix references to nommu-mmap.rst
docs: rewrite admin-guide/sysctl/abi.rst
docs: fb: Remove vesafb scrollback boot option
docs: fb: Remove sstfb scrollback boot option
docs: fb: Remove matroxfb scrollback boot option
docs: fb: Remove framebuffer scrollback boot option
docs: replace the old User Mode Linux HowTo with a new one
Documentation/admin-guide: blockdev/ramdisk: remove use of "rdev"
Documentation/admin-guide: README & svga: remove use of "rdev"
...
- Add deadlock detection for recursive read-locks. The rationale is outlined
in:
224ec489d3: ("lockdep/Documention: Recursive read lock detection reasoning")
The main deadlock pattern we want to detect is:
TASK A: TASK B:
read_lock(X);
write_lock(X);
read_lock_2(X);
- Add "latch sequence counters" (seqcount_latch_t):
A sequence counter variant where the counter even/odd value is used to
switch between two copies of protected data. This allows the read path,
typically NMIs, to safely interrupt the write side critical section.
We utilize this new variant for sched-clock, and to make x86 TSC handling safer.
- Other seqlock cleanups, fixes and enhancements
- KCSAN updates
- LKMM updates
- Misc updates, cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+EX6QRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g3gxAAkg+Jy/tcdRxlxlEDOQPFy1mBqvFmulNA
pGFPkB6dzqmAWF/NfOZSl4g/h/mqGYsq2V+PfK5E8Sq8DQ/yCmnLhjgVOHNUUliv
x0WWfOysNgJdtdf69NLYJufIQhxhyI0dwFHHoHIsCdGdGqjh2DVevQFPFTBjdpOc
BUZYo+u3gCaCdB6A2nmlcWYbEw8eVEHgv3qLG6dq46J0KJOV0HfliqJoU3EZqH+s
977LvEIo+THfuYWMo/Jepwngbi0y36KeeukOAdwm9fK196htBHIUR+YPPrAe+FWD
z+UXP5IS5XIw9V1sGLmUaC2m+6gpdW19jKBtlzPkxHXmJmsgiZdLLeytEh3WYey7
nzfH+9Jd4NyyZKucLssYkOjf6P5BxGKCyJ9LXb7vlSthIhiDdFNx47oKtW4hxjOY
jubsI3BP5c3G1sIBIjTS53XmOhJg+Z52FxTpQ33JswXn1wGidcHZiuNHZuU5q28p
+tn8rGb2NGJFb4Sw/Vp0yTcqIpEXf+vweiQoaxm6tc9BWzcVzZntGnh0i3gFotx/
VgKafN4+pgXgo6bwHbN2WBK2FGyvcXFaptfaOMZL48En82hJ1DI6EnBEYN+vuERQ
JcCXg+iHeeVbxoou7q8NJxITkBmEL5xNBIugXRRqNSP3fXLxKjFuPYqT84/e7yZi
elGTReYcq6g=
=Iq51
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"These are the locking updates for v5.10:
- Add deadlock detection for recursive read-locks.
The rationale is outlined in commit 224ec489d3 ("lockdep/
Documention: Recursive read lock detection reasoning")
The main deadlock pattern we want to detect is:
TASK A: TASK B:
read_lock(X);
write_lock(X);
read_lock_2(X);
- Add "latch sequence counters" (seqcount_latch_t):
A sequence counter variant where the counter even/odd value is used
to switch between two copies of protected data. This allows the
read path, typically NMIs, to safely interrupt the write side
critical section.
We utilize this new variant for sched-clock, and to make x86 TSC
handling safer.
- Other seqlock cleanups, fixes and enhancements
- KCSAN updates
- LKMM updates
- Misc updates, cleanups and fixes"
* tag 'locking-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"
lockdep: Fix lockdep recursion
lockdep: Fix usage_traceoverflow
locking/atomics: Check atomic-arch-fallback.h too
locking/seqlock: Tweak DEFINE_SEQLOCK() kernel doc
lockdep: Optimize the memory usage of circular queue
seqlock: Unbreak lockdep
seqlock: PREEMPT_RT: Do not starve seqlock_t writers
seqlock: seqcount_LOCKNAME_t: Introduce PREEMPT_RT support
seqlock: seqcount_t: Implement all read APIs as statement expressions
seqlock: Use unique prefix for seqcount_t property accessors
seqlock: seqcount_LOCKNAME_t: Standardize naming convention
seqlock: seqcount latch APIs: Only allow seqcount_latch_t
rbtree_latch: Use seqcount_latch_t
x86/tsc: Use seqcount_latch_t
timekeeping: Use seqcount_latch_t
time/sched_clock: Use seqcount_latch_t
seqlock: Introduce seqcount_latch_t
mm/swap: Do not abuse the seqcount_t latching API
time/sched_clock: Use raw_read_seqcount_latch() during suspend
...
- Userspace support for the Memory Tagging Extension introduced by Armv8.5.
Kernel support (via KASAN) is likely to follow in 5.11.
- Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
switching.
- Fix and subsequent rewrite of our Spectre mitigations, including the
addition of support for PR_SPEC_DISABLE_NOEXEC.
- Support for the Armv8.3 Pointer Authentication enhancements.
- Support for ASID pinning, which is required when sharing page-tables with
the SMMU.
- MM updates, including treating flush_tlb_fix_spurious_fault() as a no-op.
- Perf/PMU driver updates, including addition of the ARM CMN PMU driver and
also support to handle CPU PMU IRQs as NMIs.
- Allow prefetchable PCI BARs to be exposed to userspace using normal
non-cacheable mappings.
- Implementation of ARCH_STACKWALK for unwinding.
- Improve reporting of unexpected kernel traps due to BPF JIT failure.
- Improve robustness of user-visible HWCAP strings and their corresponding
numerical constants.
- Removal of TEXT_OFFSET.
- Removal of some unused functions, parameters and prototypes.
- Removal of MPIDR-based topology detection in favour of firmware
description.
- Cleanups to handling of SVE and FPSIMD register state in preparation
for potential future optimisation of handling across syscalls.
- Cleanups to the SDEI driver in preparation for support in KVM.
- Miscellaneous cleanups and refactoring work.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl+AUXMQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNFc1B/4q2Kabe+pPu7s1f58Q+OTaEfqcr3F1qh27
F1YpFZUYxg0GPfPsFrnbJpo5WKo7wdR9ceI9yF/GHjs7A/MSoQJis3pG6SlAd9c0
nMU5tCwhg9wfq6asJtl0/IPWem6cqqhdzC6m808DjeHuyi2CCJTt0vFWH3OeHEhG
cfmLfaSNXOXa/MjEkT8y1AXJ/8IpIpzkJeCRA1G5s18PXV9Kl5bafIo9iqyfKPLP
0rJljBmoWbzuCSMc81HmGUQI4+8KRp6HHhyZC/k0WEVgj3LiumT7am02bdjZlTnK
BeNDKQsv2Jk8pXP2SlrI3hIUTz0bM6I567FzJEokepvTUzZ+CVBi
=9J8H
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's quite a lot of code here, but much of it is due to the
addition of a new PMU driver as well as some arm64-specific selftests
which is an area where we've traditionally been lagging a bit.
In terms of exciting features, this includes support for the Memory
Tagging Extension which narrowly missed 5.9, hopefully allowing
userspace to run with use-after-free detection in production on CPUs
that support it. Work is ongoing to integrate the feature with KASAN
for 5.11.
Another change that I'm excited about (assuming they get the hardware
right) is preparing the ASID allocator for sharing the CPU page-table
with the SMMU. Those changes will also come in via Joerg with the
IOMMU pull.
We do stray outside of our usual directories in a few places, mostly
due to core changes required by MTE. Although much of this has been
Acked, there were a couple of places where we unfortunately didn't get
any review feedback.
Other than that, we ran into a handful of minor conflicts in -next,
but nothing that should post any issues.
Summary:
- Userspace support for the Memory Tagging Extension introduced by
Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.
- Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
switching.
- Fix and subsequent rewrite of our Spectre mitigations, including
the addition of support for PR_SPEC_DISABLE_NOEXEC.
- Support for the Armv8.3 Pointer Authentication enhancements.
- Support for ASID pinning, which is required when sharing
page-tables with the SMMU.
- MM updates, including treating flush_tlb_fix_spurious_fault() as a
no-op.
- Perf/PMU driver updates, including addition of the ARM CMN PMU
driver and also support to handle CPU PMU IRQs as NMIs.
- Allow prefetchable PCI BARs to be exposed to userspace using normal
non-cacheable mappings.
- Implementation of ARCH_STACKWALK for unwinding.
- Improve reporting of unexpected kernel traps due to BPF JIT
failure.
- Improve robustness of user-visible HWCAP strings and their
corresponding numerical constants.
- Removal of TEXT_OFFSET.
- Removal of some unused functions, parameters and prototypes.
- Removal of MPIDR-based topology detection in favour of firmware
description.
- Cleanups to handling of SVE and FPSIMD register state in
preparation for potential future optimisation of handling across
syscalls.
- Cleanups to the SDEI driver in preparation for support in KVM.
- Miscellaneous cleanups and refactoring work"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
Revert "arm64: initialize per-cpu offsets earlier"
arm64: random: Remove no longer needed prototypes
arm64: initialize per-cpu offsets earlier
kselftest/arm64: Check mte tagged user address in kernel
kselftest/arm64: Verify KSM page merge for MTE pages
kselftest/arm64: Verify all different mmap MTE options
kselftest/arm64: Check forked child mte memory accessibility
kselftest/arm64: Verify mte tag inclusion via prctl
kselftest/arm64: Add utilities and a test to validate mte memory
perf: arm-cmn: Fix conversion specifiers for node type
perf: arm-cmn: Fix unsigned comparison to less than zero
arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
arm64: Get rid of arm64_ssbd_state
KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
KVM: arm64: Get rid of kvm_arm_have_ssbd()
KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
...
When memory is hotplug added or removed the min_free_kbytes should be
recalculated based on what is expected by khugepaged. Currently after
hotplug, min_free_kbytes will be set to a lower default and higher
default set when THP enabled is lost.
This change restores min_free_kbytes as expected for THP consumers.
[vijayb@linux.microsoft.com: v5]
Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com
Fixes: f000565adb ("thp: set recommended min free kbytes")
Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Allen Pais <apais@microsoft.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com
Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The syzbot reported the below general protection fault:
general protection fault, probably for non-canonical address
0xe00eeaee0000003b: 0000 [#1] PREEMPT SMP KASAN
KASAN: maybe wild-memory-access in range [0x00777770000001d8-0x00777770000001df]
CPU: 1 PID: 10488 Comm: syz-executor721 Not tainted 5.9.0-rc3-syzkaller #0
RIP: 0010:unlink_file_vma+0x57/0xb0 mm/mmap.c:164
Call Trace:
free_pgtables+0x1b3/0x2f0 mm/memory.c:415
exit_mmap+0x2c0/0x530 mm/mmap.c:3184
__mmput+0x122/0x470 kernel/fork.c:1076
mmput+0x53/0x60 kernel/fork.c:1097
exit_mm kernel/exit.c:483 [inline]
do_exit+0xa8b/0x29f0 kernel/exit.c:793
do_group_exit+0x125/0x310 kernel/exit.c:903
get_signal+0x428/0x1f00 kernel/signal.c:2757
arch_do_signal+0x82/0x2520 arch/x86/kernel/signal.c:811
exit_to_user_mode_loop kernel/entry/common.c:136 [inline]
exit_to_user_mode_prepare+0x1ae/0x200 kernel/entry/common.c:167
syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:242
entry_SYSCALL_64_after_hwframe+0x44/0xa9
It's because the ->mmap() callback can change vma->vm_file and fput the
original file. But the commit d70cec8983 ("mm: mmap: merge vma after
call_mmap() if possible") failed to catch this case and always fput()
the original file, hence add an extra fput().
[ Thanks Hillf for pointing this extra fput() out. ]
Fixes: d70cec8983 ("mm: mmap: merge vma after call_mmap() if possible")
Reported-by: syzbot+c5d5a51dcbb558ca0cb5@syzkaller.appspotmail.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christian König <ckoenig.leichtzumerken@gmail.com>
Cc: Hongxiang Lou <louhongxiang@huawei.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Link: https://lkml.kernel.org/r/20200916090733.31427-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There have been elusive reports of filemap_fault() hitting its
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
with CONFIG_READ_ONLY_THP_FOR_FS=y.
Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
without NUMA reuses the same huge page after collapse_file() failed
(whereas NUMA targets its allocation to the respective node each time).
And most of us were usually testing with CONFIG_NUMA=y kernels.
collapse_file(old start)
new_page = khugepaged_alloc_page(hpage)
__SetPageLocked(new_page)
new_page->index = start // hpage->index=old offset
new_page->mapping = mapping
xas_store(&xas, new_page)
filemap_fault
page = find_get_page(mapping, offset)
// if offset falls inside hpage then
// compound_head(page) == hpage
lock_page_maybe_drop_mmap()
__lock_page(page)
// collapse fails
xas_store(&xas, old page)
new_page->mapping = NULL
unlock_page(new_page)
collapse_file(new start)
new_page = khugepaged_alloc_page(hpage)
__SetPageLocked(new_page)
new_page->index = start // hpage->index=new offset
new_page->mapping = mapping // mapping becomes valid again
// since compound_head(page) == hpage
// page_to_pgoff(page) got changed
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
An initial patch replaced __SetPageLocked() by lock_page(), which did
fix the race which Suren illustrates above. But testing showed that it's
not good enough: if the racing task's __lock_page() gets delayed long
after its find_get_page(), then it may follow collapse_file(new start)'s
successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
check and retry (as is done for mapping), with similar relaxations in
find_lock_entry() and pagecache_get_page(): but it's not obvious what
else might get caught out; and khugepaged non-NUMA appears to be unique
in exposing a page to page cache, then revoking, without going through
a full cycle of freeing before reuse.
Instead, non-NUMA khugepaged_prealloc_page() release the old page
if anyone else has a reference to it (1% of cases when I tested).
Although never reported on huge tmpfs, I believe its find_lock_entry()
has been at similar risk; but huge tmpfs does not rely on khugepaged
for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.
Reported-by: Denis Lisov <dennis.lissov@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com>
Fixes: 87c460a0bd ("mm/khugepaged: collapse_shmem() without freezing new_page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v4.9+
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 70e806e4e6 ("mm: Do early cow for pinned pages during fork()
for ptes") we write-protected the PTE before doing the page pinning
check, in order to avoid a race with concurrent fast-GUP pinning (which
doesn't take the mm semaphore or the page table lock).
That trick doesn't actually work - it doesn't handle memory ordering
properly, and doing so would be prohibitively expensive.
It also isn't really needed. While we're moving in the direction of
allowing and supporting page pinning without marking the pinned area
with MADV_DONTFORK, the fact is that we've never really supported this
kind of odd "concurrent fork() and page pinning", and doing the
serialization on a pte level is just wrong.
We can add serialization with a per-mm sequence counter, so we know how
to solve that race properly, but we'll do that at a more appropriate
time. Right now this just removes the write protect games.
It also turns out that the write protect games actually break on Power,
as reported by Aneesh Kumar:
"Architecture like ppc64 expects set_pte_at to be not used for updating
a valid pte. This is further explained in commit 56eecdb912 ("mm:
Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"
and the code triggered a warning there:
WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
Call Trace:
copy_present_page mm/memory.c:857 [inline]
copy_present_pte mm/memory.c:899 [inline]
copy_pte_range mm/memory.c:1014 [inline]
copy_pmd_range mm/memory.c:1092 [inline]
copy_pud_range mm/memory.c:1127 [inline]
copy_p4d_range mm/memory.c:1150 [inline]
copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
dup_mmap kernel/fork.c:592 [inline]
dup_mm+0x77c/0xab0 kernel/fork.c:1355
copy_mm kernel/fork.c:1411 [inline]
copy_process+0x1f00/0x2740 kernel/fork.c:2070
_do_fork+0xc4/0x10b0 kernel/fork.c:2429
Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memalloc_nocma_{save/restore} APIs can be used to skip page allocation
on CMA area, but, there is a missing case and the page on CMA area could
be allocated even if APIs are used. This patch handles this case to fix
the potential issue.
For now, these APIs are used to prevent long-term pinning on the CMA
page. When the long-term pinning is requested on the CMA page, it is
migrated to the non-CMA page before pinning. This non-CMA page is
allocated by using memalloc_nocma_{save/restore} APIs. If APIs doesn't
work as intended, the CMA page is allocated and it is pinned for a long
time. This long-term pin for the CMA page causes cma_alloc() failure
and it could result in wrong behaviour on the device driver who uses the
cma_alloc().
Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
could have the pages on CMA area so we need to skip it if ALLOC_CMA
isn't specified.
Fixes: 8510e69c8e (mm/page_alloc: fix memalloc_nocma_{save/restore} APIs)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/1601429472-12599-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The routine that applies debug flags to the kmem_cache slabs
inadvertantly prevents non-debug flags from being applied to those
same objects. That is, if slub_debug=<flag>,<slab> is specified,
non-debugged slabs will end up having flags of zero, and the slabs
may be unusable.
Fix this by including the input flags for non-matching slabs with the
contents of slub_debug, so that the caches are created as expected
alongside any debugging options that may be requested. With this, we
can remove the check for a NULL slub_debug_string, since it's covered
by the loop itself.
Fixes: e17f1dfba3 ("mm, slub: extend slub_debug syntax for multiple blocks")
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: https://lkml.kernel.org/r/20200930161931.28575-1-farman@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that import_iovec handles compat iovecs, the native syscalls
can be used for the compat case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.
This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split rw_copy_check_uvector into two new helpers with more sensible
calling conventions:
- iovec_from_user copies a iovec from userspace either into the provided
stack buffer if it fits, or allocates a new buffer for it. Returns
the actually used iovec. It also verifies that iov_len does fit a
signed type, and handles compat iovecs if the compat flag is set.
- __import_iovec consolidates the native and compat versions of
import_iovec. It calls iovec_from_user, then validates each iovec
actually points to user addresses, and ensures the total length
doesn't overflow.
This has two major implications:
- the access_process_vm case loses the total lenght checking, which
wasn't required anyway, given that each call receives two iovecs
for the local and remote side of the operation, and it verifies
the total length on the local side already.
- instead of a single loop there now are two loops over the iovecs.
Given that the iovecs are cache hot this doesn't make a major
difference
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The async buffered reads feature is not working when readahead is
turned off. There are two things to concern:
- when doing retry in io_read, not only the IOCB_WAITQ flag but also
the IOCB_NOWAIT flag is still set, which makes it goes to would_block
phase in generic_file_buffered_read() and then return -EAGAIN. After
that, the io-wq thread work is queued, and later doing the async
reads in the old way.
- even if we remove IOCB_NOWAIT when doing retry, the feature is still
not running properly, since in generic_file_buffered_read() it goes to
lock_page_killable() after calling mapping->a_ops->readpage() to do
IO, and thus causing process to sleep.
Fixes: 1a0a7853b9 ("mm: support async buffered reads in generic_file_buffered_read()")
Fixes: 3b2a4439e0 ("io_uring: get rid of kiocb_wait_page_queue_init()")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It seems likely this block was pasted from internal_get_user_pages_fast,
which is not passed an mm struct and therefore uses current's. But
__get_user_pages_locked is passed an explicit mm, and current->mm is not
always valid. This was hit when being called from i915, which uses:
pin_user_pages_remote->
__get_user_pages_remote->
__gup_longterm_locked->
__get_user_pages_locked
Before, this would lead to an OOPS:
BUG: kernel NULL pointer dereference, address: 0000000000000064
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S U O 5.9.0-rc7+ #140
Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
RIP: 0010:__get_user_pages_remote+0xd7/0x310
Call Trace:
__i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
process_one_work+0x1ca/0x390
worker_thread+0x48/0x3c0
kthread+0x114/0x130
ret_from_fork+0x1f/0x30
CR2: 0000000000000064
This commit fixes the problem by using the mm pointer passed to the
function rather than the bogus one in current.
Fixes: 008cfe4418 ("mm: Introduce mm_struct.has_pinned")
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-by: Harald Arnesen <harald@skogtun.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pinned pages shouldn't be write-protected when fork() happens, because
follow up copy-on-write on these pages could cause the pinned pages to
be replaced by random newly allocated pages.
For huge PMDs, we split the huge pmd if pinning is detected. So that
future handling will be done by the PTE level (with our latest changes,
each of the small pages will be copied). We can achieve this by let
copy_huge_pmd() return -EAGAIN for pinned pages, so that we'll
fallthrough in copy_pmd_range() and finally land the next
copy_pte_range() call.
Huge PUDs will be even more special - so far it does not support
anonymous pages. But it can actually be done the same as the huge PMDs
even if the split huge PUDs means to erase the PUD entries. It'll
guarantee the follow up fault ins will remap the same pages in either
parent/child later.
This might not be the most efficient way, but it should be easy and
clean enough. It should be fine, since we're tackling with a very rare
case just to make sure userspaces that pinned some thps will still work
even without MADV_DONTFORK and after they fork()ed.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows copy_pte_range() to do early cow if the pages were pinned on
the source mm.
Currently we don't have an accurate way to know whether a page is pinned
or not. The only thing we have is page_maybe_dma_pinned(). However
that's good enough for now. Especially, with the newly added
mm->has_pinned flag to make sure we won't affect processes that never
pinned any pages.
It would be easier if we can do GFP_KERNEL allocation within
copy_one_pte(). Unluckily, we can't because we're with the page table
locks held for both the parent and child processes. So the page
allocation needs to be done outside copy_one_pte().
Some trick is there in copy_present_pte(), majorly the wrprotect trick
to block concurrent fast-gup. Comments in the function should explain
better in place.
Oleg Nesterov reported a (probably harmless) bug during review that we
didn't reset entry.val properly in copy_pte_range() so that potentially
there's chance to call add_swap_count_continuation() multiple times on
the same swp entry. However that should be harmless since even if it
happens, the same function (add_swap_count_continuation()) will return
directly noticing that there're enough space for the swp counter. So
instead of a standalone stable patch, it is touched up in this patch
directly.
Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This prepares for the future work to trigger early cow on pinned pages
during fork().
No functional change intended.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(Commit message majorly collected from Jason Gunthorpe)
Reduce the chance of false positive from page_maybe_dma_pinned() by
keeping track if the mm_struct has ever been used with pin_user_pages().
This allows cases that might drive up the page ref_count to avoid any
penalty from handling dma_pinned pages.
Future work is planned, to provide a more sophisticated solution, likely
to turn it into a real counter. For now, make it atomic_t but use it as
a boolean for simplicity.
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc fixes from Andrew Morton:
"9 patches.
Subsystems affected by this patch series: mm (thp, memcg, gup,
migration, memory-hotplug), lib, and x86"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: don't rely on system state to detect hot-plug operations
mm: replace memmap_context by meminit_context
arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
lib/memregion.c: include memregion.h
lib/string.c: implement stpcpy
mm/migrate: correct thp migration stats
mm/gup: fix gup_fast with dynamic page table folding
mm: memcontrol: fix missing suffix of workingset_restore
mm, THP, swap: fix allocating cluster for swapfile by mistake
In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during an hot-plug
operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state. In
addition, memory hot-plug operation can be triggered at this system
state by the ACPI [1]. So checking against the system state is not
enough.
The consequence is that on system with interleaved node's ranges like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]
This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done. At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections registered to
multiple nodes:
$ ls -l /sys/devices/system/memory/memory21/node*
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
In that case, the system is able to boot but if later one of theses
memory blocks is hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and this is triggering a BUG_ON():
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c
This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation. An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.
[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:
$QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
-m size=$MEM,slots=255,maxmem=4294967296k \
-numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
-object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
-object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
-object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
-object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
-object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
-object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
-object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
Fixes: 4fbce63391 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: fix memory to node bad links in sysfs", v3.
Sometimes, firmware may expose interleaved memory layout like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]
In that case, we can see memory blocks assigned to multiple nodes in
sysfs:
$ ls -l /sys/devices/system/memory/memory21
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
-rw-r--r-- 1 root root 65536 Aug 24 05:27 online
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
drwxr-xr-x 2 root root 0 Aug 24 05:27 power
-r--r--r-- 1 root root 65536 Aug 24 05:27 removable
-rw-r--r-- 1 root root 65536 Aug 24 05:27 state
lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
-rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
-r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
The same applies in the node's directory with a memory21 link in both
the node1 and node2's directory.
This is wrong but doesn't prevent the system to run. However when
later, one of these memory blocks is hot-unplugged and then hot-plugged,
the system is detecting an inconsistency in the sysfs layout and a
BUG_ON() is raised:
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c
This has been seen on PowerPC LPAR.
The root cause of this issue is that when node's memory is registered,
the range used can overlap another node's range, thus the memory block
is registered to multiple nodes in sysfs.
There are two issues here:
(a) The sysfs memory and node's layouts are broken due to these
multiple links
(b) The link errors in link_mem_sections() should not lead to a system
panic.
To address (a) register_mem_sect_under_node should not rely on the
system state to detect whether the link operation is triggered by a hot
plug operation or not. This is addressed by the patches 1 and 2 of this
series.
Issue (b) will be addressed separately.
This patch (of 2):
The memmap_context enum is used to detect whether a memory operation is
due to a hot-add operation or happening at boot time.
Make it general to the hotplug operation and rename it as
meminit_context.
There is no functional change introduced by this patch
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PageTransHuge returns true for both thp and hugetlb, so thp stats was
counting both thp and hugetlb migrations. Exclude hugetlb migration by
setting is_thp variable right.
Clean up thp handling code too when we are there.
Fixes: 1a5bae25e3 ("mm/vmstat: add events for THP migration without split")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently to make sure that every page table entry is read just once
gup_fast walks perform READ_ONCE and pass pXd value down to the next
gup_pXd_range function by value e.g.:
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
...
pudp = pud_offset(&p4d, addr);
This function passes a reference on that local value copy to pXd_offset,
and might get the very same pointer in return. This happens when the
level is folded (on most arches), and that pointer should not be
iterated.
On s390 due to the fact that each task might have different 5,4 or
3-level address translation and hence different levels folded the logic
is more complex and non-iteratable pointer to a local copy leads to
severe problems.
Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:
// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;
// pud_offset returns &p4d itself (a pointer to a value on stack)
pudp = pud_offset(&p4d, addr);
do {
// on second iteratation reading "random" stack value
pud_t pud = READ_ONCE(*pudp);
// next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
next = pud_addr_end(addr, end);
...
} while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
return 1;
}
This happens since s390 moved to common gup code with commit
d1874a0c28 ("s390/mm: make the pxd_offset functions more robust") and
commit 1a42010cdc ("s390/mm: convert to the generic
get_user_pages_fast code").
s390 tried to mimic static level folding by changing pXd_offset
primitives to always calculate top level page table offset in pgd_offset
and just return the value passed when pXd_offset has to act as folded.
What is crucial for gup_fast and what has been overlooked is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
And the latter is not possible with dynamic folding.
To fix the issue in addition to pXd values pass original pXdp pointers
down to gup_pXd_range functions. And introduce pXd_offset_lockless
helpers, which take an additional pXd entry value parameter. This has
already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
Fixes: 1a42010cdc ("s390/mm: convert to the generic get_user_pages_fast code")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: <stable@vger.kernel.org> [5.2+]
Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We forget to add the suffix to the workingset_restore string, so fix it.
And also update the documentation of cgroup-v2.rst.
Fixes: 170b04b7ae ("mm/workingset: prepare the workingset detection infrastructure for anon LRU")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20200916100030.71698-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SWP_FS is used to make swap_{read,write}page() go through the
filesystem, and it's only used for swap files over NFS. So, !SWP_FS
means non NFS for now, it could be either file backed or device backed.
Something similar goes with legacy SWP_FILE.
So in order to achieve the goal of the original patch, SWP_BLKDEV should
be used instead.
FS corruption can be observed with SSD device + XFS + fragmented
swapfile due to CONFIG_THP_SWAP=y.
I reproduced the issue with the following details:
Environment:
QEMU + upstream kernel + buildroot + NVMe (2 GB)
Kernel config:
CONFIG_BLK_DEV_NVME=y
CONFIG_THP_SWAP=y
Some reproducible steps:
mkfs.xfs -f /dev/nvme0n1
mkdir /tmp/mnt
mount /dev/nvme0n1 /tmp/mnt
bs="32k"
sz="1024m" # doesn't matter too much, I also tried 16m
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
mkswap /tmp/mnt/sw
swapon /tmp/mnt/sw
stress --vm 2 --vm-bytes 600M # doesn't matter too much as well
Symptoms:
- FS corruption (e.g. checksum failure)
- memory corruption at: 0xd2808010
- segfault
Fixes: f0eea189e8 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
Fixes: 38d8b4e6bd ("mm, THP, swap: delay splitting THP during swap out")
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Eric Sandeen <esandeen@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the commit 10befea91b ("mm: memcg/slab: use a single set of
kmem_caches for all allocations"), it becomes possible to call kfree()
from the slabs_destroy().
The functions cache_flusharray() and do_drain() calls slabs_destroy() on
array_cache of the local CPU without updating the size of the
array_cache. This enables the kfree() call from the slabs_destroy() to
recursively call cache_flusharray() which can potentially call
free_block() on the same elements of the array_cache of the local CPU
and causing double free and memory corruption.
To fix the issue, simply update the local CPU array_cache cache before
calling slabs_destroy().
Fixes: 10befea91b ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ted Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities. Also remove the pointless wrappers
to just check the flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
make the checks more obvious. Also remove the pointless
bdi_cap_account_writeback wrapper that just obsfucates the check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it. This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.
One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore. It is replaced with a queue attribute which
also is writable for easier testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no point in trying to call bdev_read_page if SWP_SYNCHRONOUS_IO
is not set, as the device won't support it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
decided if ->rw_page can be used on a block device. Just check up for
the method instead. The only complication is that zram needs a second
set of block_device_operations as it can switch between modes that
actually support ->rw_page and those who don't.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set up a readahead size by default, as very few users have a good
reason to change it. This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nommu-mmap.rst was moved to Documentation/admin-guide/mm; this patch
updates the remaining stale references to Documentation/mm.
Fixes: 800c02f5d0 ("docs: move nommu-mmap.txt to admin-guide and rename to ReST")
Signed-off-by: Stephen Kitt <steve@sk2.org>
Link: https://lore.kernel.org/r/20200812092230.27541-1-steve@sk2.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Commit 09854ba94c ("mm: do_wp_page() simplification") reorganized all
the code around the page re-use vs copy, but in the process also moved
the final unlock_page() around to after the wp_page_reuse() call.
That normally doesn't matter - but it means that the unlock_page() is
now done after releasing the page table lock. Again, not a big deal,
you'd think.
But it turns out that it's very wrong indeed, because once we've
released the page table lock, we've basically lost our only reference to
the page - the page tables - and it could now be free'd at any time. We
do hold the mmap_sem, so no actual unmap() can happen, but madvise can
come in and a MADV_DONTNEED will zap the page range - and free the page.
So now the page may be free'd just as we're unlocking it, which in turn
will usually trigger a "Bad page state" error in the freeing path. To
make matters more confusing, by the time the debug code prints out the
page state, the unlock has typically completed and everything looks fine
again.
This all doesn't happen in any normal situations, but it does trigger
with the dirtyc0w_child LTP test. And it seems to trigger much more
easily (but not expclusively) on s390 than elsewhere, probably because
s390 doesn't do the "batch pages up for freeing after the TLB flush"
that gives the unlock_page() more time to complete and makes the race
harder to hit.
Fixes: 09854ba94c ("mm: do_wp_page() simplification")
Link: https://lore.kernel.org/lkml/a46e9bbef2ed4e17778f5615e818526ef848d791.camel@redhat.com/
Link: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/
Reported-by: Qian Cai <cai@redhat.com>
Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
Bisected-and-analyzed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This completes the split of the non-present and present pte cases by
moving the check for the source pte being present into the single
caller, which also means that we clearly separate out the very different
return value case for a non-present pte.
The present pte case currently always succeeds.
This is a pure code re-organization with no semantic change: the intent
is to make it much easier to add a new return case to the present pte
case for when we do early COW at page table copy time.
This was split out from the previous commit simply to make it easy to
visually see that there were no semantic changes from this code
re-organization.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a purely mechanical split of the copy_one_pte() function. It's
not immediately obvious when looking at the diff because of the
indentation change, but the way to see what is going on in this commit
is to use the "-w" flag to not show pure whitespace changes, and you see
how the first part of copy_one_pte() is simply lifted out into a
separate function.
And since the non-present case is marked unlikely, don't make the new
function be inlined. Not that gcc really seems to care, since it looks
like it will inline it anyway due to the whole "single callsite for
static function" logic. In fact, code generation with the function
split is almost identical to before. But not marking it inline is the
right thing to do.
This is pure prep-work and cleanup for subsequent changes.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swap_type_of is used for two entirely different purposes:
(1) check what swap type a given device/offset corresponds to
(2) find the first available swap device that can be written to
Mixing both in a single function creates an unreadable mess. Create two
separate functions instead, and switch both to pass a dev_t instead of
a struct block_device to further simplify the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a race during page offline that can lead to infinite loop:
a page never ends up on a buddy list and __offline_pages() keeps
retrying infinitely or until a termination signal is received.
Thread#1 - a new process:
load_elf_binary
begin_new_exec
exec_mmap
mmput
exit_mmap
tlb_finish_mmu
tlb_flush_mmu
release_pages
free_unref_page_list
free_unref_page_prepare
set_pcppage_migratetype(page, migratetype);
// Set page->index migration type below MIGRATE_PCPTYPES
Thread#2 - hot-removes memory
__offline_pages
start_isolate_page_range
set_migratetype_isolate
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
Set migration type to MIGRATE_ISOLATE-> set
drain_all_pages(zone);
// drain per-cpu page lists to buddy allocator.
Thread#1 - continue
free_unref_page_commit
migratetype = get_pcppage_migratetype(page);
// get old migration type
list_add(&page->lru, &pcp->lists[migratetype]);
// add new page to already drained pcp list
Thread#2
Never drains pcp again, and therefore gets stuck in the loop.
The fix is to try to drain per-cpu lists again after
check_pages_isolated_cb() fails.
Fixes: c52e75935f ("mm: remove extra drain pages on pcp list")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A migrating transparent huge page has to already be unmapped. Otherwise,
the page could be modified while it is being copied to a new page and data
could be lost. The function __split_huge_pmd() checks for a PMD migration
entry before calling __split_huge_pmd_locked() leading one to think that
__split_huge_pmd_locked() can handle splitting a migrating PMD.
However, the code always increments the page->_mapcount and adjusts the
memory control group accounting assuming the page is mapped.
Also, if the PMD entry is a migration PMD entry, the call to
is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
of migration_entry_to_pfn(pmd_to_swp_entry(pmd)). Fix these problems by
checking for a PMD migration entry.
Fixes: 84c3fc4e9c ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org> [4.14+]
Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e809d5f0b5 ("tmpfs: per-superblock i_ino support") made changes
to shmem_reserve_inode() in mm/shmem.c, however the original test for
(sbinfo->max_inodes) got dropped. This causes mounting tmpfs with option
nr_inodes=0 to fail:
# mount -ttmpfs -onr_inodes=0 none /ext0
mount: /ext0: mount(2) system call failed: Cannot allocate memory.
This patch restores the nr_inodes=0 functionality.
Fixes: e809d5f0b5 ("tmpfs: per-superblock i_ino support")
Signed-off-by: Byron Stanoszek <gandalf@winds.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Chris Down <chris@chrisdown.name>
Link: https://lkml.kernel.org/r/20200902035715.16414-1-gandalf@winds.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5.8 commit 5d91f31faf ("mm: swap: fix vmstats for huge page") has
established that vm_events should count every subpage of a THP, including
unevictable_pgs_culled and unevictable_pgs_rescued; but
lru_cache_add_inactive_or_unevictable() was not doing so for
unevictable_pgs_mlocked, and mm/mlock.c was not doing so for
unevictable_pgs mlocked, munlocked, cleared and stranded.
Fix them; but THPs don't go the pagevec way in mlock.c, so no fixes needed
on that path.
Fixes: 5d91f31faf ("mm: swap: fix vmstats for huge page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301408230.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
check_move_unevictable_pages() is used in making unevictable shmem pages
evictable: by shmem_unlock_mapping(), drm_gem_check_release_pagevec() and
i915/gem check_release_pagevec(). Those may pass down subpages of a huge
page, when /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force".
That does not crash or warn at present, but the accounting of vmstats
unevictable_pgs_scanned and unevictable_pgs_rescued is inconsistent:
scanned being incremented on each subpage, rescued only on the head (since
tails already appear evictable once the head has been updated).
5.8 commit 5d91f31faf ("mm: swap: fix vmstats for huge page") has
established that vm_events in general (and unevictable_pgs_rescued in
particular) should count every subpage: so follow that precedent here.
Do this in such a way that if mem_cgroup_page_lruvec() is made stricter
(to check page->mem_cgroup is always set), no problem: skip the tails
before calling it, and add thp_nr_pages() to vmstats on the head.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301405000.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlbfs pages do not participate in memcg: so although they do find most
of migrate_page_states() useful, it would be better if they did not call
into mem_cgroup_migrate() - where Qian Cai reported that LTP's
move_pages12 triggers the warning in Alex Shi's prospective commit
"mm/memcg: warning on !memcg after readahead page charged".
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxch.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301359460.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: fixes to past from future testing".
Here's a set of independent fixes against 5.9-rc2: prompted by
testing Alex Shi's "warning on !memcg" and lru_lock series, but
I think fit for 5.9 - though maybe only the first for stable.
This patch (of 5):
In 5.8 some instances of memcg charging in do_swap_page() and unuse_pte()
were removed, on the understanding that swap cache is now already charged
at those points; but a case was missed, when ksm_might_need_to_copy() has
decided it must allocate a substitute page: such pages were never charged.
Fix it inside ksm_might_need_to_copy().
This was discovered by Alex Shi's prospective commit "mm/memcg: warning on
!memcg after readahead page charged".
But there is a another surprise: this also fixes some rarer uncharged
PageAnon cases, when KSM is configured in, but has never been activated.
ksm_might_need_to_copy()'s anon_vma->root and linear_page_index() check
sometimes catches a case which would need to have been copied if KSM were
turned on. Or that's my optimistic interpretation (of my own old code),
but it leaves some doubt as to whether everything is working as intended
there - might it hint at rare anon ptes which rmap cannot find? A
question not easily answered: put in the fix for missed memcg charges.
Cc; Matthew Wilcox <willy@infradead.org>
Fixes: 4c6355b25e ("mm: memcontrol: charge swapin pages on instantiation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org> [5.8]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301343270.5954@eggly.anvils
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301358020.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull percpu fix from Dennis Zhou:
"This is a fix for the first chunk size calculation where the variable
length array incorrectly used the number of longs instead of bytes of
longs.
This came in as a code fix and not a bug report, so I don't think it
was widely problematic. I believe it worked out due to it being
memblock memory and alignment requirements working in our favor"
* 'for-5.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: fix first chunk size calculation for populated bitmap
Variable populated, which is a member of struct pcpu_chunk, is used as a
unit of size of unsigned long.
However, size of populated is miscounted. So, I fix this minor part.
Fixes: 8ab16c43ea ("percpu: change the number of pages marked in the first_chunk pop bitmap")
Cc: <stable@vger.kernel.org> # 4.14+
Signed-off-by: Sunghyun Jin <mcsmonk@gmail.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Commit 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit eef1a429f2 ("mm/swap.c: piggyback lru_add_drain_all() calls")
implemented an optimization mechanism to exit the to-be-started LRU
drain operation (name it A) if another drain operation *started and
finished* while (A) was blocked on the LRU draining mutex.
This was done through a seqcount_t latch, which is an abuse of its
semantics:
1. seqcount_t latching should be used for the purpose of switching
between two storage places with sequence protection to allow
interruptible, preemptible, writer sections. The referenced
optimization mechanism has absolutely nothing to do with that.
2. The used raw_write_seqcount_latch() has two SMP write memory
barriers to insure one consistent storage place out of the two
storage places available. A full memory barrier is required
instead: to guarantee that the pagevec counter stores visible by
local CPU are visible to other CPUs -- before loading the current
drain generation.
Beside the seqcount_t API abuse, the semantics of a latch sequence
counter was force-fitted into the referenced optimization. What was
meant is to track "generations" of LRU draining operations, where
"global lru draining generation = x" implies that all generations
0 < n <= x are already *scheduled* for draining -- thus nothing needs
to be done if the current generation number n <= x.
Remove the conceptually-inappropriate seqcount_t latch usage. Manually
implement the referenced optimization using a counter and SMP memory
barriers.
Note, while at it, use the non-atomic variant of cpumask_set_cpu(),
__cpumask_set_cpu(), due to the already existing mutex protection.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87y2pg9erj.fsf@vostro.fn.ogness.net
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCX1Rn1wAKCRCAXGG7T9hj
vlEjAQC/KGC3wYw5TweWcY48xVzgvued3JLAQ6pcDlOe6osd6AEAzZcZKgL948cx
oY0T98dxb/U+lUhbIzhpBr/30g8JbAQ=
=Xcxp
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"A small series for fixing a problem with Xen PVH guests when running
as backends (e.g. as dom0).
Mapping other guests' memory is now working via ZONE_DEVICE, thus not
requiring to abuse the memory hotplug functionality for that purpose"
* tag 'for-linus-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: add helpers to allocate unpopulated memory
memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC
xen/balloon: add header guard
Merge misc fixes from Andrew Morton:
"19 patches.
Subsystems affected by this patch series: MAINTAINERS, ipc, fork,
checkpatch, lib, and mm (memcg, slub, pagemap, madvise, migration,
hugetlb)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
include/linux/log2.h: add missing () around n in roundup_pow_of_two()
mm/khugepaged.c: fix khugepaged's request size in collapse_file
mm/hugetlb: fix a race between hugetlb sysctl handlers
mm/hugetlb: try preferred node first when alloc gigantic page from cma
mm/migrate: preserve soft dirty in remove_migration_pte()
mm/migrate: remove unnecessary is_zone_device_page() check
mm/rmap: fixup copying of soft dirty and uffd ptes
mm/migrate: fixup setting UFFD_WP flag
mm: madvise: fix vma user-after-free
checkpatch: fix the usage of capture group ( ... )
fork: adjust sysctl_max_threads definition to match prototype
ipc: adjust proc_ipc_sem_dointvec definition to match prototype
mm: track page table modifications in __apply_to_page_range()
MAINTAINERS: IA64: mark Status as Odd Fixes only
MAINTAINERS: add LLVM maintainers
MAINTAINERS: update Cavium/Marvell entries
mm: slub: fix conversion of freelist_corrupted()
mm: memcg: fix memcg reclaim soft lockup
memcg: fix use-after-free in uncharge_batch
collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to
be read to page_cache_sync_readahead(). The intent was probably to read
a single page. Fix it to use the number of pages to the end of the
window instead.
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Eric Biggers <ebiggers@google.com>
Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race between the assignment of `table->data` and write value
to the pointer of `table->data` in the __do_proc_doulongvec_minmax() on
the other thread.
CPU0: CPU1:
proc_sys_write
hugetlb_sysctl_handler proc_sys_call_handler
hugetlb_sysctl_handler_common hugetlb_sysctl_handler
table->data = &tmp; hugetlb_sysctl_handler_common
table->data = &tmp;
proc_doulongvec_minmax
do_proc_doulongvec_minmax sysctl_head_finish
__do_proc_doulongvec_minmax unuse_table
i = table->data;
*i = val; // corrupt CPU1's stack
Fix this by duplicating the `table`, and only update the duplicate of
it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
simplify the code.
The following oops was seen:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
Code: Bad RIP value.
...
Call Trace:
? set_max_huge_pages+0x3da/0x4f0
? alloc_pool_huge_page+0x150/0x150
? proc_doulongvec_minmax+0x46/0x60
? hugetlb_sysctl_handler_common+0x1c7/0x200
? nr_hugepages_store+0x20/0x20
? copy_fd_bitmaps+0x170/0x170
? hugetlb_sysctl_handler+0x1e/0x20
? proc_sys_call_handler+0x2f1/0x300
? unregister_sysctl_table+0xb0/0xb0
? __fd_install+0x78/0x100
? proc_sys_write+0x14/0x20
? __vfs_write+0x4d/0x90
? vfs_write+0xef/0x240
? ksys_write+0xc0/0x160
? __ia32_sys_read+0x50/0x50
? __close_fd+0x129/0x150
? __x64_sys_write+0x43/0x50
? do_syscall_64+0x6c/0x200
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: e5ff215941 ("hugetlb: multiple hstates for multiple page sizes")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit cf11e85fc0 ("mm: hugetlb: optionally allocate gigantic
hugepages using cma"), the gigantic page would be allocated from node
which is not the preferred node, although there are pages available from
that node. The reason is that the nid parameter has been ignored in
alloc_gigantic_page().
Besides, the __GFP_THISNODE also need be checked if user required to
alloc only from the preferred node.
After this patch, the preferred node is tried first before other allowed
nodes, and don't try to allocate from other nodes if __GFP_THISNODE is
specified. If user don't specify the preferred node, the current node
will be used as preferred node, which makes sure consistent behavior of
allocating gigantic and non-gigantic hugetlb page.
Fixes: cf11e85fc0 ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Link: https://lkml.kernel.org/r/20200902025016.697260-1-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code to remove a migration PTE and replace it with a device private
PTE was not copying the soft dirty bit from the migration entry. This
could lead to page contents not being marked dirty when faulting the page
back from device private memory.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/migrate: preserve soft dirty in remove_migration_pte()".
I happened to notice this from code inspection after seeing Alistair
Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes").
This patch (of 2):
The check for is_zone_device_page() and is_device_private_page() is
unnecessary since the latter is sufficient to determine if the page is a
device private page. Simplify the code for easier reading.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Link: https://lkml.kernel.org/r/20200831212222.22409-1-rcampbell@nvidia.com
Link: https://lkml.kernel.org/r/20200831212222.22409-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During memory migration a pte is temporarily replaced with a migration
swap pte. Some pte bits from the existing mapping such as the soft-dirty
and uffd write-protect bits are preserved by copying these to the
temporary migration swap pte.
However these bits are not stored at the same location for swap and
non-swap ptes. Therefore testing these bits requires using the
appropriate helper function for the given pte type.
Unfortunately several code locations were found where the wrong helper
function is being used to test soft_dirty and uffd_wp bits which leads to
them getting incorrectly set or cleared during page-migration.
Fix these by using the correct tests based on pte type.
Fixes: a5430dda8a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Fixes: 8c3328f1f3 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
introduced support for tracking the uffd wp bit during page migration.
However the non-swap PTE variant was used to set the flag for zone device
private pages which are a type of swap page.
This leads to corruption of the swap offset if the original PTE has the
uffd_wp flag set.
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: https://lkml.kernel.org/r/20200825064232.10023-1-alistair@popple.id.au
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 52f2347808 ("mm/slub.c: fix corrupted freechain in
deactivate_slab()") suffered an update when picked up from LKML [1].
Specifically, relocating 'freelist = NULL' into 'freelist_corrupted()'
created a no-op statement. Fix it by sticking to the behavior intended
in the original patch [1]. In addition, make freelist_corrupted()
immune to passing NULL instead of &freelist.
The issue has been spotted via static analysis and code review.
[1] https://lore.kernel.org/linux-mm/20200331031450.12182-1-dongli.zhang@oracle.com/
Fixes: 52f2347808 ("mm/slub.c: fix corrupted freechain in deactivate_slab()")
Signed-off-by: Eugeniu Rosca <erosca@de.adit-jv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200824130643.10291-1-erosca@de.adit-jv.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg
doesn't have any reclaimable memory.
It can be easily reproduced as below:
watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
Call Trace:
shrink_lruvec+0x49f/0x640
shrink_node+0x2a6/0x6f0
do_try_to_free_pages+0xe9/0x3e0
try_to_free_mem_cgroup_pages+0xef/0x1f0
try_charge+0x2c1/0x750
mem_cgroup_charge+0xd7/0x240
__add_to_page_cache_locked+0x2fd/0x370
add_to_page_cache_lru+0x4a/0xc0
pagecache_get_page+0x10b/0x2f0
filemap_fault+0x661/0xad0
ext4_filemap_fault+0x2c/0x40
__do_fault+0x4d/0xf9
handle_mm_fault+0x1080/0x1790
It only happens on our 1-vcpu instances, because there's no chance for
oom reaper to run to reclaim the to-be-killed process.
Add a cond_resched() at the upper shrink_node_memcgs() to solve this
issue, this will mean that we will get a scheduling point for each memcg
in the reclaimed hierarchy without any dependency on the reclaimable
memory in that memcg thus making it more predictable.
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlpang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot has reported an use-after-free in the uncharge_batch path
BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
BUG: KASAN: use-after-free in atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
BUG: KASAN: use-after-free in page_counter_cancel mm/page_counter.c:54 [inline]
BUG: KASAN: use-after-free in page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
Write of size 8 at addr ffff8880371c0148 by task syz-executor.0/9304
CPU: 0 PID: 9304 Comm: syz-executor.0 Not tainted 5.8.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1f0/0x31e lib/dump_stack.c:118
print_address_description+0x66/0x620 mm/kasan/report.c:383
__kasan_report mm/kasan/report.c:513 [inline]
kasan_report+0x132/0x1d0 mm/kasan/report.c:530
check_memory_region_inline mm/kasan/generic.c:183 [inline]
check_memory_region+0x2b5/0x2f0 mm/kasan/generic.c:192
instrument_atomic_write include/linux/instrumented.h:71 [inline]
atomic64_sub_return include/asm-generic/atomic-instrumented.h:970 [inline]
atomic_long_sub_return include/asm-generic/atomic-long.h:113 [inline]
page_counter_cancel mm/page_counter.c:54 [inline]
page_counter_uncharge+0x3d/0xc0 mm/page_counter.c:155
uncharge_batch+0x6c/0x350 mm/memcontrol.c:6764
uncharge_page+0x115/0x430 mm/memcontrol.c:6796
uncharge_list mm/memcontrol.c:6835 [inline]
mem_cgroup_uncharge_list+0x70/0xe0 mm/memcontrol.c:6877
release_pages+0x13a2/0x1550 mm/swap.c:911
tlb_batch_pages_flush mm/mmu_gather.c:49 [inline]
tlb_flush_mmu_free mm/mmu_gather.c:242 [inline]
tlb_flush_mmu+0x780/0x910 mm/mmu_gather.c:249
tlb_finish_mmu+0xcb/0x200 mm/mmu_gather.c:328
exit_mmap+0x296/0x550 mm/mmap.c:3185
__mmput+0x113/0x370 kernel/fork.c:1076
exit_mm+0x4cd/0x550 kernel/exit.c:483
do_exit+0x576/0x1f20 kernel/exit.c:793
do_group_exit+0x161/0x2d0 kernel/exit.c:903
get_signal+0x139b/0x1d30 kernel/signal.c:2743
arch_do_signal+0x33/0x610 arch/x86/kernel/signal.c:811
exit_to_user_mode_loop kernel/entry/common.c:135 [inline]
exit_to_user_mode_prepare+0x8d/0x1b0 kernel/entry/common.c:166
syscall_exit_to_user_mode+0x5e/0x1a0 kernel/entry/common.c:241
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Commit 1a3e1f4096 ("mm: memcontrol: decouple reference counting from
page accounting") reworked the memcg lifetime to be bound the the struct
page rather than charges. It also removed the css_put_many from
uncharge_batch and that is causing the above splat.
uncharge_batch() is supposed to uncharge accumulated charges for all
pages freed from the same memcg. The queuing is done by uncharge_page
which however drops the memcg reference after it adds charges to the
batch. If the current page happens to be the last one holding the
reference for its memcg then the memcg is OK to go and the next page to
be freed will trigger batched uncharge which needs to access the memcg
which is gone already.
Fix the issue by taking a reference for the memcg in the current batch.
Fixes: 1a3e1f4096 ("mm: memcontrol: decouple reference counting from page accounting")
Reported-by: syzbot+b305848212deec86eabe@syzkaller.appspotmail.com
Reported-by: syzbot+b5ea6fb6f139c8b9482b@syzkaller.appspotmail.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Hugh Dickins <hughd@google.com>
Link: https://lkml.kernel.org/r/20200820090341.GC5033@dhcp22.suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge emailed patches from Peter Xu:
"This is a small series that I picked up from Linus's suggestion to
simplify cow handling (and also make it more strict) by checking
against page refcounts rather than mapcounts.
This makes uffd-wp work again (verified by running upmapsort)"
Note: this is horrendously bad timing, and making this kind of
fundamental vm change after -rc3 is not at all how things should work.
The saving grace is that it really is a a nice simplification:
8 files changed, 29 insertions(+), 120 deletions(-)
The reason for the bad timing is that it turns out that commit
17839856fd ("gup: document and work around 'COW can break either way'
issue" broke not just UFFD functionality (as Peter noticed), but Mikulas
Patocka also reports that it caused issues for strace when running in a
DAX environment with ext4 on a persistent memory setup.
And we can't just revert that commit without re-introducing the original
issue that is a potential security hole, so making COW stricter (and in
the process much simpler) is a step to then undoing the forced COW that
broke other uses.
Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/
* emailed patches from Peter Xu <peterx@redhat.com>:
mm: Add PGREUSE counter
mm/gup: Remove enfornced COW mechanism
mm/ksm: Remove reuse_ksm_page()
mm: do_wp_page() simplification
This accounts for wp_page_reuse() case, where we reused a page for COW.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the more strict (but greatly simplified) page reuse logic in
do_wp_page(), we can safely go back to the world where cow is not
enforced with writes.
This essentially reverts commit 17839856fd ("gup: document and work
around 'COW can break either way' issue"). There are some context
differences due to some changes later on around it:
2170ecfa76 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
376a34efa4 ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)
Some lines moved back and forth with those, but this revert patch should
have striped out and covered all the enforced cow bits anyways.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the function as the last reference has gone away with the do_wp_page()
changes.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>