mainlining shenanigans
Go to file
Hugh Dickins 9a1ea439b1 mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.

And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page.  With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).

This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.

David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2

Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db9165 ("sched/wait: Break up long wake
list walk") 11a19c7b09 ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")

Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2

We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.

But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it.  That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().

Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.

__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced.  Should we perhaps disable preemption across this?

shrink_page_list()'s __ClearPageLocked(): that was a surprise!  This
survived a lot of testing before that showed up.  PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used.  Follow the suggestion from
Michal Hocko, just revert a978d6f521 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.

It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held?  Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.

[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:48 -08:00
arch mm: make free_reserved_area() return "const char *" 2018-12-28 12:11:48 -08:00
block block: Fix null_blk_zoned creation failure with small number of zones 2018-12-11 16:19:38 -07:00
certs export.h: remove VMLINUX_SYMBOL() and VMLINUX_SYMBOL_STR() 2018-08-22 23:21:44 +09:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2018-12-27 13:53:32 -08:00
Documentation mm: reclaim small amounts of memory when an external fragmentation event occurs 2018-12-28 12:11:48 -08:00
drivers mm/memory_hotplug: drop "online" parameter from add_memory_resource() 2018-12-28 12:11:48 -08:00
firmware kbuild: remove all dummy assignments to obj- 2017-11-18 11:46:06 +09:00
fs userfaultfd: convert userfaultfd_ctx::refcount to refcount_t 2018-12-28 12:11:47 -08:00
include mm: put_and_wait_on_page_locked() while page is migrated 2018-12-28 12:11:48 -08:00
init debugobjects: call debug_objects_mem_init eariler 2018-12-28 12:11:45 -08:00
ipc ipc: IPCMNI limit check for semmni 2018-10-31 08:54:14 -07:00
kernel mm, oom: reorganize the oom report in dump_header 2018-12-28 12:11:48 -08:00
lib mm: convert zone->managed_pages to atomic variable 2018-12-28 12:11:47 -08:00
LICENSES This is a fairly typical cycle for documentation. There's some welcome 2018-10-24 18:01:11 +01:00
mm mm: put_and_wait_on_page_locked() while page is migrated 2018-12-28 12:11:48 -08:00
net mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
samples Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2018-12-27 13:04:52 -08:00
scripts scripts/tags.sh: add more declarations 2018-12-28 12:11:44 -08:00
security mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
sound xen: features and fixes for 4.21 2018-12-26 11:35:07 -08:00
tools mm, devm_memremap_pages: fix shutdown handling 2018-12-28 12:11:47 -08:00
usr initramfs: move gen_initramfs_list.sh from scripts/ to usr/ 2018-08-22 23:21:44 +09:00
virt Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 13:07:19 -08:00
.clang-format page cache: Convert find_get_pages_contig to XArray 2018-10-21 10:46:34 -04:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.get_maintainer.ignore
.gitattributes .gitattributes: set git diff driver for C source code files 2016-10-07 18:46:30 -07:00
.gitignore Kbuild updates for v4.17 (2nd) 2018-04-15 17:21:30 -07:00
.mailmap Merge tag 'nand/for-4.21' of git://git.infradead.org/linux-mtd into mtd/next 2018-12-18 19:59:16 +01:00
COPYING COPYING: use the new text with points to the license files 2018-03-23 12:41:45 -06:00
CREDITS MAINTAINERS: update entry for MMP platform 2018-12-03 12:39:57 -08:00
Kbuild Kbuild updates for v4.15 2017-11-17 17:45:29 -08:00
Kconfig kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt 2018-08-02 08:06:55 +09:00
MAINTAINERS Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2018-12-27 13:53:32 -08:00
Makefile Linux 4.20 2018-12-23 15:55:59 -08:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.