pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
BUG() and they were defined only to diminish the risk of build issues on
not-x86 archs and to be consistent with the generic pte methods previously
defined in include/asm-generic/pgtable.h.
But they are causing more trouble than they were supposed to solve, so
it's simpler not to define them when THP is off.
This is also correcting the export of pmdp_splitting_flush which is
currently unused (x86 isn't using the generic implementation in
mm/pgtable-generic.c and no other arch needs that [yet]).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Sam Ravnborg <sam@ravnborg.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Local symbols should be static.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
ACPI: fix resource check message
ACPI / Battery: Update information on info notification and resume
ACPI: Drop device flag wake_capable
ACPI: Always check if _PRW is present before trying to evaluate it
ACPI / PM: Check status of power resources under mutexes
ACPI / PM: Rename acpi_power_off_device()
ACPI / PM: Drop acpi_power_nocheck
ACPI / PM: Drop acpi_bus_get_power()
Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
ACPI / Fan: Rework the handling of power resources
ACPI / PM: Register power resource devices as soon as they are needed
ACPI / PM: Register acpi_power_driver early
ACPI / PM: Add function for updating device power state consistently
ACPI / PM: Add function for device power state initialization
ACPI / PM: Introduce __acpi_bus_get_power()
ACPI / PM: Introduce function for refcounting device power resources
ACPI / PM: Add functions for manipulating lists of power resources
ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
ACPICA: Update version to 20101209
...
In the current implementation mem_cgroup_end_migration() decides whether
the page migration has succeeded or not by checking "oldpage->mapping".
But if we are tring to migrate a shmem swapcache, the page->mapping of it
is NULL from the begining, so the check would be invalid. As a result,
mem_cgroup_end_migration() assumes the migration has succeeded even if
it's not, so "newpage" would be freed while it's not uncharged.
This patch fixes it by passing mem_cgroup_end_migration() the result of
the page migration.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
followed by memset() to zero the memory. This can be more efficiently
achieved by using kzalloc() and vzalloc(). There's also one situation
where we can use kzalloc_node() - this is what's new in this version of
the patch.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b1dd693e ("memcg: avoid deadlock between move charge and
try_charge()") can cause another deadlock about mmap_sem on task migration
if cpuset and memcg are mounted onto the same mount point.
After the commit, cgroup_attach_task() has sequence like:
cgroup_attach_task()
ss->can_attach()
cpuset_can_attach()
mem_cgroup_can_attach()
down_read(&mmap_sem) (1)
ss->attach()
cpuset_attach()
mpol_rebind_mm()
down_write(&mmap_sem) (2)
up_write(&mmap_sem)
cpuset_migrate_mm()
do_migrate_pages()
down_read(&mmap_sem)
up_read(&mmap_sem)
mem_cgroup_move_task()
mem_cgroup_clear_mc()
up_read(&mmap_sem)
We can cause deadlock at (2) because we've already aquire the mmap_sem at (1).
But the commit itself is necessary to fix deadlocks which have existed
before the commit like:
Ex.1)
move charge | try charge
--------------------------------------+------------------------------
mem_cgroup_can_attach() | down_write(&mmap_sem)
mc.moving_task = current | ..
mem_cgroup_precharge_mc() | __mem_cgroup_try_charge()
mem_cgroup_count_precharge() | prepare_to_wait()
down_read(&mmap_sem) | if (mc.moving_task)
-> cannot aquire the lock | -> true
| schedule()
| -> move charge should wake it up
Ex.2)
move charge | try charge
--------------------------------------+------------------------------
mem_cgroup_can_attach() |
mc.moving_task = current |
mem_cgroup_precharge_mc() |
mem_cgroup_count_precharge() |
down_read(&mmap_sem) |
.. |
up_read(&mmap_sem) |
| down_write(&mmap_sem)
mem_cgroup_move_task() | ..
mem_cgroup_move_charge() | __mem_cgroup_try_charge()
down_read(&mmap_sem) | prepare_to_wait()
-> cannot aquire the lock | if (mc.moving_task)
| -> true
| schedule()
| -> move charge should wake it up
This patch fixes all of these problems by:
1. revert the commit.
2. To fix the Ex.1, we set mc.moving_task after mem_cgroup_count_precharge()
has released the mmap_sem.
3. To fix the Ex.2, we use down_read_trylock() instead of down_read() in
mem_cgroup_move_charge() and, if it has failed to aquire the lock, cancel
all extra charges, wake up all waiters, and retry trylock.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adding the number of swap pages to the byte limit of a memory control
group makes no sense. Convert the pages to bytes before adding them.
The only user of this code is the OOM killer, and the way it is used means
that the error results in a higher OOM badness value. Since the cgroup
limit is the same for all tasks in the cgroup, the error should have no
practical impact at the moment.
But let's not wait for future or changing users to trip over it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
accounting and migration code. This reworks the locking scheme of
_update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK,
which is always taken under IRQ disable.
1. If pages are being migrated from a memcg, then updates to that
memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
move_lock_page_cgroup(). In an upcoming commit, memcg dirty page
accounting will be updating memcg page accounting (specifically: num
writeback pages) from IRQ context (softirq). Avoid a deadlocking
nested spin lock attempt by disabling irq on the local processor when
grabbing the PCG_MOVE_LOCK.
2. lock for update_page_stat is used only for avoiding race with
move_account(). So, IRQ awareness of lock_page_cgroup() itself is not
a problem. The problem is between mem_cgroup_update_page_stat() and
mem_cgroup_move_account_page().
Trade-off:
* Changing lock_page_cgroup() to always disable IRQ (or
local_bh) has some impacts on performance and I think
it's bad to disable IRQ when it's not necessary.
* adding a new lock makes move_account() slower. Score is
here.
Performance Impact: moving a 8G anon process.
Before:
real 0m0.792s
user 0m0.000s
sys 0m0.780s
After:
real 0m0.854s
user 0m0.000s
sys 0m0.842s
This score is bad but planned patches for optimization can reduce
this impact.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Andrea Righi <arighi@develer.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace usage of the mem_cgroup_update_file_mapped() memcg
statistic update routine with two new routines:
* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()
As before, only the file_mapped statistic is managed. However, these more
general interfaces allow for new statistics to be more easily added. New
statistics are added with memcg dirty page accounting.
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zone->lru_lock is heavily contented in workload where activate_page()
is frequently used. We could do batch activate_page() to reduce the lock
contention. The batched pages will be added into zone list when the pool
is full or page reclaim is trying to drain them.
For example, in a 4 socket 64 CPU system, create a sparse file and 64
processes, processes shared map to the file. Each process read access the
whole file and then exit. The process exit will do unmap_vmas() and cause
a lot of activate_page() call. In such workload, we saw about 58% total
time reduction with below patch. Other workloads with a lot of
activate_page also benefits a lot too.
I tested some microbenchmarks:
case-anon-cow-rand-mt 0.58%
case-anon-cow-rand -3.30%
case-anon-cow-seq-mt -0.51%
case-anon-cow-seq -5.68%
case-anon-r-rand-mt 0.23%
case-anon-r-rand 0.81%
case-anon-r-seq-mt -0.71%
case-anon-r-seq -1.99%
case-anon-rx-rand-mt 2.11%
case-anon-rx-seq-mt 3.46%
case-anon-w-rand-mt -0.03%
case-anon-w-rand -0.50%
case-anon-w-seq-mt -1.08%
case-anon-w-seq -0.12%
case-anon-wx-rand-mt -5.02%
case-anon-wx-seq-mt -1.43%
case-fork 1.65%
case-fork-sleep -0.07%
case-fork-withmem 1.39%
case-hugetlb -0.59%
case-lru-file-mmap-read-mt -0.54%
case-lru-file-mmap-read 0.61%
case-lru-file-mmap-read-rand -2.24%
case-lru-file-readonce -0.64%
case-lru-file-readtwice -11.69%
case-lru-memcg -1.35%
case-mmap-pread-rand-mt 1.88%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq-mt 0.89%
case-mmap-pread-seq -69.72%
case-mmap-xread-rand-mt 0.71%
case-mmap-xread-seq-mt 0.38%
The most significent are:
case-lru-file-readtwice -11.69%
case-mmap-pread-rand -15.26%
case-mmap-pread-seq -69.72%
which use activate_page a lot. others are basically variations because
each run has slightly difference.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up code and remove duplicate code. Next patch will use
pagevec_lru_move_fn introduced here too.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
but its anon_vma handling was still based around the 2.6.35 conventions.
Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
drop_anon_vma in the same way as we're now changing unmap_and_move().
I don't particularly like to propose this for stable when I've not seen
its problems in practice nor tested the solution: but it's clearly out of
synch at present.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@kernel.org> [2.6.37, 2.6.36]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Increased usage of page migration in mmotm reveals that the anon_vma
locking in unmap_and_move() has been deficient since 2.6.36 (or even
earlier). Review at the time of f18194275c
("mm: fix hang on anon_vma->root->lock") missed the issue here: the
anon_vma to which we get a reference may already have been freed back to
its slab (it is in use when we check page_mapped, but that can change),
and so its anon_vma->root may be switched at any moment by reuse in
anon_vma_prepare.
Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
then we don't need any rcu read locking over here.
In removing the rcu_unlock label: since PageAnon is a bit in
page->mapping, it's impossible for a !page->mapping page to be anon; but
insert VM_BUG_ON in case the implementation ever changes.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@kernel.org> [2.6.37, 2.6.36]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was hard to explain the page counts which were causing new LTP tests
of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc:Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When parsing changes to the huge page pool sizes made from userspace via
the sysfs interface, bogus input values are being covered up by
nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0
when strict_strtoul returns an error. This can cause an infinite loop in
the nr_hugepages_store code. This patch changes the return value for
these functions to -EINVAL when strict_strtoul returns an error.
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huge pages with order >= MAX_ORDER must be allocated at boot via the
kernel command line, they cannot be allocated or freed once the kernel is
up and running. Currently we allow values to be written to the sysfs and
sysctl files controling pool size for these huge page sizes. This patch
makes the store functions for nr_hugepages and nr_overcommit_hugepages
return -EINVAL when the pool for a page size >= MAX_ORDER is changed.
[akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
[caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_doulongvec_minmax may fail if the given buffer doesn't represent a
valid number. If we provide something invalid we will initialize the
resulting value (nr_overcommit_huge_pages in this case) to a random value
from the stack.
The issue was introduced by a3d0c6aa when the default handler has been
replaced by the helper function where we do not check the return value.
Reproducer:
echo "" > /proc/sys/vm/nr_overcommit_hugepages
[akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As it stands this code will degenerate into a busy-wait if the calling task
has signal_pending().
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dma_pool_free() scans for the page to free in the pool list holding the
pool lock. Then it releases the lock basically to acquire it immediately
again. Modify the code to only take the lock once.
This will do some additional loops and computations with the lock held in
if memory debugging is activated. If it is not activated the only new
operations with this lock is one if and one substraction.
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous approach of calucation of combined index was
page_idx & ~(1 << order))
but we have same result with
page_idx & buddy_idx
This reduces instructions slightly as well as enhances readability.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix used-unintialised warning]
Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still
be overriden by randomize_va_space sysctl.
If this is the case, the min_brk computation in sys_brk() implementation
is wrong, as it solely takes into account COMPAT_BRK setting, assuming
that brk start is not randomized. But that might not be the case if
randomize_va_space sysctl has been set to '2' at the time the binary has
been loaded from disk.
In such case, the check has to be done in a same way as in
!CONFIG_COMPAT_BRK case.
In addition to that, the check for the COMPAT_BRK case introduced back in
a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
bound") is slightly wrong -- the lower bound shouldn't be mm->end_code,
but mm->end_data instead, as that's where the legacy applications expect
brk section to start (i.e. immediately after last global variable).
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NODEMASK_ALLOC macro may dynamically allocate memory for its second
argument ('nodes_allowed' in this context).
In nr_hugepages_store_common() we may abort early if strict_strtoul()
fails, but in that case we do not free the memory already allocated to
'nodes_allowed', causing a memory leak.
This patch closes the leak by freeing the memory in the error path.
[akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim]
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
anonymous pages, as introduced by git commit
989f89c57e ("fix rcu_read_lock() in page
migraton"). The point of the RCU protection there is part of getting a
stable reference to anon_vma and is only held for anon pages as file pages
are locked which is sufficient protection against freeing.
However, while a file page's mapping is being migrated, the radix tree is
double checked to ensure it is the expected page. This uses
radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
triggering the following warning.
[ 173.674290] ===================================================
[ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 173.676016] ---------------------------------------------------
[ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
[ 173.676016]
[ 173.676016] other info that might help us debug this:
[ 173.676016]
[ 173.676016]
[ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
[ 173.676016] 1 lock held by hugeadm/2899:
[ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
[ 173.676016]
[ 173.676016] stack backtrace:
[ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
[ 173.676016] Call Trace:
[ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
[ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
[ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
[ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
[ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
[ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
[ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
[ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa
This patch introduces radix_tree_deref_slot_protected() which calls
rcu_dereference_protected(). Users of it must pass in the
mapping->tree_lock that is protecting this dereference. Holding the tree
lock protects against parallel updaters of the radix tree meaning that
rcu_dereference_protected is allowable.
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Milton Miller <miltonm@bga.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org> [2.6.37.early]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup some code with common compound_trans_head helper.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes KSM full operational with THP pages. Subpages are scanned
while the hugepage is still in place and delivering max cpu performance,
and only if there's a match and we're going to deduplicate memory, the
single hugepages with the subpage match is split.
There will be no false sharing between ksmd and khugepaged. khugepaged
won't collapse 2m virtual regions with KSM pages inside. ksmd also should
only split pages when the checksum matches and we're likely to split an
hugepage for some long living ksm page (usual ksm heuristic to avoid
sharing pages that get de-cowed).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
mmap and before touching the memory. While this is enough for most
usages, it's little effort to make madvise more dynamic at runtime on an
existing mapping by making khugepaged aware about madvise.
MADV_HUGEPAGE: register in khugepaged immediately without waiting a page
fault (that may not ever happen if all pages are already mapped and the
"enabled" knob was set to madvise during the initial page faults).
MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
collapsing pages where not needed.
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
or the feature isn't built into the kernel. Never silently return
success.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlbfs was changed to allow memory failure to migrate the hugetlbfs
pages and that broke THP as split_huge_page was then called on hugetlbfs
pages too.
compound_head/order was also run unsafe on THP pages that can be splitted
at any time.
All compound_head() invocations in memory-failure.c that are run on pages
that aren't pinned and that can be freed and reused from under us (while
compound_head is running) are buggy because compound_head can return a
dangling pointer, but I'm not fixing this as this is a generic
memory-failure bug not specific to THP but it applies to hugetlbfs too, so
I can fix it later after THP is merged upstream.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add debug checks for invariants that if broken could lead to mapcount vs
page_mapcount debug checks to trigger later in split_huge_page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure we scale up nr_rotated when we encounter a referenced
transparent huge page. This ensures pageout scanning balance is not
distorted when there are huge pages on the LRU.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
statistics, so the Active(anon) and Inactive(anon) statistics in
/proc/meminfo are correct.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On small systems, the extra memory used by the anti-fragmentation memory
reserve and simply because huge pages are smaller than large pages can
easily outweigh the benefits of less TLB misses.
A less obvious concern is if run on a NUMA machine with asymmetric node
sizes and one of them is very small. The reserve could make the node
unusable.
In case of the crashdump kernel, OOMs have been observed due to the
anti-fragmentation memory reserve taking up a large fraction of the
crashdump image.
This patch disables transparent hugepages on systems with less than 1GB of
RAM, but the hugepage subsystem is fully initialized so administrators can
enable THP through /sys if desired.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Avi Kiviti <avi@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It makes no sense not to enable compaction for small order pages as we
don't want to end up with bad order 2 allocations and good and graceful
order 9 allocations.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This takes advantage of memory compaction to properly generate pages of
order > 0 if regular page reclaim fails and priority level becomes more
severe and we don't reach the proper watermarks.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's unclear why schedule friendly kernel threads can't be taken away by
the CPU through the scheduler itself. It's safer to stop them as they can
trigger memory allocation, if kswapd also freezes itself to avoid
generating I/O they have too.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For GRU and EPT, we need gup-fast to set referenced bit too (this is why
it's correct to return 0 when shadow_access_mask is zero, it requires
gup-fast to set the referenced bit). qemu-kvm access already sets the
young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow
paging EPT minor fault we relay on gup-fast to signal the page is in
use...
We also need to check the young bits on the secondary pagetables for NPT
and not nested shadow mmu as the data may never get accessed again by the
primary pte.
Without this closer accuracy, we'd have to remove the heuristic that
avoids collapsing hugepages in hugepage virtual regions that have not even
a single subpage in use.
->test_young is full backwards compatible with GRU and other usages that
don't have young bits in pagetables set by the hardware and that should
nuke the secondary mmu mappings when ->clear_flush_young runs just like
EPT does.
Removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page completely isn't so bad either probably but
I thought it was worth it and this makes it reliable.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Archs implementing Transparent Hugepage Support must implement a function
called has_transparent_hugepage to be sure the virtual or physical CPU
supports Transparent Hugepages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An huge pmd can only be mapped if the corresponding 2M virtual range is
fully contained in the vma. At times the VM calls split_vma twice, if the
first split_vma succeeds and the second fail, the first split_vma remains
in effect and it's not rolled back. For split_vma or vma_adjust to fail
an allocation failure is needed so it's a very unlikely event (the out of
memory killer would normally fire before any allocation failure is visible
to kernel and userland and if an out of memory condition happens it's
unlikely to happen exactly here). Nevertheless it's safer to ensure that
no huge pmd can be left around if the vma is adjusted in a way that can't
fit hugepages anymore at the new vm_start/vm_end address.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With transparent hugepage support we need compaction for the "defrag"
sysfs controls to be effective.
At the moment THP hangs the system if COMPACTION isn't selected, as
without COMPACTION lumpy reclaim wouldn't be entirely disabled. So at the
moment it's not orthogonal. When lumpy will be removed from the VM I can
remove the select COMPACTION in theory, but then 99% of THP users would be
still doing a mistake in disabling compaction, even if the mistake won't
return in fatal runtime but just slightly degraded performance. So from a
theoretical standpoing forcing the below select is not needed (the
dependency isn't strict nor at compile time nor at runtime) but from a
practical standpoint it is safer.
If anybody really wants THP to run without compaction, it'd be such a
weird setup that editing the Kconfig file to allow it will be surely not a
problem.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow to choose between the always|madvise default for page faults and
khugepaged at config time. madvise guarantees zero risk of higher memory
footprint for applications (applications using madvise(MADV_HUGEPAGE)
won't risk to use any more memory by backing their virtual regions with
hugepages).
Initially set the default to N and don't depend on EMBEDDED.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This tries to be more friendly to filesystem in userland, with userland
backends that allocate memory in the I/O paths and that could deadlock if
khugepaged holds the mmap_sem write mode of the userland backend while
allocating memory. Memory allocation may wait for writeback I/O
completion from the daemon that may be blocked in the mmap_sem read mode
if a page fault happens and the daemon wasn't using mlock for the memory
required for the I/O submission and completion.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
introducing alloc_pages_vma. khugepaged needs special handling as the
allocation has to happen inside collapse_huge_page where the vma is known
and an error has to be returned to the outer loop to sleep
alloc_sleep_millisecs in case of failure. But it retains the more
efficient logic of handling allocation failures in khugepaged in case of
CONFIG_NUMA=n.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With memory compaction in, and lumpy-reclaim disabled, it seems safe
enough to defrag memory during the (synchronous) transparent hugepage page
faults (TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) and not only during khugepaged
(async) hugepage allocations that was already enabled even before memory
compaction was in (TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If transparent hugepage is enabled initialize min_free_kbytes to an
optimal value by default. This moves the hugeadm algorithm in kernel.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Natively handle huge pmds when changing page tables on behalf of
mprotect().
I left out update_mmu_cache() because we do not need it on x86 anyway but
more importantly the interface works on ptes, not pmds.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Flushing the tlb for huge pmds requires the vma's anon_vma, so pass along
the vma instead of the mm, we can always get the latter when we need it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Handle transparent huge page pmd entries natively instead of splitting
them into subpages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for transparent hugepages to x86 32bit.
Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never
support transparent hugepages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can
be added to page->flags without overflowing (because of the sparse section
bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also
has to move the memory hotplug code from _mapcount to lru.next to avoid
any risk of clashes. We can't use lru.next for PG_buddy removal, but
memory hotplug can use lru.next even more easily than the mapcount
instead.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
register in khugepaged if the vma grows.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add khugepaged to relocate fragmented pages into hugepages if new
hugepages become available. (this is indipendent of the defrag logic that
will have to make new hugepages available)
The fundamental reason why khugepaged is unavoidable, is that some memory
can be fragmented and not everything can be relocated. So when a virtual
machine quits and releases gigabytes of hugepages, we want to use those
freely available hugepages to create huge-pmd in the other virtual
machines that may be running on fragmented memory, to maximize the CPU
efficiency at all times. The scan is slow, it takes nearly zero cpu time,
except when it copies data (in which case it means we definitely want to
pay for that cpu time) so it seems a good tradeoff.
In addition to the hugepages being released by other process releasing
memory, we have the strong suspicion that the performance impact of
potentially defragmenting hugepages during or before each page fault could
lead to more performance inconsistency than allocating small pages at
first and having them collapsed into large pages later... if they prove
themselfs to be long lived mappings (khugepaged scan is slow so short
lived mappings have low probability to run into khugepaged if compared to
long lived mappings).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add hugepage stat information to /proc/vmstat and /proc/meminfo.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add memcg charge/uncharge to hugepage faults in huge_memory.c.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By this patch, when a transparent hugepage is charged, not only the head
page but also all the tail pages are committed, IOW pc->mem_cgroup and
pc->flags of tail pages are set.
Without this patch:
- Tail pages are not linked to any memcg's LRU at splitting. This causes many
problems, for example, the charged memcg's directory can never be rmdir'ed
because it doesn't have enough pages to scan to make the usage decrease to 0.
- "rss" field in memory.stat would be incorrect. Moreover, usage_in_bytes in
root cgroup is calculated by the stat not by res_counter(since 2.6.32),
it would be incorrect too.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No pmd_trans_huge should ever materialize in migration ptes areas, because
we split the hugepage before migration ptes are instantiated.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add madvise MADV_HUGEPAGE to mark regions that are important to be
hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
or the feature isn't built into the kernel. Never silently return
success.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pte_trans_huge must not leak in certain vmas like the mmio special pfn or
filebacked mappings.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This documents how split_huge_page is safe vs new vma inserctions into the
anon_vma that may have already released the anon_vma->lock but not
established pmds yet when split_huge_page starts.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If you configure THP in addition to HUGETLB_PAGE on x86_32 without PAE,
the p?d-folding works out that munlock_vma_pages_range() can crash to
follow_page()'s pud_huge() BUG_ON(flags & FOLL_GET): it needs the same
VM_HUGETLB check already there on the pmd_huge() line. Conveniently,
openSUSE provides a "blogd" which tests this out at startup!
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lately I've been working to make KVM use hugepages transparently without
the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
see removed:
1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap
2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing
3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated on hugepages automatically in regions under
madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
not null)
4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guest not using
O_DIRECT (aka cache=off).
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
to 19 in case only the hypervisor uses transparent hugepages, and they
decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
linux hypervisor and the linux guest both uses this patch (though the
guest will limit the addition speedup to anonymous regions only for
now...). Even more important is that the tlb miss handler is much slower
on a NPT/EPT guest than for a regular shadow paging or no-virtualization
scenario. So maximizing the amount of virtual memory cached by the TLB
pays off significantly more with NPT/EPT than without (even if there would
be no significant speedup in the tlb-miss runtime).
The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on
regular anonymous vmas. This is what this patch tries to achieve in the
least intrusive possible way. We want hugepages and hugetlb to be used in
a way so that all applications can benefit without changes (as usual we
leverage the KVM virtualization design: by improving the Linux VM at
large, KVM gets the performance boost too).
The most important design choice is: always fallback to 4k allocation if
the hugepage allocation fails! This is the _very_ opposite of some large
pagecache patches that failed with -EIO back then if a 64k (or similar)
allocation failed...
Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split an
hugepage into 512 regular pages and it has to be done with an operation
that can't fail. This way the reliability of the swapping isn't decreased
(no need to allocate memory when we are short on memory to swap) and it's
trivial to plug a split_huge_page* one-liner where needed without
polluting the VM. Over time we can teach mprotect, mremap and friends to
handle pmd_trans_huge natively without calling split_huge_page*. The fact
it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
(instead of the current void) we'd need to rollback the mprotect from the
middle of it (ideally including undoing the split_vma) which would be a
big change and in the very wrong direction (it'd likely be simpler not to
call split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
incremental and it'll just be an "harmless" addition later if this initial
part is agreed upon. It also should be noted that locking-wise replacing
regular pages with hugepages is going to be very easy if compared to what
I'm doing below in split_huge_page, as it will only happen when
page_count(page) matches page_mapcount(page) if we can take the PG_lock
and mmap_sem in write mode. collapse_huge_page will be a "best effort"
that (unlike split_huge_page) can fail at the minimal sign of trouble and
we can try again later. collapse_huge_page will be similar to how KSM
works and the madvise(MADV_HUGEPAGE) will work similar to
madvise(MADV_MERGEABLE).
The default I like is that transparent hugepages are used at page fault
time. This can be changed with
/sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
to three values "always", "madvise", "never" which mean respectively that
hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
controls if the hugepage allocation should defrag memory aggressively
"always", only inside "madvise" regions, or "never".
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point of
view. In short there is no current existing way to serialize the O_DIRECT
final put_page against split_huge_page_refcount so I had to invent a new
one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
returns so...). And I didn't want to impact all gup/gup_fast users for
now, maybe if we change the gup interface substantially we can avoid this
locking, I admit I didn't think too much about it because changing the gup
unpinning interface would be invasive.
If we ignored O_DIRECT we could stick to the existing compound refcounting
code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
(and any other mmu notifier user) would call it without FOLL_GET (and if
FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
current task mmu notifier list yet). But O_DIRECT is fundamental for
decent performance of virtualized I/O on fast storage so we can't avoid it
to solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.
Swap and oom works fine (well just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young bit
on the pmd, that didn't have a range check but I think KVM will be fine
because the whole point of hugepages is that EPT/NPT will also use a huge
pmd when they notice gup returns pages with PageCompound set, so they
won't care of a range and there's just the pmd young bit to check in that
case.
NOTE: in some cases if the L2 cache is small, this may slowdown and waste
memory during COWs because 4M of memory are accessed in a single fault
instead of 8k (the payoff is that after COW the program can run faster).
So we might want to switch the copy_huge_page (and clear_huge_page too) to
not temporal stores. I also extensively researched ways to avoid this
cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
up to 1M (I can send those patches that fully implemented prefault) but I
concluded they're not worth it and they add an huge additional complexity
and they remove all tlb benefits until the full hugepage has been faulted
in, to save a little bit of memory and some cache during app startup, but
they still don't improve substantially the cache-trashing during startup
if the prefault happens in >4k chunks. One reason is that those 4k pte
entries copied are still mapped on a perfectly cache-colored hugepage, so
the trashing is the worst one can generate in those copies (cow of 4k page
copies aren't so well colored so they trashes less, but again this results
in software running faster after the page fault). Those prefault patches
allowed things like a pte where post-cow pages were local 4k regular anon
pages and the not-yet-cowed pte entries were pointing in the middle of
some hugepage mapped read-only. If it doesn't payoff substantially with
todays hardware it will payoff even less in the future with larger l2
caches, and the prefault logic would blot the VM a lot. If one is
emebdded transparent_hugepage can be disabled during boot with sysfs or
with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) that
will ensure not a single hugepage is allocated at boot time. It is simple
enough to just disable transparent hugepage globally and let transparent
hugepages be allocated selectively by applications in the MADV_HUGEPAGE
region (both at page fault time, and if enabled with the
collapse_huge_page too through the kernel daemon).
This patch supports only hugepages mapped in the pmd, archs that have
smaller hugepages will not fit in this patch alone. Also some archs like
power have certain tlb limits that prevents mixing different page size in
the same regions so they will not fit in this framework that requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation. hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits. hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
to be found not fragmented after a certain system uptime and that would be
very expensive to defragment with relocation, so requiring reservation.
hugetlbfs is the "reservation way", the point of transparent hugepages is
not to have any reservation at all and maximizing the use of cache and
hugepages at all times automatically.
Some performance result:
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
ages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988
============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (3UL*1024*1024*1024)
int main()
{
char *p = malloc(SIZE), *p2;
struct timeval before, after;
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset page fault %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
return 0;
}
============
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Not worth throwing away the precious reserved free memory pool for
allocations that can fail gracefully (either through mempool or because
they're transhuge allocations later falling back to 4k allocations).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Transparent hugepage allocations must be allowed not to invoke kswapd or
any other kind of indirect reclaim (especially when the defrag sysfs is
control disabled). It's unacceptable to swap out anonymous pages
(potentially anonymous transparent hugepages) in order to create new
transparent hugepages. This is true for the MADV_HUGEPAGE areas too
(swapping out a kvm virtual machine and so having it suffer an unbearable
slowdown, so another one with guest physical memory marked MADV_HUGEPAGE
can run 30% faster if it is running memory intensive workloads, makes no
sense). If a transparent hugepage allocation fails the slowdown is minor
and there is total fallback, so kswapd should never be asked to swapout
memory to allow the high order allocation to succeed.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the copy/clear_huge_page functions to common code to share between
hugetlb.c and huge_memory.c.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paging logic that splits the page before it is unmapped and added to swap
to ensure backwards compatibility with the legacy swap code. Eventually
swap should natively pageout the hugepages to increase performance and
decrease seeking and fragmentation of swap space. swapoff can just skip
over huge pmd as they cannot be part of swap yet. In add_to_swap be
careful to split the page only if we got a valid swap entry so we don't
split hugepages with a full swap.
In theory we could split pages before isolating them during the lru scan,
but for khugepaged to be safe, I'm relying on either mmap_sem write mode,
or PG_lock taken, so split_huge_page has to run either with mmap_sem
read/write mode or PG_lock taken. Calling it from isolate_lru_page would
make locking more complicated, in addition to that split_huge_page would
deadlock if called by __isolate_lru_page because it has to take the lru
lock to add the tail pages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
split_huge_page_pmd compat code. Each one of those would need to be
expanded to hundred of lines of complex code without a fully reliable
split_huge_page_pmd design.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pte alloc routines must wait for split_huge_page if the pmd is not present
and not null (i.e. pmd_trans_splitting). The additional branches are
optimized away at compile time by pmd_trans_splitting if the config option
is off. However we must pass the vma down in order to know the anon_vma
lock to wait for.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some are needed to build but not actually used on archs not supporting
transparent hugepages. Others like pmdp_clear_flush are used by x86 too.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Warn destroy_compound_page that __split_huge_page_refcount is heavily
dependent on its internal behavior.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
huge_memory.c needs it too when it fallbacks in copying hugepages into
regular fragmented pages if hugepage allocation fails during COW.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clear compound mapping for anonymous compound pages like it already
happens for regular anonymous pages. But crash if mapping is set for any
tail page, also the PageAnon check is meaningless for tail pages. This
check only makes sense for the head page, for tail page it can only hide
bugs and we definitely don't want to hide bugs.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After releasing the compound_lock split_huge_page can still run and release the
page before put_page_testzero runs.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alter compound get_page/put_page to keep references on subpages too, in
order to allow __split_huge_page_refcount to split an hugepage even while
subpages have been pinned by one of the get_user_pages() variants.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_count shows the count of the head page, but the actual check is done
on the tail page, so show what is really being checked.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a swapcache page is replaced by a ksm page, it's best to free that
swap immediately.
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I think determine_dirtyable_memory() is a rather costly function since it
need many atomic reads for gathering zone/global page state. But when we
use vm_dirty_bytes && dirty_background_bytes, we don't need that costly
calculation.
This patch eliminates such unnecessary overhead.
NOTE : newly added if condition might add overhead in normal path.
But it should be _really_ small because anyway we need the
access both vm_dirty_bytes and dirty_background_bytes so it is
likely to hit the cache.
[akpm@linux-foundation.org: fix used-uninitialised warning]
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When numa_zonelist_order parameter is set to "node" or "zone" on the
command line it's still showing as "default" in sysctl. That's because
early_param parsing function changes only user_zonelist_order variable.
Fix this by copying user-provided string to numa_zonelist_order if it was
successfully parsed.
Signed-off-by: Volodymyr G Lukiianyk <volodymyrgl@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When kswapd is woken up for a high-order allocation, it takes account of
the highest usable zone by the caller (the classzone idx). During
allocation, this index is used to select the lowmem_reserve[] that should
be applied to the watermark calculation in zone_watermark_ok().
When balancing a node, kswapd considers the highest unbalanced zone to be
the classzone index. This will always be at least be the callers
classzone_idx and can be higher. However, sleeping_prematurely() always
considers the lowest zone (e.g. ZONE_DMA) to be the classzone index.
This means that sleeping_prematurely() can consider a zone to be balanced
that is unusable by the allocation request that originally woke kswapd.
This patch changes sleeping_prematurely() to use a classzone_idx matching
the value it used in balance_pgdat().
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
be balanced but sleeping_prematurely does not. This can force kswapd to
stay awake longer than it should. This patch fixes it.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When kswapd wakes up, it reads its order and classzone from pgdat and
calls balance_pgdat. While its awake, it potentially reclaimes at a high
order and a low classzone index. This might have been a once-off that was
not required by subsequent callers. However, because the pgdat values
were not reset, they remain artifically high while balance_pgdat() is
running and potentially kswapd enters a second unnecessary reclaim cycle.
Reset the pgdat order and classzone index after reading.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
there was a race pushing a zone below its watermark. If the race
happened, it stays awake. However, balance_pgdat() can decide to reclaim
at order-0 if it decides that high-order reclaim is not working as
expected. This information is not passed back to sleeping_prematurely().
The impact is that kswapd remains awake reclaiming pages long after it
should have gone to sleep. This patch passes the adjusted order to
sleeping_prematurely and uses the same logic as balance_pgdat to decide if
it's ok to go to sleep.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reclaiming for high-orders, kswapd is responsible for balancing a
node but it should not reclaim excessively. It avoids excessive reclaim
by considering if any zone in a node is balanced then the node is
balanced. In the cases where there are imbalanced zone sizes (e.g.
ZONE_DMA with both ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep
prematurely as just one small zone was balanced.
This alters the sleep logic of kswapd slightly. It counts the number of
pages that make up the balanced zones. If the total number of balanced
pages is more than a quarter of the zone, kswapd will go back to sleep.
This should keep a node balanced without reclaiming an excessive number of
pages.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simon Kirby reported the following problem
We're seeing cases on a number of servers where cache never fully
grows to use all available memory. Sometimes we see servers with 4 GB
of memory that never seem to have less than 1.5 GB free, even with a
constantly-active VM. In some cases, these servers also swap out while
this happens, even though they are constantly reading the working set
into memory. We have been seeing this happening for a long time; I
don't think it's anything recent, and it still happens on 2.6.36.
After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.
There are two apparent problems here. On the target machine, there is a
small Normal zone in comparison to DMA32. As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers. The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not. This keeps kswapd artifically awake.
A number of tests were run and the figures from previous postings will
look very different for a few reasons. One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem. In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases. Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints. There is less information
but recording is less disruptive.
The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks. The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.
POSTMARK
traceonly kanyzone
Transactions per second: 156.00 ( 0.00%) 153.00 (-1.96%)
Data megabytes read per second: 21.51 ( 0.00%) 21.52 ( 0.05%)
Data megabytes written per second: 29.28 ( 0.00%) 29.11 (-0.58%)
Files created alone per second: 250.00 ( 0.00%) 416.00 (39.90%)
Files create/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
Files deleted alone per second: 520.00 ( 0.00%) 420.00 (-23.81%)
Files delete/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 16.58 17.4
Total Elapsed Time (seconds) 218.48 222.47
VMstat Reclaim Statistics: vmscan
Direct reclaims 0 4
Direct reclaim pages scanned 0 203
Direct reclaim pages reclaimed 0 184
Kswapd pages scanned 326631 322018
Kswapd pages reclaimed 312632 309784
Kswapd low wmark quickly 1 4
Kswapd high wmark quickly 122 475
Kswapd skip congestion_wait 1 0
Pages activated 700040 705317
Pages deactivated 212113 203922
Pages written 9875 6363
Total pages scanned 326631 322221
Total pages reclaimed 312632 309968
%age total pages scanned/reclaimed 95.71% 96.20%
%age total pages scanned/written 3.02% 1.97%
proc vmstat: Faults
Major Faults 300 254
Minor Faults 645183 660284
Page ins 493588 486704
Page outs 4960088 4986704
Swap ins 1230 661
Swap outs 9869 6355
Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed. Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.
The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive. As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.
The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.
MICRO
traceonly kanyzone
User/Sys Time Running Test (seconds) 29.29 28.87
Total Elapsed Time (seconds) 492.18 488.79
VMstat Reclaim Statistics: vmscan
Direct reclaims 2128 1460
Direct reclaim pages scanned 2284822 1496067
Direct reclaim pages reclaimed 148919 110937
Kswapd pages scanned 15450014 16202876
Kswapd pages reclaimed 8503697 8537897
Kswapd low wmark quickly 3100 3397
Kswapd high wmark quickly 1860 7243
Kswapd skip congestion_wait 708 801
Pages activated 9635 9573
Pages deactivated 1432 1271
Pages written 223 1130
Total pages scanned 17734836 17698943
Total pages reclaimed 8652616 8648834
%age total pages scanned/reclaimed 48.79% 48.87%
%age total pages scanned/written 0.00% 0.01%
proc vmstat: Faults
Major Faults 165 221
Minor Faults 9655785 9656506
Page ins 3880 7228
Page outs 37692940 37480076
Swap ins 0 69
Swap outs 19 15
Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster. Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.
I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures. Performance is not significantly changed and
the reclaim statistics look reasonable.
Tgis patch:
When the allocator enters its slow path, kswapd is woken up to balance the
node. It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects. If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages. The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones. kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Eric B Munson <emunson@mgebm.net>
Cc: Simon Kirby <sim@hostway.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running the annotated branch profiler on a box doing average work
(firefox, evolution, xchat, distcc farm), the likely() used in
grab_cache_page_write_begin() was incorrect most of the time:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206
Adding a trace_printk() and running the function tracer limited to
just this function I can see:
gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268947: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=8
gconfd-2-2696 [000] 4467.268959: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268960: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=9
gconfd-2-2696 [000] 4467.268972: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268973: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=10
gconfd-2-2696 [000] 4467.268991: grab_cache_page_write_begin <-ext3_write_begin
gconfd-2-2696 [000] 4467.268992: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=11
gconfd-2-2696 [000] 4467.269005: grab_cache_page_write_begin <-ext3_write_begin
Which shows that a lot of calls from ext3_write_begin will result in the
page returned by "find_lock_page" will be NULL.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
IS_ERR() already implies unlikely(), so it can be omitted here.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__get_user_pages gets a new 'nonblocking' parameter to signal that the
caller is prepared to re-acquire mmap_sem and retry the operation if
needed. This is used to split off long operations if they are going to
block on a disk transfer, or when we detect contention on the mmap_sem.
[akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a single code path for faulting in pages during mlock.
The reason to have it in this patch series is that I did not want to
update both code paths in a later change that releases mmap_sem when
blocking on disk.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the code to mlock pages from __mlock_vma_pages_range() to
follow_page().
This allows __mlock_vma_pages_range() to not have to break down work into
16-page batches.
An additional motivation for doing this within the present patch series is
that it'll make it easier for a later chagne to drop mmap_sem when
blocking on disk (we'd like to be able to resume at the page that was read
from disk instead of at the start of a 16-page batch).
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently mlock() holds mmap_sem in exclusive mode while the pages get
faulted in. In the case of a large mlock, this can potentially take a
very long time, during which various commands such as 'ps auxw' will
block. This makes sysadmins unhappy:
real 14m36.232s
user 0m0.003s
sys 0m0.015s
(output from 'time ps auxw' while a 20GB file was being mlocked without
being previously preloaded into page cache)
I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs. Then a second pass could be done
to actually mlock the pages, in small batches, releasing mmap_sem when we
block on disk access or when we detect some contention.
This patch:
Before this change, mlock() holds mmap_sem in exclusive mode while the
pages get faulted in. In the case of a large mlock, this can potentially
take a very long time. Various things will block while mmap_sem is held,
including 'ps auxw'. This can make sysadmins angry.
I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs. Then a second pass could be done
to actually mlock the pages with mmap_sem held for reads only. We need to
recheck the vma flags after we re-acquire mmap_sem, but this is easy.
In the case where a vma has been munlocked before mlock completes, pages
that were already marked as PageMlocked() are handled by the munlock()
call, and mlock() is careful to not mark new page batches as PageMlocked()
after the munlock() call has cleared the VM_LOCKED vma flags. So, the end
result will be identical to what'd happen if munlock() had executed after
the mlock() call.
In a later change, I will allow the second pass to release mmap_sem when
blocking on disk accesses or when it is otherwise contended, so that it
won't be held for long periods of time even in shared mode.
Signed-off-by: Michel Lespinasse <walken@google.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When faulting in pages for mlock(), we want to break COW for anonymous or
file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no
need to write-fault into VM_SHARED vmas since shared file pages can be
mlocked first and dirtied later, when/if they actually get written to.
Skipping the write fault is desirable, as we don't want to unnecessarily
cause these pages to be dirtied and queued for writeback.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Theodore Tso <tytso@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reorganize the code so that dirty pages are handled closer to the place
that makes them dirty (handling write fault into shared, writable VMAs).
No behavior changes.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Theodore Tso <tytso@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mlocking a shared, writable vma currently causes the corresponding pages
to be marked as dirty and queued for writeback. This seems rather
unnecessary given that the pages are not being actually modified during
mlock. It is understood that for non-shared mappings (file or anon) we
want to use a write fault in order to break COW, but there is just no such
need for shared mappings.
The first two patches in this series do not introduce any behavior change.
The intent there is to make it obvious that dirtying file pages is only
done in the (writable, shared) case. I think this clarifies the code, but
I wouldn't mind dropping these two patches if there is no consensus about
them.
The last patch is where we actually avoid dirtying shared mappings during
mlock. Note that as a side effect of this, we won't call page_mkwrite()
for the mappings that define it, and won't be pre-allocating data blocks
at the FS level if the mapped file was sparsely allocated. My
understanding is that mlock does not need to provide such guarantee, as
evidenced by the fact that it never did for the filesystems that don't
define page_mkwrite() - including some common ones like ext3. However, I
would like to gather feedback on this from filesystem people as a
precaution. If this turns out to be a showstopper, maybe block
preallocation can be added back on using a different interface.
Large shared mlocks are getting significantly (>2x) faster in my tests, as
the disk can be fully used for reading the file instead of having to share
between this and writeback.
This patch:
Reorganize the code to remove the 'reuse' flag. No behavior changes.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Theodore Tso <tytso@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>