Commit Graph

7743 Commits

Author SHA1 Message Date
Andrea Arcangeli
9b7ac26018 mm/hugetlb.c: defer PageHeadHuge() symbol export
No actual need of it. So keep it internal.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:43 -08:00
Andrew Morton
26296ad2df mm/swap.c: reorganize put_compound_page()
Tweak it so save a tab stop, make code layout slightly less nutty.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:43 -08:00
Andrew Morton
758f66a29c mm/hugetlb.c: simplify PageHeadHuge() and PageHuge()
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:43 -08:00
Andrea Arcangeli
3bfcd13ec0 mm: hugetlbfs: use __compound_tail_refcounted in __get_page_tail too
Also remove hugetlb.h which isn't needed anymore as PageHeadHuge is
handled in mm.h.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:43 -08:00
Andrea Arcangeli
44518d2b32 mm: tail page refcounting optimization for slab and hugetlbfs
This skips the _mapcount mangling for slab and hugetlbfs pages.

The main trouble in doing this is to guarantee that PageSlab and
PageHeadHuge remains constant for all get_page/put_page run on the tail
of slab or hugetlbfs compound pages.  Otherwise if they're set during
get_page but not set during put_page, the _mapcount of the tail page
would underflow.

PageHeadHuge will remain true until the compound page is released and
enters the buddy allocator so it won't risk to change even if the tail
page is the last reference left on the page.

PG_slab instead is cleared before the slab frees the head page with
put_page, so if the tail pin is released after the slab freed the page,
we would have a problem.  But in the slab case the tail pin cannot be
the last reference left on the page.  This is because the slab code is
free to reuse the compound page after a kfree/kmem_cache_free without
having to check if there's any tail pin left.  In turn all tail pins
must be always released while the head is still pinned by the slab code
and so we know PG_slab will be still set too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:43 -08:00
Andrea Arcangeli
ebf360f9bb mm: hugetlbfs: move the put/get_page slab and hugetlbfs optimization in a faster path
We don't actually need a reference on the head page in the slab and
hugetlbfs paths, as long as we add a smp_rmb() which should be faster
than get_page_unless_zero.

[akpm@linux-foundation.org: fix typo in comment]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:43 -08:00
Andrea Arcangeli
a0368d4e48 mm: hugetlb: use get_page_foll() in follow_hugetlb_page()
get_page_foll() is more optimal and is always safe to use under the PT
lock.  More so for hugetlbfs as there's no risk of race conditions with
split_huge_page regardless of the PT lock.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:43 -08:00
Dan Williams
0abdd7a81b dma-debug: introduce debug_dma_assert_idle()
Record actively mapped pages and provide an api for asserting a given
page is dma inactive before execution proceeds.  Placing
debug_dma_assert_idle() in cow_user_page() flagged the violation of the
dma-api in the NET_DMA implementation (see commit 7787380336 "net_dma:
mark broken").

The implementation includes the capability to count, in a limited way,
repeat mappings of the same page that occur without an intervening
unmap.  This 'overlap' counter is limited to the few bits of tag space
in a radix tree.  This mechanism is added to mitigate false negative
cases where, for example, a page is dma mapped twice and
debug_dma_assert_idle() is called after the page is un-mapped once.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:41 -08:00
Mikulas Patocka
03e5ac2fc3 mm: fix crash when using XFS on loopback
Commit 8456a648cf ("slab: use struct page for slab management") causes
a crash in the LVM2 testsuite on PA-RISC (the crashing test is
fsadm.sh).  The testsuite doesn't crash on 3.12, crashes on 3.13-rc1 and
later.

 Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
 CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
 task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000

      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
 PSW: 00001000000001101111100100001110 Not tainted
 r00-03  000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
 r04-07  00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
 r08-11  0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
 r12-15  0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
 r16-19  000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
 r20-23  0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
 r24-27  00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
 r28-31  202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
 sr00-03  000000000532c000 0000000000000000 0000000000000000 000000000532c000
 sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000

 IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
  IIR: 539c0030    ISR: 00000000202d6000  IOR: 000006202224647d
  CPU:        3   CR30: 000000413edd8000 CR31: 0000000000000000
  ORIG_R28: 00000000405a95e0
  IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
  IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
  RP(r2): flush_dcache_page+0x128/0x388
 Backtrace:
   flush_dcache_page+0x128/0x388
   lo_splice_actor+0x90/0x148 [loop]
   splice_from_pipe_feed+0xc0/0x1d0
   __splice_from_pipe+0xac/0xc0
   lo_direct_splice_actor+0x1c/0x70 [loop]
   splice_direct_to_actor+0xec/0x228
   lo_receive+0xe4/0x298 [loop]
   loop_thread+0x478/0x640 [loop]
   kthread+0x134/0x168
   end_fault_vector+0x20/0x28
   xfs_setsize_buftarg+0x0/0x90 [xfs]

 Kernel panic - not syncing: Bad Address (null pointer deref?)

Commit 8456a648cf changes the page structure so that the slab
subsystem reuses the page->mapping field.

The crash happens in the following way:
 * XFS allocates some memory from slab and issues a bio to read data
   into it.
 * the bio is sent to the loopback device.
 * lo_receive creates an actor and calls splice_direct_to_actor.
 * lo_splice_actor copies data to the target page.
 * lo_splice_actor calls flush_dcache_page because the page may be
   mapped by userspace.  In that case we need to flush the kernel cache.
 * flush_dcache_page asks for the list of userspace mappings, however
   that page->mapping field is reused by the slab subsystem for a
   different purpose.  This causes the crash.

Note that other architectures without coherent caches (sparc, arm, mips)
also call page_mapping from flush_dcache_page, so they may crash in the
same way.

This patch fixes this bug by testing if the page is a slab page in
page_mapping and returning NULL if it is.

The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
earlier kernels in the same scenario on architectures without cache
coherence when CONFIG_DEBUG_VM is enabled - so it should be backported
to stable kernels.

In the old kernels, the function page_mapping is placed in
include/linux/mm.h, so you should modify the patch accordingly when
backporting it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: John David Anglin <dave.anglin@bell.net>]
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Helge Deller <deller@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-15 14:19:42 +07:00
Hugh Dickins
eecc1e426d thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only
We see General Protection Fault on RSI in copy_page_rep: that RSI is
what you get from a NULL struct page pointer.

  RIP: 0010:[<ffffffff81154955>]  [<ffffffff81154955>] copy_page_rep+0x5/0x10
  RSP: 0000:ffff880136e15c00  EFLAGS: 00010286
  RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200
  RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000
  RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c
  R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001
  R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000
  FS:  00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0
  Call Trace:
    copy_user_huge_page+0x93/0xab
    do_huge_pmd_wp_page+0x710/0x815
    handle_mm_fault+0x15d8/0x1d70
    __do_page_fault+0x14d/0x840
    do_page_fault+0x2f/0x90
    page_fault+0x22/0x30

do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but
since shrink_huge_zero_page() can free the huge_zero_page, and we have
no hold of our own on it here (except where the fourth test holds
page_table_lock and has checked pmd_same), it's possible for it to
answer yes the first time, but no to the second or third test.  Change
all those last three to tests for NULL page.

(Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
in https://lkml.org/lkml/2013/3/29/103.  I believe that one is due
to the source page being split, and a tail page freed, while copy
is in progress; and not a problem without DEBUG_PAGEALLOC, since
the pmd_same check will prevent a miscopy from being made visible.)

Fixes: 97ae17497e ("thp: implement refcounting for huge zero page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v3.10 v3.11 v3.12
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-12 16:47:15 +07:00
Naoya Horiguchi
a3e0f9e47d mm/memory-failure.c: transfer page count from head page to tail page after split thp
Memory failures on thp tail pages cause kernel panic like below:

   mce: [Hardware Error]: Machine check events logged
   MCE exception done on CPU 7
   BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
   IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0
   PGD bae42067 PUD ba47d067 PMD 0
   Oops: 0000 [#1] SMP
  ...
   CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G   M       O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
  ...
   Call Trace:
     me_huge_page+0x3e/0x50
     memory_failure+0x4bb/0xc20
     mce_process_work+0x3e/0x70
     process_one_work+0x171/0x420
     worker_thread+0x11b/0x3a0
     ? manage_workers.isra.25+0x2b0/0x2b0
     kthread+0xe4/0x100
     ? kthread_create_on_node+0x190/0x190
     ret_from_fork+0x7c/0xb0
     ? kthread_create_on_node+0x190/0x190
  ...
   RIP   dequeue_hwpoisoned_huge_page+0x131/0x1e0
   CR2: 0000000000000058

The reasoning of this problem is shown below:
 - when we have a memory error on a thp tail page, the memory error
   handler grabs a refcount of the head page to keep the thp under us.
 - Before unmapping the error page from processes, we split the thp,
   where page refcounts of both of head/tail pages don't change.
 - Then we call try_to_unmap() over the error page (which was a tail
   page before). We didn't pin the error page to handle the memory error,
   this error page is freed and removed from LRU list.
 - We never have the error page on LRU list, so the first page state
   check returns "unknown page," then we move to the second check
   with the saved page flag.
 - The saved page flag have PG_tail set, so the second page state check
   returns "hugepage."
 - We call me_huge_page() for freed error page, then we hit the above panic.

The root cause is that we didn't move refcount from the head page to the
tail page after split thp.  So this patch suggests to do this.

This panic was introduced by commit 524fca1e73 ("HWPOISON: fix
misjudgement of page_action() for errors on mlocked pages").  Note that we
did have the same refcount problem before this commit, but it was just
ignored because we had only first page state check which returned "unknown
page." The commit changed the refcount problem from "doesn't work" to
"kernel panic."

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org>	[3.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02 14:40:30 -08:00
Mel Gorman
d0319bd52e mm: remove bogus warning in copy_huge_pmd()
Sasha Levin reported the following warning being triggered

  WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0()
  Call Trace:
    copy_huge_pmd+0x145/0x3a0
    copy_page_range+0x3f2/0x560
    dup_mmap+0x2c9/0x3d0
    dup_mm+0xad/0x150
    copy_process+0xa68/0x12e0
    do_fork+0x96/0x270
    SyS_clone+0x16/0x20
    stub_clone+0x69/0x90

This warning was introduced by "mm: numa: Avoid unnecessary disruption
of NUMA hinting during migration" for paranoia reasons but the warning
is bogus.  I was thinking of parallel races between NUMA hinting faults
and forks but this warning would also be triggered by a parallel reclaim
splitting a THP during a fork.  Remote the bogus warning.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02 14:40:30 -08:00
Vladimir Davydov
695c608307 memcg: fix memcg_size() calculation
The mem_cgroup structure contains nr_node_ids pointers to
mem_cgroup_per_node objects, not the objects themselves.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02 14:40:30 -08:00
Rik van Riel
4eb919825e mm: fix use-after-free in sys_remap_file_pages
remap_file_pages calls mmap_region, which may merge the VMA with other
existing VMAs, and free "vma".  This can lead to a use-after-free bug.
Avoid the bug by remembering vm_flags before calling mmap_region, and
not trying to dereference vma later.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02 14:40:30 -08:00
Vlastimil Babka
3b25df93c6 mm: munlock: fix deadlock in __munlock_pagevec()
Commit 7225522bb4 ("mm: munlock: batch non-THP page isolation and
munlock+putback using pagevec" introduced __munlock_pagevec() to speed
up munlock by holding lru_lock over multiple isolated pages.  Pages that
fail to be isolated are put_page()d immediately, also within the lock.

This can lead to deadlock when __munlock_pagevec() becomes the holder of
the last page pin and put_page() leads to __page_cache_release() which
also locks lru_lock.  The deadlock has been observed by Sasha Levin
using trinity.

This patch avoids the deadlock by deferring put_page() operations until
lru_lock is released.  Another pagevec (which is also used by later
phases of the function is reused to gather the pages for put_page()
operation.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02 14:40:30 -08:00
Vlastimil Babka
c424be1cbb mm: munlock: fix a bug where THP tail page is encountered
Since commit ff6a6da60b ("mm: accelerate munlock() treatment of THP
pages") munlock skips tail pages of a munlocked THP page.  However, when
the head page already has PageMlocked unset, it will not skip the tail
pages.

Commit 7225522bb4 ("mm: munlock: batch non-THP page isolation and
munlock+putback using pagevec") has added a PageTransHuge() check which
contains VM_BUG_ON(PageTail(page)).  Sasha Levin found this triggered
using trinity, on the first tail page of a THP page without PageMlocked
flag.

This patch fixes the issue by skipping tail pages also in the case when
PageMlocked flag is unset.  There is still a possibility of race with
THP page split between clearing PageMlocked and determining how many
pages to skip.  The race might result in former tail pages not being
skipped, which is however no longer a bug, as during the skip the
PageTail flags are cleared.

However this race also affects correctness of NR_MLOCK accounting, which
is to be fixed in a separate patch.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02 14:40:30 -08:00
Benjamin LaHaise
8e321fefb0 aio/migratepages: make aio migrate pages sane
The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.

While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-12-21 17:56:08 -05:00
Olof Johansson
40b64acd17 mm: fix build of split ptlock code
Commit 597d795a2a ('mm: do not allocate page->ptl dynamically, if
spinlock_t fits to long') restructures some allocators that are compiled
even if USE_SPLIT_PTLOCKS arn't used.  It results in compilation
failure:

  mm/memory.c:4282:6: error: 'struct page' has no member named 'ptl'
  mm/memory.c:4288:12: error: 'struct page' has no member named 'ptl'

Add in the missing ifdef.

Fixes: 597d795a2a ('mm: do not allocate page->ptl dynamically, if spinlock_t fits to long')
Signed-off-by: Olof Johansson <olof@lixom.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-20 15:41:27 -08:00
Kirill A. Shutemov
597d795a2a mm: do not allocate page->ptl dynamically, if spinlock_t fits to long
In struct page we have enough space to fit long-size page->ptl there,
but we use dynamically-allocated page->ptl if size(spinlock_t) is larger
than sizeof(int).

It hurts 64-bit architectures with CONFIG_GENERIC_LOCKBREAK, where
sizeof(spinlock_t) == 8, but it easily fits into struct page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-20 12:25:45 -08:00
Johannes Weiner
fff4068cba mm: page_alloc: revert NUMA aspect of fair allocation policy
Commit 81c0a2bb51 ("mm: page_alloc: fair zone allocator policy") meant
to bring aging fairness among zones in system, but it was overzealous
and badly regressed basic workloads on NUMA systems.

Due to the way kswapd and page allocator interacts, we still want to
make sure that all zones in any given node are used equally for all
allocations to maximize memory utilization and prevent thrashing on the
highest zone in the node.

While the same principle applies to NUMA nodes - memory utilization is
obviously improved by spreading allocations throughout all nodes -
remote references can be costly and so many workloads prefer locality
over memory utilization.  The original change assumed that
zone_reclaim_mode would be a good enough predictor for that, but it
turned out to be as indicative as a coin flip.

Revert the NUMA aspect of the fairness until we can find a proper way to
make it configurable and agree on a sane default.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@kernel.org> # 3.12
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-20 12:19:18 -08:00
Mel Gorman
8798cee2f9 Revert "mm: page_alloc: exclude unreclaimable allocations from zone fairness policy"
This reverts commit 73f038b863.  The NUMA behaviour of this patch is
less than ideal.  An alternative approch is to interleave allocations
only within local zones which is implemented in the next patch.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-20 12:19:18 -08:00
Jianguo Wu
98398c32f6 mm/hugetlb: check for pte NULL pointer in __page_check_address()
In __page_check_address(), if address's pud is not present,
huge_pte_offset() will return NULL, we should check the return value.

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: qiuxishi <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:52 -08:00
Wanpeng Li
11c731e81b mm/mempolicy: fix !vma in new_vma_page()
BUG_ON(!vma) assumption is introduced by commit 0bf598d863 ("mbind:
add BUG_ON(!vma) in new_vma_page()"), however, even if

    address = __vma_address(page, vma);

and

    vma->start < address < vma->end

page_address_in_vma() may still return -EFAULT because of many other
conditions in it.  As a result the while loop in new_vma_page() may end
with vma=NULL.

This patch revert the commit and also fix the potential dereference NULL
pointer reported by Dan.

   http://marc.info/?l=linux-mm&m=137689530323257&w=2

  kernel BUG at mm/mempolicy.c:1204!
  invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  CPU: 3 PID: 7056 Comm: trinity-child3 Not tainted 3.13.0-rc3+ #2
  task: ffff8801ca5295d0 ti: ffff88005ab20000 task.ti: ffff88005ab20000
  RIP: new_vma_page+0x70/0x90
  RSP: 0000:ffff88005ab21db0  EFLAGS: 00010246
  RAX: fffffffffffffff2 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000008040075 RSI: ffff8801c3d74600 RDI: ffffea00079a8b80
  RBP: ffff88005ab21dc8 R08: 0000000000000004 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffff2
  R13: ffffea00079a8b80 R14: 0000000000400000 R15: 0000000000400000

  FS:  00007ff49c6f4740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007ff49c68f994 CR3: 000000005a205000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Stack:
   ffffea00079a8b80 ffffea00079a8bc0 ffffea00079a8ba0 ffff88005ab21e50
   ffffffff811adc7a 0000000000000000 ffff8801ca5295d0 0000000464e224f8
   0000000000000000 0000000000000002 0000000000000000 ffff88020ce75c00
  Call Trace:
    migrate_pages+0x12a/0x850
    SYSC_mbind+0x513/0x6a0
    SyS_mbind+0xe/0x10
    ia32_do_call+0x13/0x13
  Code: 85 c0 75 2f 4c 89 e1 48 89 da 31 f6 bf da 00 02 00 65 44 8b 04 25 08 f7 1c 00 e8 ec fd ff ff 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 4c 89 e6 48 89 df ba 01 00 00 00 e8 48
  RIP  [<ffffffff8119f200>] new_vma_page+0x70/0x90
   RSP <ffff88005ab21db0>

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:52 -08:00
Jianguo Wu
a49ecbcd7b mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
After a successful hugetlb page migration by soft offline, the source
page will either be freed into hugepage_freelists or buddy(over-commit
page).  If page is in buddy, page_hstate(page) will be NULL.  It will
hit a NULL pointer dereference in dequeue_hwpoisoned_huge_page().

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
  IP: [<ffffffff81163761>] dequeue_hwpoisoned_huge_page+0x131/0x1d0
  PGD c23762067 PUD c24be2067 PMD 0
  Oops: 0000 [#1] SMP

So check PageHuge(page) after call migrate_pages() successfully.

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:52 -08:00
Joonsoo Kim
6815bf3f23 mm/compaction: respect ignore_skip_hint in update_pageblock_skip
update_pageblock_skip() only fits to compaction which tries to isolate
by pageblock unit.  If isolate_migratepages_range() is called by CMA, it
try to isolate regardless of pageblock unit and it don't reference
get_pageblock_skip() by ignore_skip_hint.  We should also respect it on
update_pageblock_skip() to prevent from setting the wrong information.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>	[3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:52 -08:00
Joonsoo Kim
b0e5fd7359 mm/mempolicy: correct putback method for isolate pages if failed
queue_pages_range() isolates hugetlbfs pages and putback_lru_pages()
can't handle these.  We should change it to putback_movable_pages().

Naoya said that it is worth going into stable, because it can break
in-use hugepage list.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>	[3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:52 -08:00
Sima Baymani
a844f38671 mm: add missing dependency in Kconfig
Eliminate the following (rand)config warning by adding missing PROC_FS
dependency:

  warning: (HWPOISON_INJECT && MEM_SOFT_DIRTY) selects PROC_PAGE_MONITOR which has unmet direct dependencies (PROC_FS && MMU)

Signed-off-by: Sima Baymani <sima.baymani@gmail.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:52 -08:00
Johannes Weiner
73f038b863 mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
commit 81c0a2bb51 ("mm: page_alloc: fair zone allocator policy").
That change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator and
slab.

The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone.  It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.

Only round-robin allocations that are usually freed through page reclaim
or slab shrinking.

Bisected by Dave Hansen.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
b0943d61b8 mm: numa: defer TLB flush for THP migration as long as possible
THP migration can fail for a variety of reasons.  Avoid flushing the TLB
to deal with THP migration races until the copy is ready to start.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Rik van Riel
2084140594 mm: fix TLB flush race between migration, and change_protection_range
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.

The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.

During that time, a CPU may continue writing to the page.

This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.

When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.

This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).

The basic race looks like this:

CPU A			CPU B			CPU C

						load TLB entry
make entry PTE/PMD_NUMA
			fault on entry
						read/write old page
			start migrating page
			change PTE/PMD to new page
						read/write old page [*]
flush TLB
						reload TLB from new entry
						read/write new page
						lose data

[*] the old page may belong to a new user at this point!

The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.

This should fix both NUMA migration and compaction.

[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
de466bd628 mm: numa: avoid unnecessary disruption of NUMA hinting during migration
do_huge_pmd_numa_page() handles the case where there is parallel THP
migration.  However, by the time it is checked the NUMA hinting
information has already been disrupted.  This patch adds an earlier
check with some helpers.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
1667918b64 mm: numa: clear numa hinting information on mprotect
On a protection change it is no longer clear if the page should be still
accessible.  This patch clears the NUMA hinting fault bits on a
protection change.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
eb4489f69f mm: numa: avoid unnecessary work on the failure path
If a PMD changes during a THP migration then migration aborts but the
failure path is doing more work than is necessary.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
c3a489cac3 mm: numa: ensure anon_vma is locked to prevent parallel THP splits
The anon_vma lock prevents parallel THP splits and any associated
complexity that arises when handling splits during THP migration.  This
patch checks if the lock was successfully acquired and bails from THP
migration if it failed for any reason.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
0c5f83c23c mm: numa: do not clear PTE for pte_numa update
The TLB must be flushed if the PTE is updated but change_pte_range is
clearing the PTE while marking PTEs pte_numa without necessarily
flushing the TLB if it reinserts the same entry.  Without the flush,
it's conceivable that two processors have different TLBs for the same
virtual address and at the very least it would generate spurious faults.

This patch only unmaps the pages in change_pte_range for a full
protection change.

[riel@redhat.com: write pte_numa pte back to the page tables]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
5a6dac3ec5 mm: numa: do not clear PMD during PTE update scan
If the PMD is flushed then a parallel fault in handle_mm_fault() will
enter the pmd_none and do_huge_pmd_anonymous_page() path where it'll
attempt to insert a huge zero page.  This is wasteful so the patch
avoids clearing the PMD when setting pmd_numa.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
67f87463d3 mm: clear pmd_numa before invalidating
On x86, PMD entries are similar to _PAGE_PROTNONE protection and are
handled as NUMA hinting faults.  The following two page table protection
bits are what defines them

	_PAGE_NUMA:set	_PAGE_PRESENT:clear

A PMD is considered present if any of the _PAGE_PRESENT, _PAGE_PROTNONE,
_PAGE_PSE or _PAGE_NUMA bits are set.  If pmdp_invalidate encounters a
pmd_numa, it clears the present bit leaving _PAGE_NUMA which will be
considered not present by the CPU but present by pmd_present.  The
existing caller of pmdp_invalidate should handle it but it's an
inconsistent state for a PMD.  This patch keeps the state consistent
when calling pmdp_invalidate.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
f714f4f20e mm: numa: call MMU notifiers on THP migration
MMU notifiers must be called on THP page migration or secondary MMUs
will get very confused.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
2b4847e730 mm: numa: serialise parallel get_user_page against THP migration
Base pages are unmapped and flushed from cache and TLB during normal
page migration and replaced with a migration entry that causes any
parallel NUMA hinting fault or gup to block until migration completes.

THP does not unmap pages due to a lack of support for migration entries
at a PMD level.  This allows races with get_user_pages and
get_user_pages_fast which commit 3f926ab945 ("mm: Close races between
THP migration and PMD numa clearing") made worse by introducing a
pmd_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against
THP migration and properly account for the NUMA hinting fault.  On the
migration side the page table lock is taken for each PTE update.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:50 -08:00
Johannes Weiner
1f14c1ac19 mm: memcg: do not allow task about to OOM kill to bypass the limit
Commit 4942642080 ("mm: memcg: handle non-error OOM situations more
gracefully") allowed tasks that already entered a memcg OOM condition to
bypass the memcg limit on subsequent allocation attempts hoping this
would expedite finishing the page fault and executing the kill.

David Rientjes is worried that this breaks memcg isolation guarantees
and since there is no evidence that the bypass actually speeds up fault
processing just change it so that these subsequent charge attempts fail
outright.  The notable exception being __GFP_NOFAIL charges which are
required to bypass the limit regardless.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-bt: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 18:19:26 -08:00
Johannes Weiner
96f1c58d85 mm: memcg: fix race condition between memcg teardown and swapin
There is a race condition between a memcg being torn down and a swapin
triggered from a different memcg of a page that was recorded to belong
to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension).  The
result is unreclaimable pages pointing to dead memcgs, which can lead to
anything from endless loops in later memcg teardown (the page is charged
to all hierarchical parents but is not on any LRU list) or crashes from
following the dangling memcg pointer.

Memcgs with tasks in them can not be torn down and usually charges don't
show up in memcgs without tasks.  Swapin with the CONFIG_MEMCG_SWAP
extension is the notable exception because it charges the cgroup that
was recorded as owner during swapout, which may be empty and in the
process of being torn down when a task in another memcg triggers the
swapin:

  teardown:                 swapin:

                            lookup_swap_cgroup_id()
                            rcu_read_lock()
                            mem_cgroup_lookup()
                            css_tryget()
                            rcu_read_unlock()
  disable css_tryget()
  call_rcu()
    offline_css()
      reparent_charges()
                            res_counter_charge() (hierarchical!)
                            css_put()
                              css_free()
                            pc->mem_cgroup = dead memcg
                            add page to dead lru

Add a final reparenting step into css_free() to make sure any such raced
charges are moved out of the memcg before it's finally freed.

In the longer term it would be cleaner to have the css_tryget() and the
res_counter charge under the same RCU lock section so that the charge
reparenting is deferred until the last charge whose tryget succeeded is
visible.  But this will require more invasive changes that will be
harder to evaluate and backport into stable, so better defer them to a
separate change set.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 18:19:26 -08:00
Kirill A. Shutemov
3592806cfa thp: move preallocated PTE page table on move_huge_pmd()
Andrey Wagin reported crash on VM_BUG_ON() in pgtable_pmd_page_dtor() with
fallowing backtrace:

  free_pgd_range+0x2bf/0x410
  free_pgtables+0xce/0x120
  unmap_region+0xe0/0x120
  do_munmap+0x249/0x360
  move_vma+0x144/0x270
  SyS_mremap+0x3b9/0x510
  system_call_fastpath+0x16/0x1b

The crash can be reproduce with this test case:

  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <stdio.h>
  #include <unistd.h>

  #define MB (1024 * 1024UL)
  #define GB (1024 * MB)

  int main(int argc, char **argv)
  {
	char *p;
	int i;

	p = mmap((void *) GB, 10 * MB, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	for (i = 0; i < 10 * MB; i += 4096)
		p[i] = 1;
	mremap(p, 10 * MB, 10 * MB, MREMAP_FIXED | MREMAP_MAYMOVE, 2 * GB);
	return 0;
  }

Due to split PMD lock, we now store preallocated PTE tables for THP
pages per-PMD table.  It means we need to move them to other PMD table
if huge PMD moved there.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Andrey Vagin <avagin@openvz.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 18:19:26 -08:00
Johannes Weiner
a0d8b00a33 mm: memcg: do not declare OOM from __GFP_NOFAIL allocations
Commit 84235de394 ("fs: buffer: move allocation failure loop into the
allocator") started recognizing __GFP_NOFAIL in memory cgroups but
forgot to disable the OOM killer.

Any task that does not fail allocation will also not enter the OOM
completion path.  So don't declare an OOM state in this case or it'll be
leaked and the task be able to bypass the limit until the next
userspace-triggered page fault cleans up the OOM state.

Reported-by: William Dauchy <wdauchy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 18:19:26 -08:00
Eric Paris
c727709092 security: shmem: implement kernel private shmem inodes
We have a problem where the big_key key storage implementation uses a
shmem backed inode to hold the key contents.  Because of this detail of
implementation LSM checks are being done between processes trying to
read the keys and the tmpfs backed inode.  The LSM checks are already
being handled on the key interface level and should not be enforced at
the inode level (since the inode is an implementation detail, not a
part of the security model)

This patch implements a new function shmem_kernel_file_setup() which
returns the equivalent to shmem_file_setup() only the underlying inode
has S_PRIVATE set.  This means that all LSM checks for the inode in
question are skipped.  It should only be used for kernel internal
operations where the inode is not exposed to userspace without proper
LSM checking.  It is possible that some other users of
shmem_file_setup() should use the new interface, but this has not been
explored.

Reproducing this bug is a little bit difficult.  The steps I used on
Fedora are:

 (1) Turn off selinux enforcing:

	setenforce 0

 (2) Create a huge key

	k=`dd if=/dev/zero bs=8192 count=1 | keyctl padd big_key test-key @s`

 (3) Access the key in another context:

	runcon system_u:system_r:httpd_t:s0-s0:c0.c1023 keyctl print $k >/dev/null

 (4) Examine the audit logs:

	ausearch -m AVC -i --subject httpd_t | audit2allow

If the last command's output includes a line that looks like:

	allow httpd_t user_tmpfs_t:file { open read };

There was an inode check between httpd and the tmpfs filesystem.  With
this patch no such denial will be seen.  (NOTE! you should clear your
audit log if you have tested for this previously)

(Please return you box to enforcing)

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hugh Dickins <hughd@google.com>
cc: linux-mm@kvack.org
2013-12-02 11:24:19 +00:00
Linus Torvalds
24f971abbd Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
Pull SLAB changes from Pekka Enberg:
 "The patches from Joonsoo Kim switch mm/slab.c to use 'struct page' for
  slab internals similar to mm/slub.c.  This reduces memory usage and
  improves performance:

    https://lkml.org/lkml/2013/10/16/155

  Rest of the changes are bug fixes from various people"

* 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (21 commits)
  mm, slub: fix the typo in mm/slub.c
  mm, slub: fix the typo in include/linux/slub_def.h
  slub: Handle NULL parameter in kmem_cache_flags
  slab: replace non-existing 'struct freelist *' with 'void *'
  slab: fix to calm down kmemleak warning
  slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled
  slab: rename slab_bufctl to slab_freelist
  slab: remove useless statement for checking pfmemalloc
  slab: use struct page for slab management
  slab: replace free and inuse in struct slab with newly introduced active
  slab: remove SLAB_LIMIT
  slab: remove kmem_bufctl_t
  slab: change the management method of free objects of the slab
  slab: use __GFP_COMP flag for allocating slab pages
  slab: use well-defined macro, virt_to_slab()
  slab: overloading the RCU head over the LRU for RCU free
  slab: remove cachep in struct slab_rcu
  slab: remove nodeid in struct slab
  slab: remove colouroff in struct slab
  slab: change return type of kmem_getpages() to struct page
  ...
2013-11-22 08:10:34 -08:00
David Rientjes
b7a9f420ed mm, mempolicy: silence gcc warning
Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:

  mm/mempolicy.c: In function 'mpol_to_str':
  mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments

Kees says this is because he is using -Wformat-security.

Silence the warning.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:27 -08:00
Andrea Arcangeli
27c73ae759 mm: hugetlbfs: fix hugetlbfs optimization
Commit 7cb2ef56e6 ("mm: fix aio performance regression for database
caused by THP") can cause dereference of a dangling pointer if
split_huge_page runs during PageHuge() if there are updates to the
tail_page->private field.

Also it is repeating compound_head twice for hugetlbfs and it is running
compound_head+compound_trans_head for THP when a single one is needed in
both cases.

The new code within the PageSlab() check doesn't need to verify that the
THP page size is never bigger than the smallest hugetlbfs page size, to
avoid memory corruption.

A longstanding theoretical race condition was found while fixing the
above (see the change right after the skip_unlock label, that is
relevant for the compound_lock path too).

By re-establishing the _mapcount tail refcounting for all compound
pages, this also fixes the below problem:

  echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

  BUG: Bad page state in process bash  pfn:59a01
  page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
  page flags: 0x1c00000000008000(tail)
  Modules linked in:
  CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:27 -08:00
Dave Hansen
30b0a105d9 mm: thp: give transparent hugepage code a separate copy_page
Right now, the migration code in migrate_page_copy() uses copy_huge_page()
for hugetlbfs and thp pages:

       if (PageHuge(page) || PageTransHuge(page))
                copy_huge_page(newpage, page);

So, yay for code reuse.  But:

  void copy_huge_page(struct page *dst, struct page *src)
  {
        struct hstate *h = page_hstate(src);

and a non-hugetlbfs page has no page_hstate().  This works 99% of the
time because page_hstate() determines the hstate from the page order
alone.  Since the page order of a THP page matches the default hugetlbfs
page order, it works.

But, if you change the default huge page size on the boot command-line
(say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
so page_hstate() returns null and copy_huge_page() oopses pretty fast
since copy_huge_page() dereferences the hstate:

  void copy_huge_page(struct page *dst, struct page *src)
  {
        struct hstate *h = page_hstate(src);
        if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
  ...

Mel noticed that the migration code is really the only user of these
functions.  This moves all the copy code over to migrate.c and makes
copy_huge_page() work for THP by checking for it explicitly.

I believe the bug was introduced in commit b32967ff10 ("mm: numa: Add
THP migration for the NUMA working set scanning fault case")

[akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:27 -08:00
Linus Torvalds
8b2e9b712f Revert "mm: create a separate slab for page->ptl allocation"
This reverts commit ea1e7ed337.

Al points out that while the commit *does* actually create a separate
slab for the page->ptl allocation, that slab is never actually used, and
the code continues to use kmalloc/kfree.

Damien Wyart points out that the original patch did have the conversion
to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.

Revert the half-arsed attempt that didn't do anything.  If we really do
want the special slab (remember: this is all relevant just for debug
builds, so it's not necessarily all that critical) we might as well redo
the patch fully.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-20 14:41:47 -08:00
Linus Torvalds
9073e1a804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual earth-shaking, news-breaking, rocket science pile from
  trivial.git"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
  doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
  doc: add missing files to timers/00-INDEX
  timekeeping: Fix some trivial typos in comments
  mm: Fix some trivial typos in comments
  irq: Fix some trivial typos in comments
  NUMA: fix typos in Kconfig help text
  mm: update 00-INDEX
  doc: Documentation/DMA-attributes.txt fix typo
  DRM: comment: `halve' -> `half'
  Docs: Kconfig: `devlopers' -> `developers'
  doc: typo on word accounting in kprobes.c in mutliple architectures
  treewide: fix "usefull" typo
  treewide: fix "distingush" typo
  mm/Kconfig: Grammar s/an/a/
  kexec: Typo s/the/then/
  Documentation/kvm: Update cpuid documentation for steal time and pv eoi
  treewide: Fix common typo in "identify"
  __page_to_pfn: Fix typo in comment
  Correct some typos for word frequency
  clk: fixed-factor: Fix a trivial typo
  ...
2013-11-15 16:47:22 -08:00
Stefani Seibold
498d319bb5 kfifo API type safety
This patch enhances the type safety for the kfifo API.  It is now safe
to put const data into a non const FIFO and the API will now generate a
compiler warning when reading from the fifo where the destination
address is pointing to a const variable.

As a side effect the kfifo_put() does now expect the value of an element
instead a pointer to the element.  This was suggested Russell King.  It
make the handling of the kfifo_put easier since there is no need to
create a helper variable for getting the address of a pointer or to pass
integers of different sizes.

IMHO the API break is okay, since there are currently only six users of
kfifo_put().

The code is also cleaner by kicking out the "if (0)" expressions.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:23 +09:00
Kirill A. Shutemov
ea1e7ed337 mm: create a separate slab for page->ptl allocation
If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
is 72 bytes.  For page->ptl they will be allocated from kmalloc-96 slab,
so we loose 24 on each.  An average system can easily allocate few tens
thousands of page->ptl and overhead is significant.

Let's create a separate slab for page->ptl allocation to solve this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Peter Zijlstra
539edb5846 mm: properly separate the bloated ptl from the regular case
Use kernel/bounds.c to convert build-time spinlock_t size check into a
preprocessor symbol and apply that to properly separate the page::ptl
situation.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Kirill A. Shutemov
49076ec2cc mm: dynamically allocate page->ptl if it cannot be embedded to struct page
If split page table lock is in use, we embed the lock into struct page
of table's page.  We have to disable split lock, if spinlock_t is too
big be to be embedded, like when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC
enabled.

This patch add support for dynamic allocation of split page table lock
if we can't embed it to struct page.

page->ptl is unsigned long now and we use it as spinlock_t if
sizeof(spinlock_t) <= sizeof(long), otherwise it's pointer to spinlock_t.

The spinlock_t allocated in pgtable_page_ctor() for PTE table and in
pgtable_pmd_page_ctor() for PMD table.  All other helpers converted to
support dynamically allocated page->ptl.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Kirill A. Shutemov
e009bb30c8 mm: implement split page table lock for PMD level
The basic idea is the same as with PTE level: the lock is embedded into
struct page of table's page.

We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
take mm->page_table_lock anymore.  Let's reuse page->lru of table's page
for that.

pgtable_pmd_page_ctor() returns true, if initialization is successful
and false otherwise.  Current implementation never fails, but assumption
that constructor can fail will help to port it to -rt where spinlock_t
is rather huge and cannot be embedded into struct page -- dynamic
allocation is required.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:15 +09:00
Kirill A. Shutemov
c4088ebdca mm: convert the rest to new page table lock api
Only trivial cases left. Let's convert them altogether.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:15 +09:00
Kirill A. Shutemov
cb900f4121 mm, hugetlb: convert hugetlbfs to use split pmd lock
Hugetlb supports multiple page sizes. We use split lock only for PMD
level, but not for PUD.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov
c389a250ab mm, thp: do not access mm->pmd_huge_pte directly
Currently mm->pmd_huge_pte protected by page table lock.  It will not
work with split lock.  We have to have per-pmd pmd_huge_pte for proper
access serialization.

For now, let's just introduce wrapper to access mm->pmd_huge_pte.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov
117b0791ac mm, thp: move ptl taking inside page_check_address_pmd()
With split page table lock we can't know which lock we need to take
before we find the relevant pmd.

Let's move lock taking inside the function.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov
bf929152e9 mm, thp: change pmd_trans_huge_lock() to return taken lock
With split ptlock it's important to know which lock
pmd_trans_huge_lock() took.  This patch adds one more parameter to the
function to return the lock.

In most places migration to new api is trivial.  Exception is
move_huge_pmd(): we need to take two locks if pmd tables are different.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov
e1f56c89b0 mm: convert mm->nr_ptes to atomic_long_t
With split page table lock for PMD level we can't hold mm->page_table_lock
while updating nr_ptes.

Let's convert it to atomic_long_t to avoid races.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Kirill A. Shutemov
e9bb18c7b9 mm: avoid increase sizeof(struct page) due to split page table lock
Alex Thorlton noticed that some massively threaded workloads work poorly,
if THP enabled.  This patchset fixes this by introducing split page table
lock for PMD tables.  hugetlbfs is not covered yet.

This patchset is based on work by Naoya Horiguchi.

: akpm result summary:
:
: THP off, v3.12-rc2: 18.059261877 seconds time elapsed
: THP off, patched:   16.768027318 seconds time elapsed
:
: THP on, v3.12-rc2:  42.162306788 seconds time elapsed
: THP on, patched:    8.397885779 seconds time elapsed
:
: HUGETLB, v3.12-rc2: 47.574936948 seconds time elapsed
: HUGETLB, patched:   19.447481153 seconds time elapsed

THP off, v3.12-rc2:
-------------------

 Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    1037072.835207 task-clock                #   57.426 CPUs utilized            ( +-  3.59% )
            95,093 context-switches          #    0.092 K/sec                    ( +-  3.93% )
               140 cpu-migrations            #    0.000 K/sec                    ( +-  5.28% )
        10,000,550 page-faults               #    0.010 M/sec                    ( +-  0.00% )
 2,455,210,400,261 cycles                    #    2.367 GHz                      ( +-  3.62% ) [83.33%]
 2,429,281,882,056 stalled-cycles-frontend   #   98.94% frontend cycles idle     ( +-  3.67% ) [83.33%]
 1,975,960,019,659 stalled-cycles-backend    #   80.48% backend  cycles idle     ( +-  3.88% ) [66.68%]
    46,503,296,013 instructions              #    0.02  insns per cycle
                                             #   52.24  stalled cycles per insn  ( +-  3.21% ) [83.34%]
     9,278,997,542 branches                  #    8.947 M/sec                    ( +-  4.00% ) [83.34%]
        89,881,640 branch-misses             #    0.97% of all branches          ( +-  1.17% ) [83.33%]

      18.059261877 seconds time elapsed                                          ( +-  2.65% )

THP on, v3.12-rc2:
------------------

 Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

    3114745.395974 task-clock                #   73.875 CPUs utilized            ( +-  1.84% )
           267,356 context-switches          #    0.086 K/sec                    ( +-  1.84% )
                99 cpu-migrations            #    0.000 K/sec                    ( +-  1.40% )
            58,313 page-faults               #    0.019 K/sec                    ( +-  0.28% )
 7,416,635,817,510 cycles                    #    2.381 GHz                      ( +-  1.83% ) [83.33%]
 7,342,619,196,993 stalled-cycles-frontend   #   99.00% frontend cycles idle     ( +-  1.88% ) [83.33%]
 6,267,671,641,967 stalled-cycles-backend    #   84.51% backend  cycles idle     ( +-  2.03% ) [66.67%]
   117,819,935,165 instructions              #    0.02  insns per cycle
                                             #   62.32  stalled cycles per insn  ( +-  4.39% ) [83.34%]
    28,899,314,777 branches                  #    9.278 M/sec                    ( +-  4.48% ) [83.34%]
        71,787,032 branch-misses             #    0.25% of all branches          ( +-  1.03% ) [83.33%]

      42.162306788 seconds time elapsed                                          ( +-  1.73% )

HUGETLB, v3.12-rc2:
-------------------

 Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

    2588052.787264 task-clock                #   54.400 CPUs utilized            ( +-  3.69% )
           246,831 context-switches          #    0.095 K/sec                    ( +-  4.15% )
               138 cpu-migrations            #    0.000 K/sec                    ( +-  5.30% )
            21,027 page-faults               #    0.008 K/sec                    ( +-  0.01% )
 6,166,666,307,263 cycles                    #    2.383 GHz                      ( +-  3.68% ) [83.33%]
 6,086,008,929,407 stalled-cycles-frontend   #   98.69% frontend cycles idle     ( +-  3.77% ) [83.33%]
 5,087,874,435,481 stalled-cycles-backend    #   82.51% backend  cycles idle     ( +-  4.41% ) [66.67%]
   133,782,831,249 instructions              #    0.02  insns per cycle
                                             #   45.49  stalled cycles per insn  ( +-  4.30% ) [83.34%]
    34,026,870,541 branches                  #   13.148 M/sec                    ( +-  4.24% ) [83.34%]
        68,670,942 branch-misses             #    0.20% of all branches          ( +-  3.26% ) [83.33%]

      47.574936948 seconds time elapsed                                          ( +-  2.09% )

THP off, patched:
-----------------

 Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

     943301.957892 task-clock                #   56.256 CPUs utilized            ( +-  3.01% )
            86,218 context-switches          #    0.091 K/sec                    ( +-  3.17% )
               121 cpu-migrations            #    0.000 K/sec                    ( +-  6.64% )
        10,000,551 page-faults               #    0.011 M/sec                    ( +-  0.00% )
 2,230,462,457,654 cycles                    #    2.365 GHz                      ( +-  3.04% ) [83.32%]
 2,204,616,385,805 stalled-cycles-frontend   #   98.84% frontend cycles idle     ( +-  3.09% ) [83.32%]
 1,778,640,046,926 stalled-cycles-backend    #   79.74% backend  cycles idle     ( +-  3.47% ) [66.69%]
    45,995,472,617 instructions              #    0.02  insns per cycle
                                             #   47.93  stalled cycles per insn  ( +-  2.51% ) [83.34%]
     9,179,700,174 branches                  #    9.731 M/sec                    ( +-  3.04% ) [83.35%]
        89,166,529 branch-misses             #    0.97% of all branches          ( +-  1.45% ) [83.33%]

      16.768027318 seconds time elapsed                                          ( +-  2.47% )

THP on, patched:
----------------

 Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

     458793.837905 task-clock                #   54.632 CPUs utilized            ( +-  0.79% )
            41,831 context-switches          #    0.091 K/sec                    ( +-  0.97% )
                98 cpu-migrations            #    0.000 K/sec                    ( +-  1.66% )
            57,829 page-faults               #    0.126 K/sec                    ( +-  0.62% )
 1,077,543,336,716 cycles                    #    2.349 GHz                      ( +-  0.81% ) [83.33%]
 1,067,403,802,964 stalled-cycles-frontend   #   99.06% frontend cycles idle     ( +-  0.87% ) [83.33%]
   864,764,616,143 stalled-cycles-backend    #   80.25% backend  cycles idle     ( +-  0.73% ) [66.68%]
    16,129,177,440 instructions              #    0.01  insns per cycle
                                             #   66.18  stalled cycles per insn  ( +-  7.94% ) [83.35%]
     3,618,938,569 branches                  #    7.888 M/sec                    ( +-  8.46% ) [83.36%]
        33,242,032 branch-misses             #    0.92% of all branches          ( +-  2.02% ) [83.32%]

       8.397885779 seconds time elapsed                                          ( +-  0.18% )

HUGETLB, patched:
-----------------

 Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

     395353.076837 task-clock                #   20.329 CPUs utilized            ( +-  8.16% )
            55,730 context-switches          #    0.141 K/sec                    ( +-  5.31% )
               138 cpu-migrations            #    0.000 K/sec                    ( +-  4.24% )
            21,027 page-faults               #    0.053 K/sec                    ( +-  0.00% )
   930,219,717,244 cycles                    #    2.353 GHz                      ( +-  8.21% ) [83.32%]
   914,295,694,103 stalled-cycles-frontend   #   98.29% frontend cycles idle     ( +-  8.35% ) [83.33%]
   704,137,950,187 stalled-cycles-backend    #   75.70% backend  cycles idle     ( +-  9.16% ) [66.69%]
    30,541,538,385 instructions              #    0.03  insns per cycle
                                             #   29.94  stalled cycles per insn  ( +-  3.98% ) [83.35%]
     8,415,376,631 branches                  #   21.286 M/sec                    ( +-  3.61% ) [83.36%]
        32,645,478 branch-misses             #    0.39% of all branches          ( +-  3.41% ) [83.32%]

      19.447481153 seconds time elapsed                                          ( +-  2.00% )

This patch (of 11):

CONFIG_GENERIC_LOCKBREAK increases sizeof(spinlock_t) to 8 bytes.  It
leads to increase sizeof(struct page) by 4 bytes on 32-bit system if split
page table lock is in use, since page->ptl shares space in union with
longs and pointers.

Let's disable split page table lock on 32-bit systems with
GENERIC_LOCKBREAK enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:13 +09:00
Kirill A. Shutemov
b77d88d493 mm: drop actor argument of do_generic_file_read()
There's only one caller of do_generic_file_read() and the only actor is
file_read_actor().  No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:13 +09:00
Linus Torvalds
5e30025a31 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
 "The biggest changes:

   - add lockdep support for seqcount/seqlocks structures, this
     unearthed both bugs and required extra annotation.

   - move the various kernel locking primitives to the new
     kernel/locking/ directory"

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  block: Use u64_stats_init() to initialize seqcounts
  locking/lockdep: Mark __lockdep_count_forward_deps() as static
  lockdep/proc: Fix lock-time avg computation
  locking/doc: Update references to kernel/mutex.c
  ipv6: Fix possible ipv6 seqlock deadlock
  cpuset: Fix potential deadlock w/ set_mems_allowed
  seqcount: Add lockdep functionality to seqcount/seqlock structures
  net: Explicitly initialize u64_stats_sync structures for lockdep
  locking: Move the percpu-rwsem code to kernel/locking/
  locking: Move the lglocks code to kernel/locking/
  locking: Move the rwsem code to kernel/locking/
  locking: Move the rtmutex code to kernel/locking/
  locking: Move the semaphore core to kernel/locking/
  locking: Move the spinlock code to kernel/locking/
  locking: Move the lockdep code to kernel/locking/
  locking: Move the mutex code to kernel/locking/
  hung_task debugging: Add tracepoint to report the hang
  x86/locking/kconfig: Update paravirt spinlock Kconfig description
  lockstat: Report avg wait and hold times
  lockdep, x86/alternatives: Drop ancient lockdep fixup message
  ...
2013-11-14 16:30:30 +09:00
Linus Torvalds
0910c0bdf7 Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block
Pull block IO core updates from Jens Axboe:
 "This is the pull request for the core changes in the block layer for
  3.13.  It contains:

   - The new blk-mq request interface.

     This is a new and more scalable queueing model that marries the
     best part of the request based interface we currently have (which
     is fully featured, but scales poorly) and the bio based "interface"
     which the new drivers for high IOPS devices end up using because
     it's much faster than the request based one.

     The bio interface has no block layer support, since it taps into
     the stack much earlier.  This means that drivers end up having to
     implement a lot of functionality on their own, like tagging,
     timeout handling, requeue, etc.  The blk-mq interface provides all
     these.  Some drivers even provide a switch to select bio or rq and
     has code to handle both, since things like merging only works in
     the rq model and hence is faster for some workloads.  This is a
     huge mess.  Conversion of these drivers nets us a substantial code
     reduction.  Initial results on converting SCSI to this model even
     shows an 8x improvement on single queue devices.  So while the
     model was intended to work on the newer multiqueue devices, it has
     substantial improvements for "classic" hardware as well.  This code
     has gone through extensive testing and development, it's now ready
     to go.  A pull request is coming to convert virtio-blk to this
     model will be will be coming as well, with more drivers scheduled
     for 3.14 conversion.

   - Two blktrace fixes from Jan and Chen Gang.

   - A plug merge fix from Alireza Haghdoost.

   - Conversion of __get_cpu_var() from Christoph Lameter.

   - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

   - A fix for a race between request completion and the timeout
     handling from Jeff Moyer.  This is what caused the merge conflict
     with blk-mq/core, in case you are looking at that.

   - A dm stacking fix from Mike Snitzer.

   - A code consolidation fix and duplicated code removal from Kent
     Overstreet.

   - A handful of block bug fixes from Mikulas Patocka, fixing a loop
     crash and memory corruption on blk cg.

   - Elevator switch bug fix from Tomoki Sekiyama.

  A heads-up that I had to rebase this branch.  Initially the immutable
  bio_vecs had been queued up for inclusion, but a week later, it became
  clear that it wasn't fully cooked yet.  So the decision was made to
  pull this out and postpone it until 3.14.  It was a straight forward
  rebase, just pruning out the immutable series and the later fixes of
  problems with it.  The rest of the patches applied directly and no
  further changes were made"

* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  block: Do not call sector_div() with a 64-bit divisor
  kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
  block: Consolidate duplicated bio_trim() implementations
  block: Use rw_copy_check_uvector()
  block: Enable sysfs nomerge control for I/O requests in the plug list
  block: properly stack underlying max_segment_size to DM device
  elevator: acquire q->sysfs_lock in elevator_change()
  elevator: Fix a race in elevator switching and md device initialization
  block: Replace __get_cpu_var uses
  bdi: test bdi_init failure
  block: fix a probe argument to blk_register_region
  loop: fix crash if blk_alloc_queue fails
  blk-core: Fix memory corruption if blkcg_init_queue fails
  block: fix race between request completion and timeout handling
  blktrace: Send BLK_TN_PROCESS events to all running traces
  blk-mq: don't disallow request merges for req->special being set
  blk-mq: mq plug list breakage
  blk-mq: fix for flush deadlock
  ...
2013-11-14 12:08:14 +09:00
Linus Torvalds
42a2d923cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) The addition of nftables.  No longer will we need protocol aware
    firewall filtering modules, it can all live in userspace.

    At the core of nftables is a, for lack of a better term, virtual
    machine that executes byte codes to inspect packet or metadata
    (arriving interface index, etc.) and make verdict decisions.

    Besides support for loading packet contents and comparing them, the
    interpreter supports lookups in various datastructures as
    fundamental operations.  For example sets are supports, and
    therefore one could create a set of whitelist IP address entries
    which have ACCEPT verdicts attached to them, and use the appropriate
    byte codes to do such lookups.

    Since the interpreted code is composed in userspace, userspace can
    do things like optimize things before giving it to the kernel.

    Another major improvement is the capability of atomically updating
    portions of the ruleset.  In the existing netfilter implementation,
    one has to update the entire rule set in order to make a change and
    this is very expensive.

    Userspace tools exist to create nftables rules using existing
    netfilter rule sets, but both kernel implementations will need to
    co-exist for quite some time as we transition from the old to the
    new stuff.

    Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
    worked so hard on this.

 2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
    to our pseudo-random number generator, mostly used for things like
    UDP port randomization and netfitler, amongst other things.

    In particular the taus88 generater is updated to taus113, and test
    cases are added.

 3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
    and Yang Yingliang.

 4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
    Sujir.

 5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
    Neal Cardwell, and Yuchung Cheng.

 6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
    control message data, much like other socket option attributes.
    From Francesco Fusco.

 7) Allow applications to specify a cap on the rate computed
    automatically by the kernel for pacing flows, via a new
    SO_MAX_PACING_RATE socket option.  From Eric Dumazet.

 8) Make the initial autotuned send buffer sizing in TCP more closely
    reflect actual needs, from Eric Dumazet.

 9) Currently early socket demux only happens for TCP sockets, but we
    can do it for connected UDP sockets too.  Implementation from Shawn
    Bohrer.

10) Refactor inet socket demux with the goal of improving hash demux
    performance for listening sockets.  With the main goals being able
    to use RCU lookups on even request sockets, and eliminating the
    listening lock contention.  From Eric Dumazet.

11) The bonding layer has many demuxes in it's fast path, and an RCU
    conversion was started back in 3.11, several changes here extend the
    RCU usage to even more locations.  From Ding Tianhong and Wang
    Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
    Falico.

12) Allow stackability of segmentation offloads to, in particular, allow
    segmentation offloading over tunnels.  From Eric Dumazet.

13) Significantly improve the handling of secret keys we input into the
    various hash functions in the inet hashtables, TCP fast open, as
    well as syncookies.  From Hannes Frederic Sowa.  The key fundamental
    operation is "net_get_random_once()" which uses static keys.

    Hannes even extended this to ipv4/ipv6 fragmentation handling and
    our generic flow dissector.

14) The generic driver layer takes care now to set the driver data to
    NULL on device removal, so it's no longer necessary for drivers to
    explicitly set it to NULL any more.  Many drivers have been cleaned
    up in this way, from Jingoo Han.

15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.

16) Improve CRC32 interfaces and generic SKB checksum iterators so that
    SCTP's checksumming can more cleanly be handled.  Also from Daniel
    Borkmann.

17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
    using the interface MTU value.  This helps avoid PMTU attacks,
    particularly on DNS servers.  From Hannes Frederic Sowa.

18) Use generic XPS for transmit queue steering rather than internal
    (re-)implementation in virtio-net.  From Jason Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
  random32: add test cases for taus113 implementation
  random32: upgrade taus88 generator to taus113 from errata paper
  random32: move rnd_state to linux/random.h
  random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
  random32: add periodic reseeding
  random32: fix off-by-one in seeding requirement
  PHY: Add RTL8201CP phy_driver to realtek
  xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
  macmace: add missing platform_set_drvdata() in mace_probe()
  ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
  ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
  vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
  ixgbe: add warning when max_vfs is out of range.
  igb: Update link modes display in ethtool
  netfilter: push reasm skb through instead of original frag skbs
  ip6_output: fragment outgoing reassembled skb properly
  MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
  net_sched: tbf: support of 64bit rates
  ixgbe: deleting dfwd stations out of order can cause null ptr deref
  ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
  ...
2013-11-13 17:40:34 +09:00
Linus Torvalds
5cbb3d216e Merge branch 'akpm' (patches from Andrew Morton)
Merge first patch-bomb from Andrew Morton:
 "Quite a lot of other stuff is banked up awaiting further
  next->mainline merging, but this batch contains:

   - Lots of random misc patches
   - OCFS2
   - Most of MM
   - backlight updates
   - lib/ updates
   - printk updates
   - checkpatch updates
   - epoll tweaking
   - rtc updates
   - hfs
   - hfsplus
   - documentation
   - procfs
   - update gcov to gcc-4.7 format
   - IPC"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (269 commits)
  ipc, msg: fix message length check for negative values
  ipc/util.c: remove unnecessary work pending test
  devpts: plug the memory leak in kill_sb
  ./Makefile: export initial ramdisk compression config option
  init/Kconfig: add option to disable kernel compression
  drivers: w1: make w1_slave::flags long to avoid memory corruption
  drivers/w1/masters/ds1wm.cuse dev_get_platdata()
  drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
  drivers/memstick/core/mspro_block.c: fix attributes array allocation
  drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
  kernel/panic.c: reduce 1 byte usage for print tainted buffer
  gcov: reuse kbasename helper
  kernel/gcov/fs.c: use pr_warn()
  kernel/module.c: use pr_foo()
  gcov: compile specific gcov implementation based on gcc version
  gcov: add support for gcc 4.7 gcov format
  gcov: move gcov structs definitions to a gcc version specific file
  kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
  kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
  kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
  ...
2013-11-13 15:45:43 +09:00
Linus Torvalds
9bc9ccd7db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "All kinds of stuff this time around; some more notable parts:

   - RCU'd vfsmounts handling
   - new primitives for coredump handling
   - files_lock is gone
   - Bruce's delegations handling series
   - exportfs fixes

  plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
  ecryptfs: ->f_op is never NULL
  locks: break delegations on any attribute modification
  locks: break delegations on link
  locks: break delegations on rename
  locks: helper functions for delegation breaking
  locks: break delegations on unlink
  namei: minor vfs_unlink cleanup
  locks: implement delegations
  locks: introduce new FL_DELEG lock flag
  vfs: take i_mutex on renamed file
  vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
  vfs: don't use PARENT/CHILD lock classes for non-directories
  vfs: pull ext4's double-i_mutex-locking into common code
  exportfs: fix quadratic behavior in filehandle lookup
  exportfs: better variable name
  exportfs: move most of reconnect_path to helper function
  exportfs: eliminate unused "noprogress" counter
  exportfs: stop retrying once we race with rename/remove
  exportfs: clear DISCONNECTED on all parents sooner
  exportfs: more detailed comment for path_reconnect
  ...
2013-11-13 15:34:18 +09:00
Linus Torvalds
a998646456 Merge branch 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "Not too much activity this time around.  css_id is finally killed and
  a minor update to device_cgroup"

* 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  device_cgroup: remove can_attach
  cgroup: kill css_id
  memcg: stop using css id
  memcg: fail to create cgroup if the cgroup id is too big
  memcg: convert to use cgroup id
  memcg: convert to use cgroup_is_descendant()
2013-11-13 15:21:53 +09:00
Linus Torvalds
c08acff054 Merge branch 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu changes from Tejun Heo:
 "Two smallish changes for percpu.  Two patches to remove unused
  this_cpu_xor() and one to fix a bug in percpu init failure path so
  that it can reach the proper BUG() instead of oopsing earlier"

* 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  x86: remove this_cpu_xor() implementation
  percpu: remove this_cpu_xor() implementation
  percpu: fix bootmem error handling in pcpu_page_first_chunk()
2013-11-13 15:17:16 +09:00
Mel Gorman
72403b4a0f mm: numa: return the number of base pages altered by protection changes
Commit 0255d49184 ("mm: Account for a THP NUMA hinting update as one
PTE update") was added to account for the number of PTE updates when
marking pages prot_numa.  task_numa_work was using the old return value
to track how much address space had been updated.  Altering the return
value causes the scanner to do more work than it is configured or
documented to in a single unit of work.

This patch reverts that commit and accounts for the number of THP
updates separately in vmstat.  It is up to the administrator to
interpret the pair of values correctly.  This is a straight-forward
operation and likely to only be of interest when actively debugging NUMA
balancing problems.

The impact of this patch is that the NUMA PTE scanner will scan slower
when THP is enabled and workloads may converge slower as a result.  On
the flip size system CPU usage should be lower than recent tests
reported.  This is an illustrative example of a short single JVM specjbb
test

specjbb
                       3.12.0                3.12.0
                      vanilla      acctupdates
TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)

              3.12.0      3.12.0
             vanillaacctupdates
User         5169.64     5184.14
System        100.45       80.02
Elapsed       252.75      251.85

Performance is similar but note the reduction in system CPU time.  While
this showed a performance gain, it will not be universal but at least
it'll be behaving as documented.  The vmstats are obviously different but
here is an obvious interpretation of them from mmtests.

                                3.12.0      3.12.0
                               vanillaacctupdates
NUMA page range updates        1408326    11043064
NUMA huge PMD updates                0       21040
NUMA PTE updates               1408326      291624

"NUMA page range updates" == nr_pte_updates and is the value returned to
the NUMA pte scanner.  NUMA huge PMD updates were the number of THP
updates which in combination can be used to calculate how many ptes were
updated from userspace.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:11 +09:00
Jerome Marchand
00619bcc44 mm: factor commit limit calculation
The same calculation is currently done in three differents places.
Factor that code so future changes has to be made at only one place.

[akpm@linux-foundation.org: uninline vm_commit_limit()]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:11 +09:00
Zhi Yong Wu
a1aeb65a4c mm/page_alloc.c: fix comment in zlc_setup()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:11 +09:00
Weijie Yang
0ab0abcf51 mm/zswap: refactor the get/put routines
The refcount routine was not fit the kernel get/put semantic exactly,
There were too many judgement statements on refcount and it could be
minus.

This patch does the following:

 - move refcount judgement to zswap_entry_put() to hide resource free function.

 - add a new function zswap_entry_find_get(), so that callers can use
   easily in the following pattern:

     zswap_entry_find_get
     .../* do something */
     zswap_entry_put

 - to eliminate compile error, move some functions declaration

This patch is based on Minchan Kim <minchan@kernel.org> 's idea and suggestion.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:11 +09:00
Weijie Yang
67d13fe846 mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
Consider the following scenario:

thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
	finished, entry x and its zbud is not freed as its refcount != 0
	now, the swap_map[x] = 0
thread 0: now call zswap_get_swap_cache_page
	swapcache_prepare return -ENOENT because entry x is not used any more
	zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
	zswap_writeback_entry do nothing except put refcount

Now, the memory of zswap_entry x and its zpage leak.

Modify:
 - check the refcount in fail path, free memory if it is not referenced.

 - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
   can be not only caused by nomem but also by invalidate.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:10 +09:00
Qiang Huang
7a67d7abcc memcg, kmem: use cache_from_memcg_idx instead of hard code
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:10 +09:00
Qiang Huang
2ade4de871 memcg, kmem: rename cache_from_memcg to cache_from_memcg_idx
We can't see the relationship with memcg from the parameters,
so the name with memcg_idx would be more reasonable.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:10 +09:00
Qiang Huang
f35c3a8eed memcg, kmem: use is_root_cache instead of hard code
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:10 +09:00
Akira Takeuchi
2afc745f3e mm: ensure get_unmapped_area() returns higher address than mmap_min_addr
This patch fixes the problem that get_unmapped_area() can return illegal
address and result in failing mmap(2) etc.

In case that the address higher than PAGE_SIZE is set to
/proc/sys/vm/mmap_min_addr, the address lower than mmap_min_addr can be
returned by get_unmapped_area(), even if you do not pass any virtual
address hint (i.e.  the second argument).

This is because the current get_unmapped_area() code does not take into
account mmap_min_addr.

This leads to two actual problems as follows:

1. mmap(2) can fail with EPERM on the process without CAP_SYS_RAWIO,
   although any illegal parameter is not passed.

2. The bottom-up search path after the top-down search might not work in
   arch_get_unmapped_area_topdown().

Note: The first and third chunk of my patch, which changes "len" check,
are for more precise check using mmap_min_addr, and not for solving the
above problem.

[How to reproduce]

	--- test.c -------------------------------------------------
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/errno.h>

	int main(int argc, char *argv[])
	{
		void *ret = NULL, *last_map;
		size_t pagesize = sysconf(_SC_PAGESIZE);

		do {
			last_map = ret;
			ret = mmap(0, pagesize, PROT_NONE,
				MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	//		printf("ret=%p\n", ret);
		} while (ret != MAP_FAILED);

		if (errno != ENOMEM) {
			printf("ERR: unexpected errno: %d (last map=%p)\n",
			errno, last_map);
		}

		return 0;
	}
	---------------------------------------------------------------

	$ gcc -m32 -o test test.c
	$ sudo sysctl -w vm.mmap_min_addr=65536
	vm.mmap_min_addr = 65536
	$ ./test  (run as non-priviledge user)
	ERR: unexpected errno: 1 (last map=0x10000)

Signed-off-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com>
Signed-off-by: Kiyoshi Owada <owada.kiyoshi@jp.panasonic.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:10 +09:00
KOSAKI Motohiro
0cbef29a78 mm: __rmqueue_fallback() should respect pageblock type
When __rmqueue_fallback() doesn't find a free block with the required size
it splits a larger page and puts the rest of the page onto the free list.

But it has one serious mistake.  When putting back, __rmqueue_fallback()
always use start_migratetype if type is not CMA.  However,
__rmqueue_fallback() is only called when all of the start_migratetype
queue is empty.  That said, __rmqueue_fallback always puts back memory to
the wrong queue except try_to_steal_freepages() changed pageblock type
(i.e.  requested size is smaller than half of page block).  The end result
is that the antifragmentation framework increases fragmenation instead of
decreasing it.

Mel's original anti fragmentation does the right thing.  But commit
47118af076 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

This patch restores sane and old behavior.  It also removes an incorrect
comment which was introduced by commit fef903efcf ("mm/page_alloc.c:
restructure free-page stealing code and fix a bug").

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:10 +09:00
KOSAKI Motohiro
52c8f6a5ae mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag()
In general, every tracepoint should be zero overhead if it is disabled.
However, trace_mm_page_alloc_extfrag() is one of exception.  It evaluate
"new_type == start_migratetype" even if tracepoint is disabled.

However, the code can be moved into tracepoint's TP_fast_assign() and
TP_fast_assign exist exactly such purpose.  This patch does it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:10 +09:00
KOSAKI Motohiro
5d0f3f72ef mm: fix page_group_by_mobility_disabled breakage
Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
MIGRATE_ISOLATE if page_group_by_mobility_disabled is true.  It rewrites
the argument to MIGRATE_UNMOVABLE and we lost these attribute.

The problem was introduced by commit 49255c619f ("page allocator: move
check for disabled anti-fragmentation out of fastpath").  So a 4 year
old issue may mean that nobody uses page_group_by_mobility_disabled.

But anyway, this patch fixes the problem.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:09 +09:00
Damien Ramonda
af248a0c67 readahead: fix sequential read cache miss detection
The kernel's readahead algorithm sometimes interprets random read
accesses as sequential and triggers unnecessary data prefecthing from
storage device (impacting random read average latency).

In order to identify sequential cache read misses, the readahead
algorithm intends to check whether offset - previous offset == 1
(trivial sequential reads) or offset - previous offset == 0 (sequential
reads not aligned on page boundary):

  if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)

The current offset is stored in the "offset" variable of type "pgoff_t"
(unsigned long), while previous offset is stored in "ra->prev_pos" of
type "loff_t" (long long).  Therefore, operands of the if statement are
implicitly converted to type long long.  Consequently, when previous
offset > current offset (which happens on random pattern), the if
condition is true and access is wrongly interpeted as sequential.  An
unnecessary data prefetching is triggered, impacting the average random
read latency.

Storing the previous offset value in a "pgoff_t" variable (unsigned
long) fixes the sequential read detection logic.

Signed-off-by: Damien Ramonda <damien.ramonda@intel.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Pierre Tardy <pierre.tardy@intel.com>
Acked-by: David Cohen <david.a.cohen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:09 +09:00
Daeseok Youn
4a099fb4bd mm/bootmem.c: remove unused local `map'
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:09 +09:00
Toshi Kani
807a1bd2b2 mm: clear N_CPU from node_states at CPU offline
vmstat_cpuup_callback() is a CPU notifier callback, which marks N_CPU to a
node at CPU online event.  However, it does not update this N_CPU info at
CPU offline event.

Changed vmstat_cpuup_callback() to clear N_CPU when the last CPU in the
node is put into offline, i.e.  the node no longer has any online CPU.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:09 +09:00
Toshi Kani
d7e0b37a87 mm: set N_CPU to node_states during boot
After a system booted, N_CPU is not set to any node as has_cpu shows an
empty line.

  # cat /sys/devices/system/node/has_cpu
  (show-empty-line)

setup_vmstat() registers its CPU notifier callback,
vmstat_cpuup_callback(), which marks N_CPU to a node when a CPU is put
into online.  However, setup_vmstat() is called after all CPUs are
launched in the boot sequence.

Changed setup_vmstat() to mark N_CPU to the nodes with online CPUs at
boot, which is consistent with other operations in
vmstat_cpuup_callback(), i.e.  start_cpu_timer() and
refresh_zone_stat_thresholds().

Also added get_online_cpus() to protect the for_each_online_cpu() loop.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:09 +09:00
Tang Chen
c5320926e3 mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed.  So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE.  By doing this, the
kernel cannot use memory in movable nodes.  This will cause NUMA
performance down.  And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction.  That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches.  So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality.  For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything.  The kernel
will work as before.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:09 +09:00
Tang Chen
79442ed189 mm/memblock.c: introduce bottom-up allocation mode
The Linux kernel cannot migrate pages used by the kernel.  As a result,
kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
memory for the kernel.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
info.  But before SRAT is parsed, memblock has already started to allocate
memory for the kernel.  So we need to prevent memblock from doing this.

In a memory hotplug system, any numa node the kernel resides in should be
unhotpluggable.  And for a modern server, each node could have at least
16GB memory.  So memory around the kernel image is highly likely
unhotpluggable.

So the basic idea is: Allocate memory from the end of the kernel image and
to the higher memory.  Since memory allocation before SRAT is parsed won't
be too much, it could highly likely be in the same node with kernel image.

The current memblock can only allocate memory top-down.  So this patch
introduces a new bottom-up allocation mode to allocate memory bottom-up.
And later when we use this allocation direction to allocate memory, we
will limit the start address above the kernel.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Tang Chen
1402899e43 mm/memblock.c: factor out of top-down allocation
[Problem]

The current Linux cannot migrate pages used by the kernel because of the
kernel direct mapping.  In Linux kernel space, va = pa + PAGE_OFFSET.
When the pa is changed, we cannot simply update the pagetable and keep the
va unmodified.  So the kernel pages are not migratable.

There are also some other issues will cause the kernel pages not
migratable.  For example, the physical address may be cached somewhere and
will be used.  It is not to update all the caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device.  But if pages
are used by the kernel, they are not migratable.  As a result, memory used
by the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult to do.  And
it may cause the kernel performance down and unstable.  So we use the
following way to do memory hotplug.

[What we are doing]

In Linux, memory in one numa node is divided into several zones.  One of
the zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these
memory.  To do this, we need ACPI's help.

In ACPI, SRAT(System Resource Affinity Table) contains NUMA info.  The
memory affinities in SRAT record every memory range in the system, and
also, flags specifying if the memory range is hotpluggable.  (Please refer
to ACPI spec 5.0 5.2.16)

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow the users arranging hotpluggable as
   ZONE_MOVABLE.
   (This has been done by the MOVABLE_NODE functionality in Linux.)

2. when the system is booting, prevent bootmem allocator from allocating
   hotpluggable memory for the kernel before the memory initialization
   finishes.

The problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.

[Preparation]

Bootloader has to load the kernel image into memory.  And this memory must
be unhotpluggable.  We cannot prevent this anyway.  So in a memory hotplug
system, we can assume any node the kernel resides in is not hotpluggable.

Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
 But memblock has already started to work.  In the current kernel,
memblock allocates the following memory before SRAT is parsed:

setup_arch()
 |->memblock_x86_fill()            /* memblock is ready */
 |......
 |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
 |->reserve_real_mode()            /* allocate memory under 1MB */
 |->init_mem_mapping()             /* allocate page tables, about 2MB to map 1GB memory */
 |->dma_contiguous_reserve()       /* specified by user, should be low */
 |->setup_log_buf()                /* specified by user, several mega bytes */
 |->relocate_initrd()              /* could be large, but will be freed after boot, should reorder */
 |->acpi_initrd_override()         /* several mega bytes */
 |->reserve_crashkernel()          /* could be large, should reorder */
 |......
 |->initmem_init()                 /* Parse SRAT */

According to Tejun's advice, before SRAT is parsed, we should try our best
to allocate memory near the kernel image.  Since the whole node the kernel
resides in won't be hotpluggable, and for a modern server, a node may have
at least 16GB memory, allocating several mega bytes memory around the
kernel image won't cross to hotpluggable memory.

[About this patchset]

So this patchset is the preparation for the problem 2 that we want to
solve.  It does the following:

1. Make memblock be able to allocate memory bottom up.
   1) Keep all the memblock APIs' prototype unmodified.
   2) When the direction is bottom up, keep the start address greater than the
      end of kernel image.

2. Improve init_mem_mapping() to support allocate page tables in
   bottom up direction.

3. Introduce "movable_node" boot option to enable and disable this
   functionality.

This patch (of 6):

Create a new function __memblock_find_range_top_down to factor out of
top-down allocation from memblock_find_in_range_node.  This is a
preparation because we will introduce a new bottom-up allocation mode in
the following patch.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Heiko Carstens
4e99b02131 mmap: arch_get_unmapped_area(): use proper mmap base for bottom up direction
This is more or less the generic variant of commit 41aacc1eea ("x86
get_unmapped_area: Access mmap_legacy_base through mm_struct member").

So effectively architectures which use an own arch_pick_mmap_layout()
implementation but call the generic arch_get_unmapped_area() now can
also randomize their mmap_base.

All architectures which have an own arch_pick_mmap_layout() and call the
generic arch_get_unmapped_area() (arm64, s390, tile) currently set
mmap_base to TASK_UNMAPPED_BASE.  This is also true for the generic
arch_pick_mmap_layout() function.  So this change is a no-op currently.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Radu Caragea <sinaelgl@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Weijie Yang
b349acc76b mm/zswap: avoid unnecessary page scanning
Add SetPageReclaim() before __swap_writepage() so that page can be moved
to the tail of the inactive list, which can avoid unnecessary page
scanning as this page was reclaimed by swap subsystem before.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Zhang Yanfei
bfc4f9d520 mm/page_alloc.c: remove unused marco LONG_ALIGN
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:07 +09:00
Krzysztof Kozlowski
58e97ba6b1 frontswap: enable call to invalidate area on swapoff
During swapoff the frontswap_map was NULL-ified before calling
frontswap_invalidate_area().  However the frontswap_invalidate_area()
exits early if frontswap_map is NULL.  Invalidate was never called
during swapoff.

This patch moves frontswap_map_set() in swapoff just after calling
frontswap_invalidate_area() so outside of locks (swap_lock and
swap_info_struct->lock).  This shouldn't be a problem as during swapon
the frontswap_map_set() is called also outside of any locks.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:07 +09:00
Seth Jennings
2de1a7e40a mm/swapfile.c: fix comment typos
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:07 +09:00
Catalin Marinas
7f88f88f83 mm: kmemleak: avoid false negatives on vmalloc'ed objects
Commit 248ac0e194 ("mm/vmalloc: remove guard page from between vmap
blocks") had the side effect of making vmap_area.va_end member point to
the next vmap_area.va_start.  This was creating an artificial reference
to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.

This patch marks the vmap_area containing pointers explicitly and
reduces the min ref_count to 2 as vm_struct still contains a reference
to the vmalloc'ed object.  The kmemleak add_scan_area() function has
been improved to allow a SIZE_MAX argument covering the rest of the
object (for simpler calling sites).

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:07 +09:00
Zhang Yanfei
81556b0252 mm/sparsemem: fix a bug in free_map_bootmem when CONFIG_SPARSEMEM_VMEMMAP
We pass the number of pages which hold page structs of a memory section
to free_map_bootmem().  This is right when !CONFIG_SPARSEMEM_VMEMMAP but
wrong when CONFIG_SPARSEMEM_VMEMMAP.  When CONFIG_SPARSEMEM_VMEMMAP, we
should pass the number of pages of a memory section to free_map_bootmem.

So the fix is removing the nr_pages parameter.  When
CONFIG_SPARSEMEM_VMEMMAP, we directly use the prefined marco
PAGES_PER_SECTION in free_map_bootmem.  When !CONFIG_SPARSEMEM_VMEMMAP,
we calculate page numbers needed to hold the page structs for a memory
section and use the value in free_map_bootmem().

This was found by reading the code.  And I have no machine that support
memory hot-remove to test the bug now.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:06 +09:00
Zhang Yanfei
85b35feaec mm/sparsemem: use PAGES_PER_SECTION to remove redundant nr_pages parameter
For below functions,

- sparse_add_one_section()
- kmalloc_section_memmap()
- __kmalloc_section_memmap()
- __kfree_section_memmap()

they are always invoked to operate on one memory section, so it is
redundant to always pass a nr_pages parameter, which is the page numbers
in one section.  So we can directly use predefined macro PAGES_PER_SECTION
instead of passing the parameter.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:06 +09:00
Ying Han
071aee1384 memcg: support hierarchical memory.numa_stats
The memory.numa_stat file was not hierarchical.  Memory charged to the
children was not shown in parent's numa_stat.

This change adds the "hierarchical_" stats to the existing stats.  The
new hierarchical stats include the sum of all children's values in
addition to the value of the memcg.

Tested: Create cgroup a, a/b and run workload under b.  The values of
b are included in the "hierarchical_*" under a.

$ cd /sys/fs/cgroup
$ echo 1 > memory.use_hierarchy
$ mkdir a a/b

Run workload in a/b:
$ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

The hierarchical_ fields in parent (a) show use of workload in a/b:
$ cat a/memory.numa_stat
total=0 N0=0 N1=0 N2=0 N3=0
file=0 N0=0 N1=0 N2=0 N3=0
anon=0 N0=0 N1=0 N2=0 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

$ cat a/b/memory.numa_stat
total=908 N0=552 N1=317 N2=39 N3=0
file=850 N0=549 N1=301 N2=0 N3=0
anon=58 N0=3 N1=16 N2=39 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:06 +09:00
Greg Thelen
25485de6e9 memcg: refactor mem_control_numa_stat_show()
Refactor mem_control_numa_stat_show() to use a new stats structure for
smaller and simpler code.  This consolidates nearly identical code.

    text      data      bss        dec      hex   filename
  8,137,679 1,703,496 1,896,448 11,737,623 b31a17 vmlinux.before
  8,136,911 1,703,496 1,896,448 11,736,855 b31717 vmlinux.after

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:06 +09:00
Jianguo Wu
b76ac7e734 mm/mempolicy: use NUMA_NO_NODE
Use more appropriate NUMA_NO_NODE instead of -1

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:06 +09:00