Commit Graph

8827 Commits

Author SHA1 Message Date
Joonsoo Kim
48c96a3685 mm/page_owner: keep track of page owners
This is the page owner tracking code which is introduced so far ago.  It
is resident on Andrew's tree, though, nobody tried to upstream so it
remain as is.  Our company uses this feature actively to debug memory leak
or to find a memory hogger so I decide to upstream this feature.

This functionality help us to know who allocates the page.  When
allocating a page, we store some information about allocation in extra
memory.  Later, if we need to know status of all pages, we can get and
analyze it from this stored information.

In previous version of this feature, extra memory is statically defined in
struct page, but, in this version, extra memory is allocated outside of
struct page.  It enables us to turn on/off this feature at boottime
without considerable memory waste.

Although we already have tracepoint for tracing page allocation/free,
using it to analyze page owner is rather complex.  We need to enlarge the
trace buffer for preventing overlapping until userspace program launched.
And, launched program continually dump out the trace buffer for later
analysis and it would change system behaviour with more possibility rather
than just keeping it in memory, so bad for debug.

Moreover, we can use page_owner feature further for various purposes.  For
example, we can use it for fragmentation statistics implemented in this
patch.  And, I also plan to implement some CMA failure debugging feature
using this interface.

I'd like to give the credit for all developers contributed this feature,
but, it's not easy because I don't know exact history.  Sorry about that.
Below is people who has "Signed-off-by" in the patches in Andrew's tree.

Contributor:
Alexander Nyberg <alexn@dsv.su.se>
Mel Gorman <mgorman@suse.de>
Dave Hansen <dave@linux.vnet.ibm.com>
Minchan Kim <minchan@kernel.org>
Michal Nazarewicz <mina86@mina86.com>
Andrew Morton <akpm@linux-foundation.org>
Jungsoo Son <jungsoo.son@lge.com>

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:48 -08:00
Joonsoo Kim
dbc8358c72 mm/nommu: use alloc_pages_exact() rather than its own implementation
do_mmap_private() in nommu.c try to allocate physically contiguous pages
with arbitrary size in some cases and we now have good abstract function
to do exactly same thing, alloc_pages_exact().  So, change to use it.

There is no functional change.  This is the preparation step for support
page owner feature accurately.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:48 -08:00
Joonsoo Kim
031bc5743f mm/debug-pagealloc: make debug-pagealloc boottime configurable
Now, we have prepared to avoid using debug-pagealloc in boottime.  So
introduce new kernel-parameter to disable debug-pagealloc in boottime, and
makes related functions to be disabled in this case.

Only non-intuitive part is change of guard page functions.  Because guard
page is effective only if debug-pagealloc is enabled, turning off
according to debug-pagealloc is reasonable thing to do.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:48 -08:00
Joonsoo Kim
e30825f186 mm/debug-pagealloc: prepare boottime configurable on/off
Until now, debug-pagealloc needs extra flags in struct page, so we need to
recompile whole source code when we decide to use it.  This is really
painful, because it takes some time to recompile and sometimes rebuild is
not possible due to third party module depending on struct page.  So, we
can't use this good feature in many cases.

Now, we have the page extension feature that allows us to insert extra
flags to outside of struct page.  This gets rid of third party module
issue mentioned above.  And, this allows us to determine if we need extra
memory for this page extension in boottime.  With these property, we can
avoid using debug-pagealloc in boottime with low computational overhead in
the kernel built with CONFIG_DEBUG_PAGEALLOC.  This will help our
development process greatly.

This patch is the preparation step to achive above goal.  debug-pagealloc
originally uses extra field of struct page, but, after this patch, it will
use field of struct page_ext.  Because memory for page_ext is allocated
later than initialization of page allocator in CONFIG_SPARSEMEM, we should
disable debug-pagealloc feature temporarily until initialization of
page_ext.  This patch implements this.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:48 -08:00
Joonsoo Kim
eefa864b70 mm/page_ext: resurrect struct page extending code for debugging
When we debug something, we'd like to insert some information to every
page.  For this purpose, we sometimes modify struct page itself.  But,
this has drawbacks.  First, it requires re-compile.  This makes us
hesitate to use the powerful debug feature so development process is
slowed down.  And, second, sometimes it is impossible to rebuild the
kernel due to third party module dependency.  At third, system behaviour
would be largely different after re-compile, because it changes size of
struct page greatly and this structure is accessed by every part of
kernel.  Keeping this as it is would be better to reproduce errornous
situation.

This feature is intended to overcome above mentioned problems.  This
feature allocates memory for extended data per page in certain place
rather than the struct page itself.  This memory can be accessed by the
accessor functions provided by this code.  During the boot process, it
checks whether allocation of huge chunk of memory is needed or not.  If
not, it avoids allocating memory at all.  With this advantage, we can
include this feature into the kernel in default and can avoid rebuild and
solve related problems.

Until now, memcg uses this technique.  But, now, memcg decides to embed
their variable to struct page itself and it's code to extend struct page
has been removed.  I'd like to use this code to develop debug feature, so
this patch resurrect it.

To help these things to work well, this patch introduces two callbacks for
clients.  One is the need callback which is mandatory if user wants to
avoid useless memory allocation at boot-time.  The other is optional, init
callback, which is used to do proper initialization after memory is
allocated.  Detailed explanation about purpose of these functions is in
code comment.  Please refer it.

Others are completely same with previous extension code in memcg.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:48 -08:00
Zhang Zhen
056b7ccef4 mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache()
The gfp was passed in but never used in this function.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:47 -08:00
Jesse Barnes
e1d6d01ab4 mm: export find_extend_vma() and handle_mm_fault() for driver use
This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
simply, rather than doing tricks with page refs and get_user_pages().

Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Oded Gabbay <oded.gabbay@amd.com>
Cc: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:47 -08:00
Luiz Capitulino
7d9ca0004f hugetlb: hugetlb_register_all_nodes(): add __init marker
This function is only called during initialization.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:47 -08:00
Luiz Capitulino
df994ead54 hugetlb: alloc_bootmem_huge_page(): use IS_ALIGNED()
No reason to duplicate the code of an existing macro.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:47 -08:00
Vladimir Davydov
6f185c290e memcg: turn memcg_kmem_skip_account into a bit field
It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
the task_struct.

Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to
set/clear the flag inline.  Regarding the overwhelming comment to the
helpers, which is removed by this patch too, we already have a compact yet
accurate explanation in memcg_schedule_cache_create, no need in yet
another one.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:47 -08:00
Vladimir Davydov
4e701d7b37 memcg: only check memcg_kmem_skip_account in __memcg_kmem_get_cache
__memcg_kmem_get_cache can recurse if it calls kmalloc (which it does if
the cgroup's kmem cache doesn't exist), because kmalloc may call
__memcg_kmem_get_cache internally again.  To avoid the recursion, we use
the task_struct->memcg_kmem_skip_account flag.

However, there's no need checking the flag in memcg_kmem_newpage_charge,
because there's no way how this function could result in recursion, if
called from memcg_kmem_get_cache.  So let's remove the redundant code.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:47 -08:00
Vladimir Davydov
900a38f027 memcg: zap kmem_account_flags
The only such flag is KMEM_ACCOUNTED_ACTIVE, but it's set iff
mem_cgroup->kmemcg_id is initialized, so we can check kmemcg_id instead of
having a separate flags field.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Weijie Yang
c313dc5ded mm: mincore: add hwpoison page handle
When the encountered pte is a swap entry, the current code handles two
cases: migration and normal swapentry, but we have a third case: hwpoison
page.

This patch adds hwpoison page handle, consider hwpoison page incore as
same as migration.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Davidlohr Bueso
b258d86065 mm/rmap: calculate page offset when needed
Call page_to_pgoff() to get the page offset once we are sure we actually
need it, and any very obvious initial function checks have passed.
Trivial micro-optimization, and potentially save some cycles.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Joonsoo Kim
2847cf95c6 mm/debug-pagealloc: cleanup page guard code
Page guard is used by debug-pagealloc feature.  Currently, it is
open-coded, but, I think that more abstraction of it makes core page
allocator code more readable.

There is no functional difference.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Tony Luck
4308ce17f6 mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUG
There is a lot of duplication in the rubric around actually setting or
clearing a mem region flag.  Create a new helper function to do this and
reduce each of memblock_mark_hotplug() and memblock_clear_hotplug() to a
single line.

This will be useful if someone were to add a new mem region flag - which
I hope to be doing some day soon. But it looks like a plausible cleanup
even without that - so I'd like to get it out of the way now.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Emil Medve <Emilian.Medve@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Vladimir Davydov
95fc3c5010 memcg: do not abuse memcg_kmem_skip_account
task_struct->memcg_kmem_skip_account was initially introduced to avoid
recursion during kmem cache creation: memcg_kmem_get_cache, which is
called by kmem_cache_alloc to determine the per-memcg cache to account
allocation to, may issue lazy cache creation if the needed cache doesn't
exist, which means issuing yet another kmem_cache_alloc.  We can't just
pass a flag to the nested kmem_cache_alloc disabling kmem accounting,
because there are hidden allocations, e.g.  in INIT_WORK.  So we
introduced a flag on the task_struct, memcg_kmem_skip_account, making
memcg_kmem_get_cache return immediately.

By its nature, the flag may also be used to disable accounting for
allocations shared among different cgroups, and currently it is used this
way in memcg_activate_kmem.  Using it like this looks like abusing it to
me.  If we want to disable accounting for some allocations (which we will
definitely want one day), we should either add GFP_NO_MEMCG or GFP_MEMCG
flag in order to blacklist/whitelist some allocations.

For now, let's simply remove memcg_stop/resume_kmem_account from
memcg_activate_kmem.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Vladimir Davydov
9d100c5e47 memcg: don't check mm in __memcg_kmem_{get_cache,newpage_charge}
We already assured the current task has mm in memcg_kmem_should_charge,
no need to double check.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Vladimir Davydov
bfda7e8fe4 memcg: __mem_cgroup_free: remove stale disarm_static_keys comment
cpuset code stopped using cgroup_lock in favor of cpuset_mutex long ago.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Gregory Fong
b5be83e308 mm: cma: align to physical address, not CMA region position
The alignment in cma_alloc() was done w.r.t. the bitmap.  This is a
problem when, for example:

- a device requires 16M (order 12) alignment
- the CMA region is not 16 M aligned

In such a case, can result with the CMA region starting at, say,
0x2f800000 but any allocation you make from there will be aligned from
there.  Requesting an allocation of 32 M with 16 M alignment will result
in an allocation from 0x2f800000 to 0x31800000, which doesn't work very
well if your strange device requires 16M alignment.

Change to use bitmap_find_next_zero_area_off() to account for the
difference in alignment at reserve-time and alloc-time.

Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kukjin Kim <kgene.kim@samsung.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Davidlohr Bueso
c8475d144a mm/memory.c: share the i_mmap_rwsem
The unmap_mapping_range family of functions do the unmapping of user pages
(ultimately via zap_page_range_single) without touching the actual
interval tree, thus share the lock.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Davidlohr Bueso
1acf2e0407 mm/nommu: share the i_mmap_rwsem
Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify
that any shared mappings of the inode in question aren't broken (dead
zone).  afaict the only user being ramfs to handle the size change
attribute.

Pretty much a no-brainer to share the lock.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:46 -08:00
Davidlohr Bueso
d28eb9c861 mm/memory-failure: share the i_mmap_rwsem
No brainer conversion: collect_procs_file() only schedules a process for
later kill, share the lock, similarly to the anon vma variant.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:45 -08:00
Davidlohr Bueso
874bfcaf79 mm/xip: share the i_mmap_rwsem
__xip_unmap() will remove the xip sparse page from the cache and take down
pte mapping, without altering the interval tree, thus share the
i_mmap_rwsem when searching for the ptes to unmap.

Additionally, tidy up the function a bit and make variables only local to
the interval tree walk loop.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:45 -08:00
Davidlohr Bueso
3dec0ba0be mm/rmap: share the i_mmap_rwsem
Similarly to the anon memory counterpart, we can share the mapping's lock
ownership as the interval tree is not modified when doing doing the walk,
only the file page.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:45 -08:00
Davidlohr Bueso
c8c06efa8b mm: convert i_mmap_mutex to rwsem
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
similar data, one for file backed pages and the other for anon memory.  To
this end, this lock can also be a rwsem.  In addition, there are some
important opportunities to share the lock when there are no tree
modifications.

This conversion is straightforward.  For now, all users take the write
lock.

[sfr@canb.auug.org.au: update fremap.c]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:45 -08:00
Davidlohr Bueso
83cde9e8ba mm: use new helper functions around the i_mmap_mutex
Convert all open coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:45 -08:00
Linus Torvalds
a7cb7bb664 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree update from Jiri Kosina:
 "Usual stuff: documentation updates, printk() fixes, etc"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
  intel_ips: fix a type in error message
  cpufreq: cpufreq-dt: Move newline to end of error message
  ps3rom: fix error return code
  treewide: fix typo in printk and Kconfig
  ARM: dts: bcm63138: change "interupts" to "interrupts"
  Replace mentions of "list_struct" to "list_head"
  kernel: trace: fix printk message
  scsi: mpt2sas: fix ioctl in comment
  zbud, zswap: change module author email
  clocksource: Fix 'clcoksource' typo in comment
  arm: fix wording of "Crotex" in CONFIG_ARCH_EXYNOS3 help
  gpio: msm-v1: make boolean argument more obvious
  usb: Fix typo in usb-serial-simple.c
  PCI: Fix comment typo 'COMFIG_PM_OPS'
  powerpc: Fix comment typo 'CONIFG_8xx'
  powerpc: Fix comment typos 'CONFiG_ALTIVEC'
  clk: st: Spelling s/stucture/structure/
  isci: Spelling s/stucture/structure/
  usb: gadget: zero: Spelling s/infrastucture/infrastructure/
  treewide: Fix company name in module descriptions
  ...
2014-12-12 10:08:06 -08:00
Linus Torvalds
2756d373a3 Merge branch 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup update from Tejun Heo:
 "cpuset got simplified a bit.  cgroup core got a fix on unified
  hierarchy and grew some effective css related interfaces which will be
  used for blkio support for writeback IO traffic which is currently
  being worked on"

* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: implement cgroup_get_e_css()
  cgroup: add cgroup_subsys->css_e_css_changed()
  cgroup: add cgroup_subsys->css_released()
  cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
  cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
  cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
  cpuset: lock vs unlock typo
  cpuset: simplify cpuset_node_allowed API
  cpuset: convert callback_mutex to a spinlock
2014-12-11 18:57:19 -08:00
Linus Torvalds
eedb3d3304 Merge branch 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "Nothing interesting.  A patch to convert the remaining __get_cpu_var()
  users, another to fix non-critical off-by-one in an assertion and a
  cosmetic conversion to lockless_dereference() in percpu-ref.

  The back-merge from mainline is to receive lockless_dereference()"

* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: Replace smp_read_barrier_depends() with lockless_dereference()
  percpu: Convert remaining __get_cpu_var uses in 3.18-rcX
  percpu: off by one in BUG_ON()
2014-12-11 18:36:26 -08:00
Linus Torvalds
140cd7fb04 powerpc updates for 3.19
Some nice cleanups like removing bootmem, and removal of __get_cpu_var().
 
 There is one patch to mm/gup.c. This is the generic GUP implementation, but is
 only used by us and arm(64). We have an ack from Steve Capper, and although we
 didn't get an ack from Andrew he told us to take the patch through the powerpc
 tree.
 
 There's one cxl patch. This is in drivers/misc, but Greg said he was happy for
 us to manage fixes for it.
 
 There is an infrastructure patch to support an IPMI driver for OPAL. That patch
 also appears in Corey Minyard's IPMI tree, you may see a conflict there.
 
 There is also an RTC driver for OPAL. We weren't able to get any response from
 the RTC maintainer, Alessandro Zummo, so in the end we just merged the driver.
 
 The usual batch of Freescale updates from Scott.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUiSTSAAoJEFHr6jzI4aWAirQP/3rIEng0LzLu5kW2zkGylIaM
 SNDum1vze3mHiTFl+CFcSIGpC1UEULoB49HA+2oE/ExKpIceG6lpL2LP+wNh2FW5
 mozjMjS6mZt4w1Fu1D2ZtgQc3O1T1pxkqsnZmPa8gVf5k5d5IQNPY6yB0pgVWwbV
 gwBKxe4VwPAzJjppE9i9MDhNTJwmHZq0lI8XuoTXOOU/f+4G1WxmjrbyveQ7cRP5
 i/sq2cKjxpWA+KDeIXo0GR0DpXR7qMeAvFX5xXY7oKuUJIFDM4kSHfmMYP6qLf5c
 2vlsJqHVqfOgQdve41z1ooaPzNtg7ezVo+VqqguSgtSgwy2JUo/uHpnzz3gD1Olo
 AP5+6xj8LZac0rTPxF4n4Hoyrp7AaaFjEFt1zqT9PWniZW4B41wtia0QORBNUf1S
 UEmKAC9T3WZJ47mH7WMSadtOPF9E3Yd/zuiPD4udtptCNKPbr6/k1MpJPIW2D4Rn
 BJ0QZTRd7V0yRofXxZtHxaMxq8pWd/Tip7J/zr/ghz+ulnH8BuFamuhCCLuJlESU
 +A2PMfuseyTMpH9sMAmmTwSGPDKjaUFWvmFvY/n88NZL7r2LlomNrDWFSSQOIHUP
 FxjYmjUMpZeexsfyRdgFV/INhYC3o3cso2fRGO45YK6nkxNnjNFEBS6WhQLvNLBu
 sknd1WjXkuJtoMC15SrQ
 =jvyT
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-3.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc updates from Michael Ellerman:
 "Some nice cleanups like removing bootmem, and removal of
  __get_cpu_var().

  There is one patch to mm/gup.c.  This is the generic GUP
  implementation, but is only used by us and arm(64).  We have an ack
  from Steve Capper, and although we didn't get an ack from Andrew he
  told us to take the patch through the powerpc tree.

  There's one cxl patch.  This is in drivers/misc, but Greg said he was
  happy for us to manage fixes for it.

  There is an infrastructure patch to support an IPMI driver for OPAL.

  There is also an RTC driver for OPAL.  We weren't able to get any
  response from the RTC maintainer, Alessandro Zummo, so in the end we
  just merged the driver.

  The usual batch of Freescale updates from Scott"

* tag 'powerpc-3.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (101 commits)
  powerpc/powernv: Return to cpu offline loop when finished in KVM guest
  powerpc/book3s: Fix partial invalidation of TLBs in MCE code.
  powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault
  powerpc/xmon: Cleanup the breakpoint flags
  powerpc/xmon: Enable HW instruction breakpoint on POWER8
  powerpc/mm/thp: Use tlbiel if possible
  powerpc/mm/thp: Remove code duplication
  powerpc/mm/hugetlb: Sanity check gigantic hugepage count
  powerpc/oprofile: Disable pagefaults during user stack read
  powerpc/mm: Check for matching hpte without taking hpte lock
  powerpc: Drop useless warning in eeh_init()
  powerpc/powernv: Cleanup unused MCE definitions/declarations.
  powerpc/eeh: Dump PHB diag-data early
  powerpc/eeh: Recover EEH error on ownership change for BCM5719
  powerpc/eeh: Set EEH_PE_RESET on PE reset
  powerpc/eeh: Refactor eeh_reset_pe()
  powerpc: Remove more traces of bootmem
  powerpc/pseries: Initialise nvram_pstore_info's buf_lock
  cxl: Name interrupts in /proc/interrupt
  cxl: Return error to PSL if IRQ demultiplexing fails & print clearer warning
  ...
2014-12-11 17:48:14 -08:00
Linus Torvalds
27afc5dbda Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "The most notable change for this pull request is the ftrace rework
  from Heiko.  It brings a small performance improvement and the ground
  work to support a new gcc option to replace the mcount blocks with a
  single nop.

  Two new s390 specific system calls are added to emulate user space
  mmio for PCI, an artifact of the how PCI memory is accessed.

  Two patches for the memory management with changes to common code.
  For KVM mm_forbids_zeropage is added which disables the empty zero
  page for an mm that is used by a KVM process.  And an optimization,
  pmdp_get_and_clear_full is added analog to ptep_get_and_clear_full.

  Some micro optimization for the cmpxchg and the spinlock code.

  And as usual bug fixes and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
  s390/cputime: fix 31-bit compile
  s390/scm_block: make the number of reqs per HW req configurable
  s390/scm_block: handle multiple requests in one HW request
  s390/scm_block: allocate aidaw pages only when necessary
  s390/scm_block: use mempool to manage aidaw requests
  s390/eadm: change timeout value
  s390/mm: fix memory leak of ptlock in pmd_free_tlb
  s390: use local symbol names in entry[64].S
  s390/ptrace: always include vector registers in core files
  s390/simd: clear vector register pointer on fork/clone
  s390: translate cputime magic constants to macros
  s390/idle: convert open coded idle time seqcount
  s390/idle: add missing irq off lockdep annotation
  s390/debug: avoid function call for debug_sprintf_*
  s390/kprobes: fix instruction copy for out of line execution
  s390: remove diag 44 calls from cpu_relax()
  s390/dasd: retry partition detection
  s390/dasd: fix list corruption for sleep_on requests
  s390/dasd: fix infinite term I/O loop
  s390/dasd: remove unused code
  ...
2014-12-11 17:30:55 -08:00
Linus Torvalds
b6da0076ba Merge branch 'akpm' (patchbomb from Andrew)
Merge first patchbomb from Andrew Morton:
 - a few minor cifs fixes
 - dma-debug upadtes
 - ocfs2
 - slab
 - about half of MM
 - procfs
 - kernel/exit.c
 - panic.c tweaks
 - printk upates
 - lib/ updates
 - checkpatch updates
 - fs/binfmt updates
 - the drivers/rtc tree
 - nilfs
 - kmod fixes
 - more kernel/exit.c
 - various other misc tweaks and fixes

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  exit: pidns: fix/update the comments in zap_pid_ns_processes()
  exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  exit: exit_notify: re-use "dead" list to autoreap current
  exit: reparent: call forget_original_parent() under tasklist_lock
  exit: reparent: avoid find_new_reaper() if no children
  exit: reparent: introduce find_alive_thread()
  exit: reparent: introduce find_child_reaper()
  exit: reparent: document the ->has_child_subreaper checks
  exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
  exit: proc: don't try to flush /proc/tgid/task/tgid
  exit: release_task: fix the comment about group leader accounting
  exit: wait: drop tasklist_lock before psig->c* accounting
  exit: wait: don't use zombie->real_parent
  exit: wait: cleanup the ptrace_reparented() checks
  usermodehelper: kill the kmod_thread_locker logic
  usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
  fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
  nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
  ...
2014-12-10 18:34:42 -08:00
Johannes Weiner
9edad6ea0f mm: move page->mem_cgroup bad page handling into generic code
Now that the external page_cgroup data structure and its lookup is
gone, let the generic bad_page() check for page->mem_cgroup sanity.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Johannes Weiner
5d1ea48bdd mm: page_cgroup: rename file to mm/swap_cgroup.c
Now that the external page_cgroup data structure and its lookup is gone,
the only code remaining in there is swap slot accounting.

Rename it and move the conditional compilation into mm/Makefile.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Johannes Weiner
1306a85aed mm: embed the memcg pointer directly into struct page
Memory cgroups used to have 5 per-page pointers.  To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.

There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged.  The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory.  With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space.  Remaining users that care can
still compile their kernels without CONFIG_MEMCG.

     text    data     bss     dec     hex     filename
  8828345 1725264  983040 11536649 b00909  vmlinux.old
  8827425 1725264  966656 11519345 afc571  vmlinux.new

[mhocko@suse.cz: update Documentation/cgroups/memory.txt]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Johannes Weiner
22811c6bc3 mm: memcontrol: remove stale page_cgroup_lock comment
There is no cgroup-specific page lock anymore.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Li Haifeng
a1ad28973d mm/frontswap.c: fix the condition in BUG_ON
The largest index of swap device is MAX_SWAPFILES-1.  So the type should
be less than MAX_SWAPFILES.

Signed-off-by: Haifeng Li <omycle@gmail.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Wei Yuan
26086de3fc mm: fix a spelling mistake
Signed-off-by Wei Yuan <weiyuan.wei@huawei.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Hillf Danton
569f48b858 mm: hugetlb: fix __unmap_hugepage_range()
First, after flushing TLB, we have no need to scan pte from start again.
Second, before bail out loop, the address is forwarded one step.

Signed-off-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Michal Hocko
e4bd6a0248 mm, memcg: fix potential undefined behaviour in page stat accounting
Since commit d7365e783e ("mm: memcontrol: fix missed end-writeback
page accounting") mem_cgroup_end_page_stat consumes locked and flags
variables directly rather than via pointers which might trigger C
undefined behavior as those variables are initialized only in the slow
path of mem_cgroup_begin_page_stat.

Although mem_cgroup_end_page_stat handles parameters correctly and
touches them only when they hold a sensible value it is caller which
loads a potentially uninitialized value which then might allow compiler
to do crazy things.

I haven't seen any warning from gcc and it seems that the current
version (4.9) doesn't exploit this type undefined behavior but Sasha has
reported the following:

  UBSan: Undefined behaviour in mm/rmap.c:1084:2
  load of value 255 is not a valid value for type '_Bool'
  CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
  Call Trace:
    dump_stack (lib/dump_stack.c:52)
    ubsan_epilogue (lib/ubsan.c:159)
    __ubsan_handle_load_invalid_value (lib/ubsan.c:482)
    page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
    unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
    unmap_single_vma (mm/memory.c:1348)
    unmap_vmas (mm/memory.c:1377 (discriminator 3))
    exit_mmap (mm/mmap.c:2837)
    mmput (kernel/fork.c:659)
    do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
    do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
    SyS_exit_group (kernel/exit.c:901)
    tracesys_phase2 (arch/x86/kernel/entry_64.S:529)

Fix this by using pointer parameters for both locked and flags and be
more robust for future compiler changes even though the current code is
implemented correctly.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Johannes Weiner
2314b42db6 mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree()
None of the mem_cgroup_same_or_subtree() callers actually require it to
take the RCU lock, either because they hold it themselves or they have css
references.  Remove it.

To make the API change clear, rename the leftover helper to
mem_cgroup_is_descendant() to match cgroup_is_descendant().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Johannes Weiner
413918bb61 mm: memcontrol: pull the NULL check from __mem_cgroup_same_or_subtree()
The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner.  It
makes a lot more sense to check where it's looked up, rather than check
for it in __mem_cgroup_same_or_subtree() where it's unexpected.

No other callsite passes NULL to __mem_cgroup_same_or_subtree().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Johannes Weiner
c01f46c7c7 mm: memcontrol: remove bogus NULL check after mem_cgroup_from_task()
That function acts like a typecast - unless NULL is passed in, no NULL can
come out.  task_in_mem_cgroup() callers don't pass NULL tasks.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Johannes Weiner
312722cbb2 mm: memcontrol: shorten the page statistics update slowpath
While moving charges from one memcg to another, page stat updates must
acquire the old memcg's move_lock to prevent double accounting.  That
situation is denoted by an increased memcg->move_accounting.  However, the
charge moving code declares this way too early for now, even before
summing up the RSS and pre-allocating destination charges.

Shorten this slowpath mode by increasing memcg->move_accounting only right
before walking the task's address space with the intention of actually
moving the pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Kirill A. Shutemov
e544a4e74e thp: do not mark zero-page pmd write-protected explicitly
Zero pages can be used only in anonymous mappings, which never have
writable vma->vm_page_prot: see protection_map in mm/mmap.c and __PX1X
definitions.

Let's drop redundant pmd_wrprotect() in set_huge_zero_page().

Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Vladimir Davydov
b047501cd9 memcg: use generic slab iterators for showing slabinfo
Let's use generic slab_start/next/stop for showing memcg caches info.  In
contrast to the current implementation, this will work even if all memcg
caches' info doesn't fit into a seq buffer (a page), plus it simply looks
neater.

Actually, the main reason I do this isn't mere cleanup.  I'm going to zap
the memcg_slab_caches list, because I find it useless provided we have the
slab_caches list, and this patch is a step in this direction.

It should be noted that before this patch an attempt to read
memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
in -EIO, while after this patch it will silently show nothing except the
header, but I don't think it will frustrate anyone.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Vladimir Davydov
4ef461e8f4 memcg: remove mem_cgroup_reclaimable check from soft reclaim
mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages on
*any* NUMA node.  However, the only place where it's called is
mem_cgroup_soft_reclaim(), which tries to reclaim memory from a *specific*
zone.  So the way it is used is incorrect - it will return true even if
the cgroup doesn't have pages on the zone we're scanning.

I think we can get rid of this check completely, because
mem_cgroup_shrink_node_zone(), which is called by
mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
equivalent to shrink_lruvec(), which exits almost immediately if the
lruvec passed to it is empty.  So there's no need to optimize anything
here.  Besides, we don't have such a check in the general scan path
(shrink_zone) either.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Johannes Weiner
247b1447b6 mm: memcontrol: fold mem_cgroup_start_move()/mem_cgroup_end_move()
Having these functions and their documentation split out and somewhere
makes it harder, not easier, to follow what's going on.

Inline them directly where charge moving is prepared and finished, and put
an explanation right next to it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Johannes Weiner
4e2f245d38 mm: memcontrol: don't pass a NULL memcg to mem_cgroup_end_move()
mem_cgroup_end_move() checks if the passed memcg is NULL, along with a
lengthy comment to explain why this seemingly non-sensical situation is
even possible.

Check in cancel_attach() itself whether can_attach() set up the move
context or not, it's a lot more obvious from there.  Then remove the check
and comment in mem_cgroup_end_move().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Johannes Weiner
354a4783a2 mm: memcontrol: inline memcg->move_lock locking
The wrappers around taking and dropping the memcg->move_lock spinlock add
nothing of value.  Inline the spinlock calls into the callsites.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Johannes Weiner
2983331575 mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flag
pc->mem_cgroup had to be left intact after uncharge for the final LRU
removal, and !PCG_USED indicated whether the page was uncharged.  But
since commit 0a31bc97c8 ("mm: memcontrol: rewrite uncharge API") pages
are uncharged after the final LRU removal.  Uncharge can simply clear
the pointer and the PCG_USED/PageCgroupUsed sites can test that instead.

Because this is the last page_cgroup flag, this patch reduces the memcg
per-page overhead to a single pointer.

[akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Johannes Weiner
f4aaa8b43d mm: memcontrol: remove unnecessary PCG_MEM memory charge flag
PCG_MEM is a remnant from an earlier version of 0a31bc97c8 ("mm:
memcontrol: rewrite uncharge API"), used to tell whether migration cleared
a charge while leaving pc->mem_cgroup valid and PCG_USED set.  But in the
final version, mem_cgroup_migrate() directly uncharges the source page,
rendering this distinction unnecessary.  Remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Johannes Weiner
18eca2e636 mm: memcontrol: remove unnecessary PCG_MEMSW memory+swap charge flag
Now that mem_cgroup_swapout() fully uncharges the page, every page that is
still in use when reaching mem_cgroup_uncharge() is known to carry both
the memory and the memory+swap charge.  Simplify the uncharge path and
remove the PCG_MEMSW page flag accordingly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Johannes Weiner
7bdd143c37 mm: memcontrol: uncharge pages on swapout
This series gets rid of the remaining page_cgroup flags, thus cutting the
memcg per-page overhead down to one pointer.

This patch (of 4):

mem_cgroup_swapout() is called with exclusive access to the page at the
end of the page's lifetime.  Instead of clearing the PCG_MEMSW flag and
deferring the uncharge, just do it right away.  This allows follow-up
patches to simplify the uncharge code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Michal Hocko
b9982f8d27 mm: memcontrol: micro-optimize mem_cgroup_split_huge_fixup()
Don't call lookup_page_cgroup() when memcg is disabled.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Vladimir Davydov
8c0145b62e memcg: remove activate_kmem_mutex
The activate_kmem_mutex is used to serialize memcg.kmem.limit updates, but
we already serialize them with memcg_limit_mutex so let's remove the
former.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:07 -08:00
Johannes Weiner
7d5e324573 mm: memcontrol: clarify migration where old page is uncharged
Better explain re-entrant migration when compaction races with reclaim,
and also mention swapcache readahead pages as possible uncharged migration
sources.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Johannes Weiner
dfe0e773d0 mm: memcontrol: update mem_cgroup_page_lruvec() documentation
Commit 7512102cf6 ("memcg: fix GPF when cgroup removal races with last
exit") added a pc->mem_cgroup reset into mem_cgroup_page_lruvec() to
prevent a crash where an anon page gets uncharged on unmap, the memcg is
released, and then the final LRU isolation on free dereferences the
stale pc->mem_cgroup pointer.

But since commit 0a31bc97c8 ("mm: memcontrol: rewrite uncharge API"),
pages are only uncharged AFTER that final LRU isolation, which
guarantees the memcg's lifetime until then.  pc->mem_cgroup now only
needs to be reset for swapcache readahead pages.

Update the comment and callsite requirements accordingly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Vladimir Davydov
bc2f2e7ffe memcg: simplify unreclaimable groups handling in soft limit reclaim
If we fail to reclaim anything from a cgroup during a soft reclaim pass
we want to get the next largest cgroup exceeding its soft limit. To
achieve this, we should obviously remove the current group from the tree
and then pick the largest group. Currently we have a weird loop instead.
Let's simplify it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Vlastimil Babka
fdaf7f5c40 mm, compaction: more focused lru and pcplists draining
The goal of memory compaction is to create high-order freepages through
page migration.  Page migration however puts pages on the per-cpu lru_add
cache, which is later flushed to per-cpu pcplists, and only after pcplists
are drained the pages can actually merge.  This can happen due to the
per-cpu caches becoming full through further freeing, or explicitly.

During direct compaction, it is useful to do the draining explicitly so
that pages merge as soon as possible and compaction can detect success
immediately and keep the latency impact at minimum.  However the current
implementation is far from ideal.  Draining is done only in
__alloc_pages_direct_compact(), after all zones were already compacted,
and the decisions to continue or stop compaction in individual zones was
done without the last batch of migrations being merged.  It is also
missing the draining of lru_add cache before the pcplists.

This patch moves the draining for direct compaction into compact_zone().
It adds the missing lru_cache draining and uses the newly introduced
single zone pcplists draining to reduce overhead and avoid impact on
unrelated zones.  Draining is only performed when it can actually lead to
merging of a page of desired order (passed by cc->order).  This means it
is only done when migration occurred in the previously scanned cc->order
aligned block(s) and the migration scanner is now pointing to the next
cc->order aligned block.

The patch has been tested with stress-highalloc benchmark from mmtests.
Although overal allocation success rates of the benchmark were not
affected, the number of detected compaction successes has doubled.  This
suggests that allocations were previously successful due to implicit
merging caused by background activity, making a later allocation attempt
succeed immediately, but not attributing the success to compaction.  Since
stress-highalloc always tries to allocate almost the whole memory, it
cannot show the improvement in its reported success rate metric.  However
after this patch, compaction should detect success and terminate earlier,
reducing the direct compaction latencies in a real scenario.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Vlastimil Babka
6bace090a2 mm, compaction: always update cached scanner positions
Compaction caches the migration and free scanner positions between
compaction invocations, so that the whole zone gets eventually scanned and
there is no bias towards the initial scanner positions at the
beginning/end of the zone.

The cached positions are continuously updated as scanners progress and the
updating stops as soon as a page is successfully isolated.  The reasoning
behind this is that a pageblock where isolation succeeded is likely to
succeed again in near future and it should be worth revisiting it.

However, the downside is that potentially many pages are rescanned without
successful isolation.  At worst, there might be a page where isolation
from LRU succeeds but migration fails (potentially always).  So upon
encountering this page, cached position would always stop being updated
for no good reason.  It might have been useful to let such page be
rescanned with sync compaction after async one failed, but this is now
handled by caching scanner position for async and sync mode separately
since commit 35979ef339 ("mm, compaction: add per-zone migration pfn
cache for async compaction").

After this patch, cached positions are updated unconditionally.  In
stress-highalloc benchmark, this has decreased the numbers of scanned
pages by few percent, without affecting allocation success rates.

To prevent free scanner from leaving free pages behind after they are
returned due to page migration failure, the cached scanner pfn is changed
to point to the pageblock of the returned free page with the highest pfn,
before leaving compact_zone().

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Vlastimil Babka
f866979539 mm, compaction: defer only on COMPACT_COMPLETE
Deferred compaction is employed to avoid compacting zone where sync direct
compaction has recently failed.  As such, it makes sense to only defer
when a full zone was scanned, which is when compact_zone returns with
COMPACT_COMPLETE.  It's less useful to defer when compact_zone returns
with apparent success (COMPACT_PARTIAL), followed by a watermark check
failure, which can happen due to parallel allocation activity.  It also
does not make much sense to defer compaction which was completely skipped
(COMPACT_SKIP) for being unsuitable in the first place.

This patch therefore makes deferred compaction trigger only when
COMPACT_COMPLETE is returned from compact_zone().  Results of
stress-highalloc becnmark show the difference is within measurement error,
so the issue is rather cosmetic.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Vlastimil Babka
97d47a65be mm, compaction: simplify deferred compaction
Since commit 53853e2d2b ("mm, compaction: defer each zone individually
instead of preferred zone"), compaction is deferred for each zone where
sync direct compaction fails, and reset where it succeeds.  However, it
was observed that for DMA zone compaction often appeared to succeed
while subsequent allocation attempt would not, due to different outcome
of watermark check.

In order to properly defer compaction in this zone, the candidate zone
has to be passed back to __alloc_pages_direct_compact() and compaction
deferred in the zone after the allocation attempt fails.

The large source of mismatch between watermark check in compaction and
allocation was the lack of alloc_flags and classzone_idx values in
compaction, which has been fixed in the previous patch.  So with this
problem fixed, we can simplify the code by removing the candidate_zone
parameter and deferring in __alloc_pages_direct_compact().

After this patch, the compaction activity during stress-highalloc
benchmark is still somewhat increased, but it's negligible compared to the
increase that occurred without the better watermark checking.  This
suggests that it is still possible to apparently succeed in compaction but
fail to allocate, possibly due to parallel allocation activity.

[akpm@linux-foundation.org: fix build]
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Vlastimil Babka
ebff398017 mm, compaction: pass classzone_idx and alloc_flags to watermark checking
Compaction relies on zone watermark checks for decisions such as if it's
worth to start compacting in compaction_suitable() or whether compaction
should stop in compact_finished().  The watermark checks take
classzone_idx and alloc_flags parameters, which are related to the memory
allocation request.  But from the context of compaction they are currently
passed as 0, including the direct compaction which is invoked to satisfy
the allocation request, and could therefore know the proper values.

The lack of proper values can lead to mismatch between decisions taken
during compaction and decisions related to the allocation request.  Lack
of proper classzone_idx value means that lowmem_reserve is not taken into
account.  This has manifested (during recent changes to deferred
compaction) when DMA zone was used as fallback for preferred Normal zone.
compaction_suitable() without proper classzone_idx would think that the
watermarks are already satisfied, but watermark check in
get_page_from_freelist() would fail.  Because of this problem, deferring
compaction has extra complexity that can be removed in the following
patch.

The issue (not confirmed in practice) with missing alloc_flags is opposite
in nature.  For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
CMA-enabled systems) the watermark checking in compaction with 0 passed
will be stricter than in get_page_from_freelist().  In these cases
compaction might be running for a longer time than is really needed.

Another issue compaction_suitable() is that the check for "does the zone
need compaction at all?" comes only after the check "does the zone have
enough free free pages to succeed compaction".  The latter considers extra
pages for migration and can therefore in some situations fail and return
COMPACT_SKIPPED, although the high-order allocation would succeed and we
should return COMPACT_PARTIAL.

This patch fixes these problems by adding alloc_flags and classzone_idx to
struct compact_control and related functions involved in direct compaction
and watermark checking.  Where possible, all other callers of
compaction_suitable() pass proper values where those are known.  This is
currently limited to classzone_idx, which is sometimes known in kswapd
context.  However, the direct reclaim callers should_continue_reclaim()
and compaction_ready() do not currently know the proper values, so the
coordination between reclaim and compaction may still not be as accurate
as it could.  This can be fixed later, if it's shown to be an issue.

Additionaly the checks in compact_suitable() are reordered to address the
second issue described above.

The effect of this patch should be slightly better high-order allocation
success rates and/or less compaction overhead, depending on the type of
allocations and presence of CMA.  It allows simplifying deferred
compaction code in a followup patch.

When testing with stress-highalloc, there was some slight improvement
(which might be just due to variance) in success rates of non-THP-like
allocations.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Jamie Liu
1da58ee2a0 mm: vmscan: count only dirty pages as congested
shrink_page_list() counts all pages with a mapping, including clean pages,
toward nr_congested if they're on a write-congested BDI.
shrink_inactive_list() then sets ZONE_CONGESTED if nr_dirty ==
nr_congested.  Fix this apples-to-oranges comparison by only counting
pages for nr_congested if they count for nr_dirty.

Signed-off-by: Jamie Liu <jamieliu@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Yu Zhao
ab1f306fa9 mm: verify compound order when freeing a page
This allows us to catch the bug fixed in the previous patch (mm: free
compound page with correct order).

Here we also verify whether a page is tail page or not -- tail pages are
supposed to be freed along with their head, not by themselves.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:06 -08:00
Vlastimil Babka
c05543293e mm, memory_hotplug/failure: drain single zone pcplists
Memory hotplug and failure mechanisms have several places where pcplists
are drained so that pages are returned to the buddy allocator and can be
e.g. prepared for offlining.  This is always done in the context of a
single zone, we can reduce the pcplists drain to the single zone, which
is now possible.

The change should make memory offlining due to hotremove or failure
faster and not disturbing unrelated pcplists anymore.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Vlastimil Babka
510f550788 mm, cma: drain single zone pcplists
CMA allocation drains pcplists so that pages can merge back to buddy
allocator.  Since it operates on a single zone, we can reduce the
pcplists drain to the single zone, which is now possible.

The change should make CMA allocations faster and not disturbing
unrelated pcplists anymore.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Vlastimil Babka
ec25af84b2 mm, page_isolation: drain single zone pcplists
When setting MIGRATETYPE_ISOLATE on a pageblock, pcplists are drained to
have a better chance that all pages will be successfully isolated and
not left in the per-cpu caches.  Since isolation is always concerned
with a single zone, we can reduce the pcplists drain to the single zone,
which is now possible.

The change should make memory isolation faster and not disturbing
unrelated pcplists anymore.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Vlastimil Babka
93481ff0e5 mm: introduce single zone pcplists drain
The functions for draining per-cpu pages back to buddy allocators
currently always operate on all zones.  There are however several cases
where the drain is only needed in the context of a single zone, and
spilling other pcplists is a waste of time both due to the extra
spilling and later refilling.

This patch introduces new zone pointer parameter to drain_all_pages()
and changes the dummy parameter of drain_local_pages() to be also a zone
pointer.  When NULL is passed, the functions operate on all zones as
usual.  Passing a specific zone pointer reduces the work to the single
zone.

All callers are updated to pass the NULL pointer in this patch.
Conversion to single zone (where appropriate) is done in further
patches.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Pintu Kumar
8612c6639b mm/vmscan.c: replace printk with pr_err
This patch replaces printk(KERN_ERR..) with pr_err found under
shrink_slab.  Thus it also reduces one line extra because of formatting.

Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Pintu Kumar
0cbc8533b7 mm/vmalloc.c: replace printk with pr_warn
This patch replaces printk(KERN_WARNING..) with pr_warn.
Thus it also reduces one line extra because of formatting.

Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Anton Blanchard
f88dfff5f1 mm/page_alloc.c: convert boot printks without log level to pr_info
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Johannes Weiner
6d3d6aa22a mm: memcontrol: remove synchronous stock draining code
With charge reparenting, the last synchronous stock drainer left.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Johannes Weiner
b2052564e6 mm: memcontrol: continue cache reclaim from offlined groups
On cgroup deletion, outstanding page cache charges are moved to the parent
group so that they're not lost and can be reclaimed during pressure
on/inside said parent.  But this reparenting is fairly tricky and its
synchroneous nature has led to several lock-ups in the past.

Since c2931b70a3 ("cgroup: iterate cgroup_subsys_states directly") css
iterators now also include offlined css, so memcg iterators can be changed
to include offlined children during reclaim of a group, and leftover cache
can just stay put.

There is a slight change of behavior in that charges of deleted groups no
longer show up as local charges in the parent.  But they are still
included in the parent's hierarchical statistics.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Johannes Weiner
64f2199389 mm: memcontrol: remove obsolete kmemcg pinning tricks
As charges now pin the css explicitely, there is no more need for kmemcg
to acquire a proxy reference for outstanding pages during offlining, or
maintain state to identify such "dead" groups.

This was the last user of the uncharge functions' return values, so remove
them as well.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Johannes Weiner
e8ea14cc6e mm: memcontrol: take a css reference for each charged page
Charges currently pin the css indirectly by playing tricks during
css_offline(): user pages stall the offlining process until all of them
have been reparented, whereas kmemcg acquires a keep-alive reference if
outstanding kernel pages are detected at that point.

In preparation for removing all this complexity, make the pinning explicit
and acquire a css references for every charged page.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Johannes Weiner
5ac8fb31ad mm: memcontrol: convert reclaim iterator to simple css refcounting
The memcg reclaim iterators use a complicated weak reference scheme to
prevent pinning cgroups indefinitely in the absence of memory pressure.

However, during the ongoing cgroup core rework, css lifetime has been
decoupled such that a pinned css no longer interferes with removal of
the user-visible cgroup, and all this complexity is now unnecessary.

[mhocko@suse.cz: ensure that the cached reference is always released]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:05 -08:00
Johannes Weiner
71f87bee38 mm: hugetlb_cgroup: convert to lockless page counters
Abandon the spinlock-protected byte counters in favor of the unlocked
page counters in the hugetlb controller as well.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Johannes Weiner
3e32cb2e0a mm: memcontrol: lockless page counters
Memory is internally accounted in bytes, using spinlock-protected 64-bit
counters, even though the smallest accounting delta is a page.  The
counter interface is also convoluted and does too many things.

Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it.  The translation from and to bytes then only
happens when interfacing with userspace.

The removed locking overhead is noticable when scaling beyond the per-cpu
charge caches - on a 4-socket machine with 144-threads, the following test
shows the performance differences of 288 memcgs concurrently running a
page fault benchmark:

vanilla:

   18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
         1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
            24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
     1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
 8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
 1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
     1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )

     132.474343877 seconds time elapsed                                          ( +-  0.21% )

lockless:

   12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
           832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
            15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
     1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
 9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
 2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
     1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )

      91.369330729 seconds time elapsed                                          ( +-  0.45% )

On top of improved scalability, this also gets rid of the icky long long
types in the very heart of memcg, which is great for 32 bit and also makes
the code a lot more readable.

Notable differences between the old and new API:

- res_counter_charge() and res_counter_charge_nofail() become
  page_counter_try_charge() and page_counter_charge() resp. to match
  the more common kernel naming scheme of try_do()/do()

- res_counter_uncharge_until() is only ever used to cancel a local
  counter and never to uncharge bigger segments of a hierarchy, so
  it's replaced by the simpler page_counter_cancel()

- res_counter_set_limit() is replaced by page_counter_limit(), which
  expects its callers to serialize against themselves

- res_counter_memparse_write_strategy() is replaced by
  page_counter_limit(), which rounds down to the nearest page size -
  rather than up.  This is more reasonable for explicitely requested
  hard upper limits.

- to keep charging light-weight, page_counter_try_charge() charges
  speculatively, only to roll back if the result exceeds the limit.
  Because of this, a failing bigger charge can temporarily lock out
  smaller charges that would otherwise succeed.  The error is bounded
  to the difference between the smallest and the biggest possible
  charge size, so for memcg, this means that a failing THP charge can
  send base page charges into reclaim upto 2MB (4MB) before the limit
  would have been reached.  This should be acceptable.

[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Pranith Kumar
8df0c2dcf6 slab: replace smp_read_barrier_depends() with lockless_dereference()
Recently lockless_dereference() was added which can be used in place of
hard-coding smp_read_barrier_depends().  The following PATCH makes the
change.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Andrew Morton
c871ac4e96 slab: improve checking for invalid gfp_flags
The code goes BUG, but doesn't tell us which bits were unexpectedly set.
Print that out.

Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Andrey Ryabinin
f6edde9cbe mm: slub: fix format mismatches in slab_err() callers
Adding __printf(3, 4) to slab_err exposed following:

  mm/slub.c: In function `check_slab':
  mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
      s->name, page->objects, maxobj);
      ^
  mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
  mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
      s->name, page->inuse, page->objects);
      ^
  mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]

  mm/slub.c: In function `on_freelist':
  mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
      "should be %d", page->objects, max_objects);

Fix first two warnings by removing redundant s->name.
Fix the last by changing type of max_object from unsigned long to int.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Joonsoo Kim
5436205738 mm/slab: reverse iteration on find_mergeable()
Unlike SLUB, sometimes, object isn't started at the beginning of the slab
in the SLAB.  This causes the unalignment problem when after slab merging
is supported by commit 12220dea07 ("mm/slab: support slab merge").
Alignment mismatch check is introduced ("mm/slab: fix unalignment problem
on Malta with EVA due to slab merge") to prevent merge in this case.

This causes undesirable result that merging happens between infrequently
used kmem_caches if there are kmem_caches with same size and is 256 bytes,
are merged into pool_workqueue rather than kmalloc-256, because
kmem_caches for kmalloc are at the tail of the list.

To prevent this situation, this patch reverses iteration order in
find_mergeable() to find frequently used kmem_caches.  This change helps
to merge kmem_cache to frequently used kmem_caches, such as kmalloc
kmem_caches.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Vladimir Davydov
1df3b26f20 slab: print slabinfo header in seq show
Currently we print the slabinfo header in the seq start method, which
makes it unusable for showing leaks, so we have leaks_show, which does
practically the same as s_show except it doesn't show the header.

However, we can print the header in the seq show method - we only need
to check if the current element is the first on the list.  This will
allow us to use the same set of seq iterators for both leaks and
slabinfo reporting, which is nice.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
LQYMGT
b455def28d mm: slab/slub: coding style: whitespaces and tabs mixture
Some code in mm/slab.c and mm/slub.c use whitespaces in indent.
Clean them up.

Signed-off-by: LQYMGT <lqymgt@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Joonsoo Kim
6b101e2a3c mm/CMA: fix boot regression due to physical address of high_memory
high_memory isn't direct mapped memory so retrieving it's physical address
isn't appropriate.  But, it would be useful to check physical address of
highmem boundary so it's justfiable to get physical address from it.  In
x86, there is a validation check if CONFIG_DEBUG_VIRTUAL and it triggers
following boot failure reported by Ingo.

  ...
  BUG: Int 6: CR2 00f06f53
  ...
  Call Trace:
    dump_stack+0x41/0x52
    early_idt_handler+0x6b/0x6b
    cma_declare_contiguous+0x33/0x212
    dma_contiguous_reserve_area+0x31/0x4e
    dma_contiguous_reserve+0x11d/0x125
    setup_arch+0x7b5/0xb63
    start_kernel+0xb8/0x3e6
    i386_start_kernel+0x79/0x7d

To fix boot regression, this patch implements workaround to avoid
validation check in x86 when retrieving physical address of high_memory.
__pa_nodebug() used by this patch is implemented only in x86 so there is
no choice but to use dirty #ifdef.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:02 -08:00
Linus Torvalds
cbfe0de303 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS changes from Al Viro:
 "First pile out of several (there _definitely_ will be more).  Stuff in
  this one:

   - unification of d_splice_alias()/d_materialize_unique()

   - iov_iter rewrite

   - killing a bunch of ->f_path.dentry users (and f_dentry macro).

     Getting that completed will make life much simpler for
     unionmount/overlayfs, since then we'll be able to limit the places
     sensitive to file _dentry_ to reasonably few.  Which allows to have
     file_inode(file) pointing to inode in a covered layer, with dentry
     pointing to (negative) dentry in union one.

     Still not complete, but much closer now.

   - crapectomy in lustre (dead code removal, mostly)

   - "let's make seq_printf return nothing" preparations

   - assorted cleanups and fixes

  There _definitely_ will be more piles"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  copy_from_iter_nocache()
  new helper: iov_iter_kvec()
  csum_and_copy_..._iter()
  iov_iter.c: handle ITER_KVEC directly
  iov_iter.c: convert copy_to_iter() to iterate_and_advance
  iov_iter.c: convert copy_from_iter() to iterate_and_advance
  iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
  iov_iter.c: convert iov_iter_zero() to iterate_and_advance
  iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
  iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
  iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
  iov_iter.c: iterate_and_advance
  iov_iter.c: macros for iterating over iov_iter
  kill f_dentry macro
  dcache: fix kmemcheck warning in switch_names
  new helper: audit_file()
  nfsd_vfs_write(): use file_inode()
  ncpfs: use file_inode()
  kill f_dentry uses
  lockd: get rid of ->f_path.dentry->d_sb
  ...
2014-12-10 16:10:49 -08:00
Linus Torvalds
c9f861c772 Merge branch 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS update from Ingo Molnar:
 "The biggest change in this cycle is better support for UCNA
  (UnCorrected No Action) events:

    "Handle all uncorrected error reports in the same way (soft
     offline the page). We used to only do that for SRAO
     (software recoverable action optional) machine checks, but
     it makes sense to also do it for UCNA (UnCorrected No
     Action) logs found by CMCI or polling."

  plus various x86 MCE handling updates and fixes"

* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Spell "panicked" correctly
  x86, mce: Support memory error recovery for both UCNA and Deferred error in machine_check_poll
  x86, mce, severity: Extend the the mce_severity mechanism to handle UCNA/DEFERRED error
  x86, MCE, AMD: Assign interrupt handler only when bank supports it
  x86, MCE, AMD: Drop software-defined bank in error thresholding
  x86, MCE, AMD: Move invariant code out from loop body
  x86, MCE, AMD: Correct thresholding error logging
  x86, MCE, AMD: Use macros to compute bank MSRs
  RAS, HWPOISON: Fix wrong error recovery status
  GHES: Make ghes_estatus_caches static
  APEI, GHES: Cleanup unnecessary function for lockless list
2014-12-10 14:20:10 -08:00
Linus Torvalds
3eb5b893eb Merge branch 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MPX support from Thomas Gleixner:
 "This enables support for x86 MPX.

  MPX is a new debug feature for bound checking in user space.  It
  requires kernel support to handle the bound tables and decode the
  bound violating instruction in the trap handler"

* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  asm-generic: Remove asm-generic arch_bprm_mm_init()
  mm: Make arch_unmap()/bprm_mm_init() available to all architectures
  x86: Cleanly separate use of asm-generic/mm_hooks.h
  x86 mpx: Change return type of get_reg_offset()
  fs: Do not include mpx.h in exec.c
  x86, mpx: Add documentation on Intel MPX
  x86, mpx: Cleanup unused bound tables
  x86, mpx: On-demand kernel allocation of bounds tables
  x86, mpx: Decode MPX instruction to get bound violation information
  x86, mpx: Add MPX-specific mmap interface
  x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
  x86, mpx: Add MPX to disabled features
  ia64: Sync struct siginfo with general version
  mips: Sync struct siginfo with general version
  mpx: Extend siginfo structure to include bound violation information
  x86, mpx: Rename cfg_reg_u and status_reg
  x86: mpx: Give bndX registers actual names
  x86: Remove arbitrary instruction size limit in instruction decoder
2014-12-10 09:34:43 -08:00
Linus Torvalds
b64bb1d758 arm64 updates for 3.19
Changes include:
  - Support for alternative instruction patching from Andre
  - seccomp from Akashi
  - Some AArch32 instruction emulation, required by the Android folks
  - Optimisations for exception entry/exit code, cmpxchg, pcpu atomics
  - mmu_gather range calculations moved into core code
  - EFI updates from Ard, including long-awaited SMBIOS support
  - /proc/cpuinfo fixes to align with the format used by arch/arm/
  - A few non-critical fixes across the architecture
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJUhbSAAAoJELescNyEwWM07PQH/AolxqOJTTg8TKe2wvRC+DwY
 R98bcECMwhXvwep1KhTBew7z7NRzXJvVVs+EePSpXWX2+KK2aWN4L50rAb9ow4ty
 PZ5EFw564g3rUpc7cbqIrM/lasiYWuIWw/BL+wccOm3mWbZfokBB2t0tn/2rVv0K
 5tf2VCLLxgiFJPLuYk61uH7Nshvv5uJ6ODwdXjbrH+Mfl6xsaiKv17ZrfP4D/M4o
 hrLoXxVTuuWj3sy/lBJv8vbTbKbQ6BGl9JQhBZGZHeKOdvX7UnbKH4N5vWLUFZya
 QYO92AK1xGolu8a9bEfzrmxn0zXeAHgFTnRwtDCekOvy0kTR9MRIqXASXKO3ZEU=
 =rnFX
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "Here's the usual mixed bag of arm64 updates, also including some
  related EFI changes (Acked by Matt) and the MMU gather range cleanup
  (Acked by you).

  Changes include:
   - support for alternative instruction patching from Andre
   - seccomp from Akashi
   - some AArch32 instruction emulation, required by the Android folks
   - optimisations for exception entry/exit code, cmpxchg, pcpu atomics
   - mmu_gather range calculations moved into core code
   - EFI updates from Ard, including long-awaited SMBIOS support
   - /proc/cpuinfo fixes to align with the format used by arch/arm/
   - a few non-critical fixes across the architecture"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (70 commits)
  arm64: remove the unnecessary arm64_swiotlb_init()
  arm64: add module support for alternatives fixups
  arm64: perf: Prevent wraparound during overflow
  arm64/include/asm: Fixed a warning about 'struct pt_regs'
  arm64: Provide a namespace to NCAPS
  arm64: bpf: lift restriction on last instruction
  arm64: Implement support for read-mostly sections
  arm64: compat: align cacheflush syscall with arch/arm
  arm64: add seccomp support
  arm64: add SIGSYS siginfo for compat task
  arm64: add seccomp syscall for compat task
  asm-generic: add generic seccomp.h for secure computing mode 1
  arm64: ptrace: allow tracer to skip a system call
  arm64: ptrace: add NT_ARM_SYSTEM_CALL regset
  arm64: Move some head.text functions to executable section
  arm64: jump labels: NOP out NOP -> NOP replacement
  arm64: add support to dump the kernel page tables
  arm64: Add FIX_HOLE to permanent fixed addresses
  arm64: alternatives: fix pr_fmt string for consistency
  arm64: vmlinux.lds.S: don't discard .exit.* sections at link-time
  ...
2014-12-09 13:12:47 -08:00
Al Viro
ba00410b81 Merge branch 'iov_iter' into for-next 2014-12-08 20:39:29 -05:00
Al Viro
aa583096d9 copy_from_iter_nocache()
BTW, do we want memcpy_nocache()?

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-08 20:25:23 -05:00
Al Viro
abb78f875f new helper: iov_iter_kvec()
initialization of kvec-backed iov_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-08 20:25:23 -05:00
Al Viro
a604ec7e9f csum_and_copy_..._iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-08 20:25:22 -05:00
Al Viro
a280455fa8 iov_iter.c: handle ITER_KVEC directly
... without bothering with copy_..._user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-08 19:52:00 -05:00
Dave Airlie
8c86394470 Linux 3.18
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUhNLZAAoJEHm+PkMAQRiGAEcH/iclYDW7k2GKemMqboy+Ohmh
 +ELbQothNhlGZlS1wWdD69LBiiXkkQ+ufVYFh/hC0oy0gUdfPMt5t+bOHy6cjn6w
 9zOcACtpDKnqbOwRqXZjZgNmIabk7lRjbn7GK4GQqpIaW4oO0FWcT91FFhtGSPDa
 tjtmGRqDmbNsqfzr18h0WPEpUZmT6MxIdv17AYDliPB1MaaRuAv1Kss05TJrXdfL
 Oucv+C0uwnybD9UWAz6pLJ3H/HR9VJFdkaJ4Y0pbCHAuxdd1+swoTpicluHlsJA1
 EkK5iWQRMpcmGwKvB0unCAQljNpaJiq4/Tlmmv8JlYpMlmIiVLT0D8BZx5q05QQ=
 =oGNw
 -----END PGP SIGNATURE-----

Merge tag 'v3.18' into drm-next

Linux 3.18

Backmerge Linus tree into -next as we had conflicts in i915/radeon/nouveau,
and everyone was solving them individually.

* tag 'v3.18': (57 commits)
  Linux 3.18
  watchdog: s3c2410_wdt: Fix the mask bit offset for Exynos7
  uapi: fix to export linux/vm_sockets.h
  i2c: cadence: Set the hardware time-out register to maximum value
  i2c: davinci: generate STP always when NACK is received
  ahci: disable MSI on SAMSUNG 0xa800 SSD
  context_tracking: Restore previous state in schedule_user
  slab: fix nodeid bounds check for non-contiguous node IDs
  lib/genalloc.c: export devm_gen_pool_create() for modules
  mm: fix anon_vma_clone() error treatment
  mm: fix swapoff hang after page migration and fork
  fat: fix oops on corrupted vfat fs
  ipc/sem.c: fully initialize sem_array before making it visible
  drivers/input/evdev.c: don't kfree() a vmalloc address
  cxgb4: Fill in supported link mode for SFP modules
  xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
  mm/vmpressure.c: fix race in vmpressure_work_fn()
  mm: frontswap: invalidate expired data on a dup-store failure
  mm: do not overwrite reserved pages counter at show_mem()
  drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6
  ...

Conflicts:
	drivers/gpu/drm/i915/intel_display.c
	drivers/gpu/drm/nouveau/nouveau_drm.c
	drivers/gpu/drm/radeon/radeon_cs.c
2014-12-08 10:33:52 +10:00
Paul Mackerras
7c3fbbdd04 slab: fix nodeid bounds check for non-contiguous node IDs
The bounds check for nodeid in ____cache_alloc_node gives false
positives on machines where the node IDs are not contiguous, leading to
a panic at boot time.  For example, on a POWER8 machine the node IDs are
typically 0, 1, 16 and 17.  This means that num_online_nodes() returns
4, so when ____cache_alloc_node is called with nodeid = 16 the VM_BUG_ON
triggers, like this:

  kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079!
  Call Trace:
    .____cache_alloc_node+0x5c/0x270 (unreliable)
    .kmem_cache_alloc_node_trace+0xdc/0x360
    .init_list+0x3c/0x128
    .kmem_cache_init+0x1dc/0x258
    .start_kernel+0x2a0/0x568
    start_here_common+0x20/0xa8

To fix this, we instead compare the nodeid with MAX_NUMNODES, and
additionally make sure it isn't negative (since nodeid is an int).  The
check is there mainly to protect the array dereference in the get_node()
call in the next line, and the array being dereferenced is of size
MAX_NUMNODES.  If the nodeid is in range but invalid (for example if the
node is off-line), the BUG_ON in the next line will catch that.

Fixes: 14e50c6a9b ("mm: slab: Verify the nodeid passed to ____cache_alloc_node")
Signed-off-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03 09:36:04 -08:00
Daniel Forrest
c4ea95d7cd mm: fix anon_vma_clone() error treatment
Andrew Morton noticed that the error return from anon_vma_clone() was
being dropped and replaced with -ENOMEM (which is not itself a bug
because the only error return value from anon_vma_clone() is -ENOMEM).

I did an audit of callers of anon_vma_clone() and discovered an actual
bug where the error return was being lost.  In __split_vma(), between
Linux 3.11 and 3.12 the code was changed so the err variable is used
before the call to anon_vma_clone() and the default initial value of
-ENOMEM is overwritten.  So a failure of anon_vma_clone() will return
success since err at this point is now zero.

Below is a patch which fixes this bug and also propagates the error
return value from anon_vma_clone() in all cases.

Fixes: ef0855d334 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
Signed-off-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tim Hartrick <tim@edgecast.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03 09:36:04 -08:00