Commit Graph

514113 Commits

Author SHA1 Message Date
Konstantin Khlebnikov
761b06771a mm: completely remove dumping per-cpu lists from show_mem()
It seems nobody needs this.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Konstantin Khlebnikov
d1bfcdb8ce mm: hide per-cpu lists in output of show_mem()
This makes show_mem() much less verbose on huge machines.  Instead of huge
and almost useless dump of counters for each per-zone per-cpu lists this
patch prints the sum of these counters for each zone (free_pcp) and size
of per-cpu list for current cpu (local_pcp).

The filter flag SHOW_MEM_PERCPU_LISTS reverts to the old verbose mode.

[akpm@linux-foundation.org: update show_free_areas comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Konstantin Khlebnikov
b9ea25152e page_writeback: clean up mess around cancel_dirty_page()
This patch replaces cancel_dirty_page() with a helper function
account_page_cleaned() which only updates counters.  It's called from
truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
Page is locked in both cases, page-lock protects against concurrent
dirtiers: see commit 2d6d7f9828 ("mm: protect set_page_dirty() from
ongoing truncation").

Delete_from_page_cache() shouldn't be called for dirty pages, they must
be handled by caller (either written or truncated).  This patch treats
final dirty accounting fixup at the end of __delete_from_page_cache() as
a debug check and adds WARN_ON_ONCE() around it.  If something removes
dirty pages without proper handling that might be a bug and unwritten
data might be lost.

Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
here.

cancel_dirty_page() in nfs_wb_page_cancel() is redundant.  This is
helper for nfs_invalidate_page() and it's called only in case complete
invalidation.

The mess was started in v2.6.20 after commits 46d2277c79 ("Clean up
and make try_to_free_buffers() not race with dirty pages") and
3e67c0987d ("truncate: clear page dirtiness before running
try_to_free_buffers()") first was reverted right in v2.6.20 in commit
ecdfc9787f ("Resurrect 'try_to_free_buffers()' VM hackery"), second in
v2.6.25 commit a2b345642f ("Fix dirty page accounting leak with ext3
data=journal").

Custom fixes were introduced between these points.  NFS in v2.6.23, commit
1b3b4a1a2d ("NFS: Fix a write request leak in nfs_invalidate_page()").
Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f ("Do
dirty page accounting when removing a page from the page cache").  Since
v2.6.25 all of them are redundant.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Ebru Akagunduz
ca0984caa8 mm: incorporate zero pages into transparent huge pages
This patch improves THP collapse rates, by allowing zero pages.

Currently THP can collapse 4kB pages into a THP when there are up to
khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch counts
pte none and mapped zero pages with the same variable.

The patch was tested with a program that allocates 800MB of
memory, and performs interleaved reads and writes, in a pattern
that causes some 2MB areas to first see read accesses, resulting
in the zero pfn being mapped there.

To simulate memory fragmentation at allocation time, I modified
do_huge_pmd_anonymous_page to return VM_FAULT_FALLBACK for read faults.

Without the patch, only %50 of the program was collapsed into THP and the
percentage did not increase over time.

With this patch after 10 minutes of waiting khugepaged had collapsed %99
of the program's memory.

[aarcange@redhat.com: fix bogus BUG()]
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Joonsoo Kim
2149cdaef6 mm/compaction: enhance compaction finish condition
Compaction has anti fragmentation algorithm.  It is that freepage should
be more than pageblock order to finish the compaction if we don't find any
freepage in requested migratetype buddy list.  This is for mitigating
fragmentation, but, there is a lack of migratetype consideration and it is
too excessive compared to page allocator's anti fragmentation algorithm.

Not considering migratetype would cause premature finish of compaction.
For example, if allocation request is for unmovable migratetype, freepage
with CMA migratetype doesn't help that allocation and compaction should
not be stopped.  But, current logic regards this situation as compaction
is no longer needed, so finish the compaction.

Secondly, condition is too excessive compared to page allocator's logic.
We can steal freepage from other migratetype and change pageblock
migratetype on more relaxed conditions in page allocator.  This is
designed to prevent fragmentation and we can use it here.  Imposing hard
constraint only to the compaction doesn't help much in this case since
page allocator would cause fragmentation again.

To solve these problems, this patch borrows anti fragmentation logic from
page allocator.  It will reduce premature compaction finish in some cases
and reduce excessive compaction work.

stress-highalloc test in mmtests with non movable order 7 allocation shows
considerable increase of compaction success rate.

Compaction success rate (Compaction success * 100 / Compaction stalls, %)
31.82 : 42.20

I tested it on non-reboot 5 runs stress-highalloc benchmark and found that
there is no more degradation on allocation success rate than before.  That
roughly means that this patch doesn't result in more fragmentations.

Vlastimil suggests additional idea that we only test for fallbacks when
migration scanner has scanned a whole pageblock.  It looked good for
fragmentation because chance of stealing increase due to making more free
pages in certain pageblock.  So, I tested it, but, it results in decreased
compaction success rate, roughly 38.00.  I guess the reason that if system
is low memory condition, watermark check could be failed due to not enough
order 0 free page and so, sometimes, we can't reach a fallback check
although migrate_pfn is aligned to pageblock_nr_pages.  I can insert code
to cope with this situation but it makes code more complicated so I don't
include his idea at this patch.

[akpm@linux-foundation.org: fix CONFIG_CMA=n build]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Joonsoo Kim
4eb7dce620 mm/page_alloc: factor out fallback freepage checking
This is preparation step to use page allocator's anti fragmentation logic
in compaction.  This patch just separates fallback freepage checking part
from fallback freepage management part.  Therefore, there is no functional
change.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
Joonsoo Kim
dc67647b78 mm/cma: change fallback behaviour for CMA freepage
Freepage with MIGRATE_CMA can be used only for MIGRATE_MOVABLE and they
should not be expanded to other migratetype buddy list to protect them
from unmovable/reclaimable allocation.  Implementing these requirements in
__rmqueue_fallback(), that is, finding largest possible block of freepage
has bad effect that high order freepage with MIGRATE_CMA are broken
continually although there are suitable order CMA freepage.  Reason is
that they are not be expanded to other migratetype buddy list and next
__rmqueue_fallback() invocation try to finds another largest block of
freepage and break it again.  So, MIGRATE_CMA fallback should be handled
separately.  This patch introduces __rmqueue_cma_fallback(), that just
wrapper of __rmqueue_smallest() and call it before __rmqueue_fallback() if
migratetype == MIGRATE_MOVABLE.

This results in unintended behaviour change that MIGRATE_CMA freepage is
always used first rather than other migratetype as movable allocation's
fallback.  But, as already mentioned above, MIGRATE_CMA can be used only
for MIGRATE_MOVABLE, so it is better to use MIGRATE_CMA freepage first as
much as possible.  Otherwise, we needlessly take up precious freepages
with other migratetype and increase chance of fragmentation.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:01 -07:00
David Rientjes
30467e0b3b mm, hotplug: fix concurrent memory hot-add deadlock
There's a deadlock when concurrently hot-adding memory through the probe
interface and switching a memory block from offline to online.

When hot-adding memory via the probe interface, add_memory() first takes
mem_hotplug_begin() and then device_lock() is later taken when registering
the newly initialized memory block.  This creates a lock dependency of (1)
mem_hotplug.lock (2) dev->mutex.

When switching a memory block from offline to online, dev->mutex is first
grabbed in device_online() when the write(2) transitions an existing
memory block from offline to online, and then online_pages() will take
mem_hotplug_begin().

This creates a lock inversion between mem_hotplug.lock and dev->mutex.
Vitaly reports that this deadlock can happen when kworker handling a probe
event races with systemd-udevd switching a memory block's state.

This patch requires the state transition to take mem_hotplug_begin()
before dev->mutex.  Hot-adding memory via the probe interface creates a
memory block while holding mem_hotplug_begin(), there is no way to take
dev->mutex first in this case.

online_pages() and offline_pages() are only called when transitioning
memory block state.  We now require that mem_hotplug_begin() is taken
before calling them -- this requires exporting the mem_hotplug_begin() and
mem_hotplug_done() to generic code.  In all hot-add and hot-remove cases,
mem_hotplug_begin() is done prior to device_online().  This is all that is
needed to avoid the deadlock.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sasha Levin
17e0db822b cma: debug: document new debugfs interface
Document the structure and files under the new debugfs interface.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Andrew Morton
875abdb6d4 mm-cma-allocation-trigger-fix
s/CONFIG_CMA_ALIGNMENT/0/, per Joonsoo

Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sasha Levin
8325330b02 mm: cma: release trigger
Provides a userspace interface to trigger a CMA release.

Usage:

        echo [pages] > free

This would provide testing/fuzzing access to the CMA release paths.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.cz: fix build]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sasha Levin
26b02a1f96 mm: cma: allocation trigger
Provides a userspace interface to trigger a CMA allocation.

Usage:

        echo [pages] > alloc

This would provide testing/fuzzing access to the CMA allocation paths.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sasha Levin
28b24c1fc8 mm: cma: debugfs interface
I've noticed that there is no interfaces exposed by CMA which would let me
fuzz what's going on in there.

This small patchset exposes some information out to userspace, plus adds
the ability to trigger allocation and freeing from userspace.

This patch (of 3):

Implement a simple debugfs interface to expose information about CMA areas
in the system.

Useful for testing/sanity checks for CMA since it was impossible to
previously retrieve this information in userspace.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Sheng Yong
19c07d5e04 memory hotplug: use macro to switch between section and pfn
Use macro section_nr_to_pfn() to switch between section and pfn, instead
of open-coding it.  No semantic changes.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Johannes Weiner
1575e68b3c mm: memcontrol: update copyright notice
Add myself to the list of copyright holders.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Baoquan He
7fc825b456 mm/memblock.c: rename local variable of memblock_type to `type'
A small cleanup.  Seems in e3239ff9 ("memblock: Rename memblock_region to
memblock_type and memblock_property to memblock_region") this one was
missed.

Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Kirill A. Shutemov
acc3c8d15e mm: move mm_populate()-related code to mm/gup.c
It's odd that we have populate_vma_page_range() and __mm_populate() in
mm/mlock.c.  It's implementation of generic memory population and mlocking
is one of possible side effect, if VM_LOCKED is set.

__get_user_pages() is core of the implementation.  Let's move the code
into mm/gup.c.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Kirill A. Shutemov
c561259ca7 mm: move gup() -> posix mlock() error conversion out of __mm_populate
This is praparation to moving mm_populate()-related code out of
mm/mlock.c.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Kirill A. Shutemov
fc05f56621 mm: rename __mlock_vma_pages_range() to populate_vma_page_range()
__mlock_vma_pages_range() doesn't necessarily mlock pages.  It depends on
vma flags.  The same codepath is used for MAP_POPULATE.

Let's rename __mlock_vma_pages_range() to populate_vma_page_range().

This patch also drops mlock_vma_pages_range() references from
documentation.  It has gone in cea10a19b7 ("mm: directly use
__mlock_vma_pages_range() in find_extend_vma()").

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:00 -07:00
Kirill A. Shutemov
84d33df279 mm: rename FOLL_MLOCK to FOLL_POPULATE
After commit a1fde08c74 ("VM: skip the stack guard page lookup in
get_user_pages only for mlock") FOLL_MLOCK has lost its original
meaning: we don't necessarily mlock the page if the flags is set -- we
also take VM_LOCKED into consideration.

Since we use the same codepath for __mm_populate(), let's rename
FOLL_MLOCK to FOLL_POPULATE.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Fabian Frederick
c21a6daf46 slob: make slob_alloc_node() static and remove EXPORT_SYMBOL()
slob_alloc_node() is only used in slob.c.  Remove the EXPORT_SYMBOL and
make slob_alloc_node() static.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Joe Perches
6f6528a163 slub: use bool function return values of true/false not 1/0
Use the normal return values for bool functions

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
David Rientjes
124dee09f0 mm, slab: correct config option in comment
CONFIG_SLAB_DEBUG doesn't exist, CONFIG_DEBUG_SLAB does.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Chris J Arges
08303a73c6 mm/slub.c: parse slub_debug O option in switch statement
By moving the O option detection into the switch statement, we allow this
parameter to be combined with other options correctly.  Previously options
like slub_debug=OFZ would only detect the 'o' and use DEBUG_DEFAULT_FLAGS
to fill in the rest of the flags.

Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Geert Uytterhoeven
ef2a5153b4 mm/migrate: mark unmap_and_move() "noinline" to avoid ICE in gcc 4.7.3
With gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-12ubuntu1) :

    mm/migrate.c: In function `migrate_pages':
    mm/migrate.c:1148:1: internal compiler error: in push_minipool_fix, at config/arm/arm.c:13500
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See <file:///usr/share/doc/gcc-4.7/README.Bugs> for instructions.
    Preprocessed source stored into /tmp/ccPoM1tr.out file, please attach this to your bugreport.
    make[1]: *** [mm/migrate.o] Error 1
    make: *** [mm/migrate.o] Error 2

Mark unmap_and_move() (which is used in a single place only) "noinline"
to work around this compiler bug.

[akpm@linux-foundation.org: make it conditional on gcc-4.7.3 and arm]
[khilman@kernel.org: fine-tune compiler versions]
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reported-by: Kevin Hilman <khilman@kernel.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Kevin Hilman <khilman@linaro.org>
Tested-by: Lina Iyer <lina.iyer@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell
692297d8f9 watchdog: introduce the hardlockup_detector_disable() function
Have kvm_guest_init() use hardlockup_detector_disable() instead of
watchdog_enable_hardlockup_detector(false).

Remove the watchdog_hardlockup_detector_is_enabled() and the
watchdog_enable_hardlockup_detector() function which are no longer needed.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell
b2f57c3a0d watchdog: clean up some function names and arguments
Rename the update_timers*() functions to update_watchdog*().

Remove the boolean argument from watchdog_enable_all_cpus() because
update_watchdog_all_cpus() is now a generic function to change the run
state of the lockup detectors and to have the lockup detectors use a new
sample period.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell
195daf665a watchdog: enable the new user interface of the watchdog mechanism
With the current user interface of the watchdog mechanism it is only
possible to disable or enable both lockup detectors at the same time.
This series introduces new kernel parameters and changes the semantics of
some existing kernel parameters, so that the hard lockup detector and the
soft lockup detector can be disabled or enabled individually.  With this
series applied, the user interface is as follows.

- parameters in /proc/sys/kernel

  . soft_watchdog
    This is a new parameter to control and examine the run state of
    the soft lockup detector.

  . nmi_watchdog
    The semantics of this parameter have changed. It can now be used
    to control and examine the run state of the hard lockup detector.

  . watchdog
    This parameter is still available to control the run state of both
    lockup detectors at the same time. If this parameter is examined,
    it shows the logical OR of soft_watchdog and nmi_watchdog.

  . watchdog_thresh
    The semantics of this parameter are not affected by the patch.

- kernel command line parameters

  . nosoftlockup
    The semantics of this parameter have changed. It can now be used
    to disable the soft lockup detector at boot time.

  . nmi_watchdog=0 or nmi_watchdog=1
    Disable or enable the hard lockup detector at boot time. The patch
    introduces '=1' as a new option.

  . nowatchdog
    The semantics of this parameter are not affected by the patch. It
    is still available to disable both lockup detectors at boot time.

Also, remove the proc_dowatchdog() function which is no longer needed.

[dzickus@redhat.com: wrote changelog]
[dzickus@redhat.com: update documentation for kernel params and sysctl]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell
bcfba4f4bf watchdog: implement error handling for failure to set up hardware perf events
If watchdog_nmi_enable() fails to set up the hardware perf event of one
CPU, the entire hard lockup detector is deemed unreliable.  Hence, disable
the hard lockup detector and shut down the hardware perf events on all
CPUs.

[dzickus@redhat.com: update comments to explain some code]
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell
83a80a3907 watchdog: introduce separate handlers for parameters in /proc/sys/kernel
Separate handlers for each watchdog parameter in /proc/sys/kernel replace
the proc_dowatchdog() function.  Three of those handlers merely call
proc_watchdog_common() with one different argument.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell
ef246a216b watchdog: introduce proc_watchdog_common()
Three of four handlers for the watchdog parameters in /proc/sys/kernel
essentially have to do the same thing.

  if the parameter is being read {
    return the state of the corresponding bit(s) in 'watchdog_enabled'
  } else {
    set/clear the state of the corresponding bit(s) in 'watchdog_enabled'
    update the run state of the lockup detector(s)
  }

Hence, introduce a common function that can be called by those handlers.
The callers pass a 'bit mask' to this function to indicate which bit(s)
should be set/cleared in 'watchdog_enabled'.

This function handles an uncommon race with watchdog_nmi_enable() where a
concurrent update of 'watchdog_enabled' is possible.  We use 'cmpxchg' to
detect the concurrency.  [This avoids introducing a new spinlock or a
mutex to synchronize updates of 'watchdog_enabled'.  Using the same lock
or mutex in watchdog thread context and in system call context needs to be
considered carefully because it can make the code prone to deadlock
situations in connection with parking/unparking the watchdog threads.]

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:59 -07:00
Ulrich Obergfell
f54c2274f5 watchdog: move definition of 'watchdog_proc_mutex' outside of proc_dowatchdog()
This series removes proc_dowatchdog().  Since multiple new functions need
the 'watchdog_proc_mutex' to serialize access to the watchdog parameters
in /proc/sys/kernel, move the mutex outside of any function.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Ulrich Obergfell
a0c9cbb93d watchdog: introduce the proc_watchdog_update() function
This series introduces a separate handler for each watchdog parameter in
/proc/sys/kernel.  The separate handlers need a common function that they
can call to update the run state of the lockup detectors, or to have the
lockup detectors use a new sample period.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Ulrich Obergfell
84d56e66b9 watchdog: new definitions and variables, initialization
The hardlockup and softockup had always been tied together.  Due to the
request of KVM folks, they had a need to have one enabled but not the
other.  Internally rework the code to split things apart more cleanly.

There is a bunch of churn here, but the end result should be code that
should be easier to maintain and fix without knowing the internals of what
is going on.

This patch (of 9):

Introduce new definitions and variables to separate the user interface in
/proc/sys/kernel from the internal run state of the lockup detectors.  The
internal run state is represented by two bits in a new variable that is
named 'watchdog_enabled'.  This helps simplify the code, for example:

- In order to check if any of the two lockup detectors is enabled,
  it is sufficient to check if 'watchdog_enabled' is not zero.

- In order to enable/disable one or both lockup detectors,
  it is sufficient to set/clear one or both bits in 'watchdog_enabled'.

- Concurrent updates of 'watchdog_enabled' need not be synchronized via
  a spinlock or a mutex. Updates can either be atomic or concurrency can
  be detected by using 'cmpxchg'.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Andrew Morton
1d5b897706 ocfs2: make mlog_errno return the errno
ocfs2 does

        mlog_errno(v);
        return v;

in many places.  Change mlog_errno() so we can do

        return mlog_errno(v);

For some weird reason this patch reduces the size of ocfs2 by 6k:

  akpm3:/usr/src/25> size fs/ocfs2/ocfs2.ko
     text    data     bss     dec     hex filename
  1146613   82767  832192 2061572  1f7504 fs/ocfs2/ocfs2.ko-before
  1140857   82767  832192 2055816  1f5e88 fs/ocfs2/ocfs2.ko-after

[dan.carpenter@oracle.com: double evaluation concerns in mlog_errno()]
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
alex chen
2f2eca20a0 ocfs2: check if the ocfs2 lock resource has been initialized before calling ocfs2_dlm_lock
If ocfs2 lockres has not been initialized before calling ocfs2_dlm_lock,
the lock won't be dropped and then will lead umount hung.  The case is
described below:

ocfs2_mknod
    ocfs2_mknod_locked
        __ocfs2_mknod_locked
            ocfs2_journal_access_di
            Failed because of -ENOMEM or other reasons, the inode lockres
            has not been initialized yet.

    iput(inode)
        ocfs2_evict_inode
            ocfs2_delete_inode
                ocfs2_inode_lock
                    ocfs2_inode_lock_full_nested
                        __ocfs2_cluster_lock
                        Succeeds and allocates a new dlm lockres.
            ocfs2_clear_inode
                ocfs2_open_unlock
                    ocfs2_drop_inode_locks
                        ocfs2_drop_lock
                        Since lockres has not been initialized, the lock
                        can't be dropped and the lockres can't be
                        migrated, thus umount will hang forever.

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Joe Perches
1543306e75 ocfs2: logging: remove static buffer, use vsprintf extension %pV
Use the vsprintf %pV extension to avoid using a static buffer and remove
the now unnecessary buffer.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Chengyu Song
e2ac55b6a8 ocfs2: incorrect check for debugfs returns
debugfs_create_dir and debugfs_create_file may return -ENODEV when debugfs
is not configured, so the return value should be checked against
ERROR_VALUE as well, otherwise the later dereference of the dentry pointer
would crash the kernel.

This patch tries to solve this problem by fixing certain checks. However,
I have that found other call sites are protected by #ifdef CONFIG_DEBUG_FS.
In current implementation, if CONFIG_DEBUG_FS is defined, then the above
two functions will never return any ERROR_VALUE. So another possibility
to fix this is to surround all the buggy checks/functions with the same
#ifdef CONFIG_DEBUG_FS. But I'm not sure if this would break any functionality,
as only OCFS2_FS_STATS declares dependency on DEBUG_FS.

Signed-off-by: Chengyu Song <csong84@gatech.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Jakub Wilk
762515a8e9 ocfs2: fix a typo in the copyright statement
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Reviewed-by: Eric Ren <zren@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Joseph Qi
023d4ea358 ocfs2: fix possible uninitialized variable access
In ocfs2_local_alloc_find_clear_bits and ocfs2_get_dentry, variable
numfound and set may be uninitialized and then used in tracepoint.  In
ocfs2_xattr_block_get and ocfs2_delete_xattr_in_bucket, variable block_off
and xv may be uninitialized and then used in the following logic due to
unchecked return value.

This patch fixes these possible issues.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Daeseok Youn
7c01ad8fe7 ocfs2: remove goto statement in ocfs2_check_dir_for_entry()
Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Joseph Qi
a47726bcf2 ocfs2: rollback the cleared bits if error occurs after ocfs2_block_group_clear_bits
ocfs2_block_group_clear_bits will clear bits in block group bitmap.
Once it succeeds but fails in the following step, it will cause block
group bitmap mismatch the corresponding count recorded in dinode.
So rollback the cleared bits if error occurs.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Joseph Qi
62f8b1f0d6 ocfs2: use ENOENT instead of EEXIST when get system file fails
When ocfs2_get_system_file_inode fails, it is obscure to set the return
value to -EEXIST. So change it to -ENOENT.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Joseph Qi
e38a573907 ocfs2: use actual name length when find entry in ocfs2_orphan_del()
If the namelen is 20 and name only has actual length 16, it will fail in
ocfs2_find_entry because of mismatch.  So use actual name length when find
entry.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:58 -07:00
Dan Carpenter
e073fc58df ocfs2: dereferencing freed pointers in ocfs2_reflink()
The code at the "out" label assumes that "default_acl" and "acl" are NULL,
but actually the pointers can be NULL, unitialized, or freed.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:57 -07:00
Joseph Qi
d0ba25b905 ocfs2: fix typo in ocfs2_reserve_local_alloc_bits
In ocfs2_reserve_local_alloc_bits, it calls ocfs2_error if local alloc
inode bitmap used bits mismatch, but the log mistakes it as free bits.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:57 -07:00
Joseph Qi
14a5275d8c ocfs2: do not use ocfs2_zero_extend during direct IO
In ocfs2_direct_IO_write, we use ocfs2_zero_extend to zero allocated
clusters in case of cluster not aligned.  But ocfs2_zero_extend uses page
cache, this may happen that it clears the data which blockdev_direct_IO
has already written.

We should use blkdev_issue_zeroout instead of ocfs2_zero_extend during
direct IO.

So fix this issue by introducing ocfs2_direct_IO_zero_extend and
ocfs2_direct_IO_extend_no_holes.

Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Tested-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:57 -07:00
Joseph Qi
37a8d89aee ocfs2: take inode lock when get clusters
We need take inode lock when calling ocfs2_get_clusters.
And use GFP_NOFS instead of GFP_KERNEL.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:57 -07:00
Joseph Qi
7e9b19551c ocfs2: no need get dinode bh when zeroing extend
Since di_bh won't be used when zeroing extend, set it to NULL.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:57 -07:00
Joseph Qi
bdd86215b3 ocfs2: fix a typing error in ocfs2_direct_IO_write
Only when direct IO succeeds we need consider zeroing out in case of
cluster not aligned.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:48:57 -07:00