Commit Graph

596761 Commits

Author SHA1 Message Date
Michal Hocko
ede3771373 mm: throttle on IO only when there are too many dirty and writeback pages
wait_iff_congested has been used to throttle allocator before it retried
another round of direct reclaim to allow the writeback to make some
progress and prevent reclaim from looping over dirty/writeback pages
without making any progress.

We used to do congestion_wait before commit 0e093d9976 ("writeback: do
not sleep on the congestion queue if there are no congested BDIs or if
significant congestion is not being encountered in the current zone")
but that led to undesirable stalls and sleeping for the full timeout
even when the BDI wasn't congested.  Hence wait_iff_congested was used
instead.

But it seems that even wait_iff_congested doesn't work as expected.  We
might have a small file LRU list with all pages dirty/writeback and yet
the bdi is not congested so this is just a cond_resched in the end and
can end up triggering pre mature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation.  This shouldn't reintroduce stalls
fixed by 0e093d9976 because congestion_wait is called only when we are
getting hopeless when sleeping is a better choice than OOM with many
pages under IO.

We have to preserve logic introduced by commit 373ccbe592 ("mm,
vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
progress") into the __alloc_pages_slowpath now that wait_iff_congested
is not used anymore.  As the only remaining user of wait_iff_congested
is shrink_inactive_list we can remove the WQ specific short sleep from
wait_iff_congested because the sleep is needed to be done only once in
the allocation retry cycle.

[mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
 Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
0a0337e0d1 mm, oom: rework oom detection
__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry
allocation rather than declaring OOM.  shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress to prevent from
a premature OOM killer invocation - the LRU might be full of dirty or
writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows to rescan the reclaimable lists several times
and restart if a page is freed.  This is really subtle behavior and it
might lead to a livelock when a single freed page keeps allocator
looping but the current task will not be able to allocate that single
page.  OOM killer would be more appropriate than looping without any
progress for unbounded amount of time.

This patch changes OOM detection logic and pulls it out from shrink_zone
which is too low to be appropriate for any high level decisions such as
OOM which is per zonelist property.  It is __alloc_pages_slowpath which
knows how many attempts have been done and what was the progress so far
therefore it is more appropriate to implement this logic.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath.  It tries to be more deterministic and
easier to follow.  It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context.  Therefore there is a
backoff mechanism implemented which reduces the reclaim target after
each reclaim round without any progress.  This means that we should
eventually converge to only NR_FREE_PAGES as the target and fail on the
wmark check and proceed to OOM.  The backoff is simple and linear with
1/16 of the reclaimable pages for each round without any progress.  We
are optimistic and reset counter for successful reclaim rounds.

Costly high order pages mostly preserve their semantic and those without
__GFP_REPEAT fail right away while those which have the flag set will
back off after the amount of reclaimable pages reaches equivalent of the
requested order.  The only difference is that if there was no progress
during the reclaim we rely on zone watermark check.  This is more
logical thing to do than previous 1<<order attempts which were a result
of zone_reclaimable faking the progress.

[vdavydov@virtuozzo.com: check classzone_idx for shrink_zone]
[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
cab1802b5f mm, compaction: abstract compaction feedback to helpers
Compaction can provide a wild variation of feedback to the caller.  Many
of them are implementation specific and the caller of the compaction
(especially the page allocator) shouldn't be bound to specifics of the
current implementation.

This patch abstracts the feedback into three basic types:
	- compaction_made_progress - compaction was active and made some
	  progress.
	- compaction_failed - compaction failed and further attempts to
	  invoke it would most probably fail and therefore it is not
	  worth retrying
	- compaction_withdrawn - compaction wasn't invoked for an
          implementation specific reasons. In the current implementation
          it means that the compaction was deferred, contended or the
          page scanners met too early without any progress. Retrying is
          still worthwhile.

[vbabka@suse.cz: do not change thp back off behavior]
[akpm@linux-foundation.org: fix typo in comment, per Hillf]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
c5d01d0d18 mm, compaction: simplify __alloc_pages_direct_compact feedback interface
__alloc_pages_direct_compact communicates potential back off by two
variables:
	- deferred_compaction tells that the compaction returned
	  COMPACT_DEFERRED
	- contended_compaction is set when there is a contention on
	  zone->lock resp. zone->lru_lock locks

__alloc_pages_slowpath then backs of for THP allocation requests to
prevent from long stalls. This is rather messy and it would be much
cleaner to return a single compact result value and hide all the nasty
details into __alloc_pages_direct_compact.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
4f9a358c36 mm, compaction: update compaction_result ordering
compaction_result will be used as the primary feedback channel for
compaction users.  At the same time try_to_compact_pages (and
potentially others) assume a certain ordering where a more specific
feedback takes precendence.

This gets a bit awkward when we have conflicting feedback from different
zones.  E.g one returing COMPACT_COMPLETE meaning the full zone has been
scanned without any outcome while other returns with COMPACT_PARTIAL aka
made some progress.  The caller should get COMPACT_PARTIAL because that
means that the compaction still can make some progress.  The same
applies for COMPACT_PARTIAL vs COMPACT_PARTIAL_SKIPPED.

Reorder PARTIAL to be the largest one so the larger the value is the
more progress we have done.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
c8f7de0bfa mm, compaction: distinguish between full and partial COMPACT_COMPLETE
COMPACT_COMPLETE now means that compaction and free scanner met.  This
is not very useful information if somebody just wants to use this
feedback and make any decisions based on that.  The current caller might
be a poor guy who just happened to scan tiny portion of the zone and
that could be the reason no suitable pages were compacted.  Make sure we
distinguish the full and partial zone walks.

Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
and be optimistic in retrying.

The existing users of COMPACT_COMPLETE are conservatively changed to use
COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
reconsidered and only defer the compaction only for COMPACT_COMPLETE
with the new semantic.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
1d4746d395 mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED
try_to_compact_pages() can currently return COMPACT_SKIPPED even when
the compaction is defered for some zone just because zone DMA is skipped
in 99% of cases due to watermark checks.  This makes COMPACT_DEFERRED
basically unusable for the page allocator as a feedback mechanism.

Make sure we distinguish those two states properly and switch their
ordering in the enum.  This would mean that the COMPACT_SKIPPED will be
returned only when all eligible zones are skipped.

As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
will be more precise and we would bail out rather than reclaim.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
c46649deae mm, compaction: cover all compaction mode in compact_zone
The compiler is complaining after "mm, compaction: change COMPACT_
constants into enum"

  mm/compaction.c: In function `compact_zone':
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
    switch (ret) {
    ^
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]

compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible.  Put a VM_BUG_ON to catch an impossible return value.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
ea7ab982b6 mm, compaction: change COMPACT_ constants into enum
Compaction code is doing weird dances between COMPACT_FOO -> int ->
unsigned long

But there doesn't seem to be any reason for that.  All functions which
return/use one of those constants are not expecting any other value so it
really makes sense to define an enum for them and make it clear that no
other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko
b6459cc154 vmscan: consider classzone_idx in compaction_ready
Motivation:
As pointed out by Linus [2][3] relying on zone_reclaimable as a way to
communicate the reclaim progress is rater dubious. I tend to agree,
not only it is really obscure, it is not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
getting any progress because their gfp_mask wouldn't allow to get that
page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather
rare so it doesn't happen in the practice but the current logic which we
have is rather obscure and hard to follow a also non-deterministic.

This is an attempt to make the OOM detection more deterministic and
easier to follow because each reclaimer basically tracks its own
progress which is implemented at the page allocator layer rather spread
out between the allocator and the reclaim.  The more on the
implementation is described in the first patch.

I have tested several different scenarios but it should be clear that
testing OOM killer is quite hard to be representative.  There is usually
a tiny gap between almost OOM and full blown OOM which is often time
sensitive.  Anyway, I have tested the following 2 scenarios and I would
appreciate if there are more to test.

Testing environment: a virtual machine with 2G of RAM and 2CPUs without
any swap to make the OOM more deterministic.

1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
   file size, removes the files and starts over again) running in
   parallel for 10s to build up a lot of dirty pages when 100 parallel
   mem_eaters (anon private populated mmap which waits until it gets
   signal) with 80M each.

   This causes an OOM flood of course and I have compared both patched
   and unpatched kernels. The test is considered finished after there
   are no OOM conditions detected. This should tell us whether there are
   any excessive kills or some of them premature (e.g. due to dirty pages):

I have performed two runs this time each after a fresh boot.

* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78

$ grep "Kill process" base-oom-run1.log | tail -n1
[   91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[   82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3

* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l
77

e grep "Kill process" patched-oom-run1.log | tail -n1
[  497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[  316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77

e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3

The patched kernel run noticeably longer while invoking OOM killer same
number of times. This means that the original implementation is much
more aggressive and triggers the OOM killer sooner. free pages stats
show that neither kernels went OOM too early most of the time, though. I
guess the difference is in the backoff when retries without any progress
do sleep for a while if there is memory under writeback or dirty which
is highly likely considering the parallel IO.
Both kernels have seen races where zone wasn't marked unreclaimable
and we still hit the OOM killer. This is most likely a race where
a task managed to exit between the last allocation attempt and the oom
killer invocation.

2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
   memory as possible without triggering the OOM killer. This required a lot
   of tuning but I've considered 3 consecutive runs in three different boots
   without OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)

That means 40M more memory was usable without triggering OOM killer. The
base kernel sometimes managed to handle the same as patched but it
wasn't consistent and failed in at least on of the 3 runs. This seems
like a minor improvement.

I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented
memory and under memory pressure. The results are in patch 11 where the
logic is implemented. In short I can see huge improvement there.

I am certainly interested in other usecases as well as well as any
feedback. Especially those which require higher order requests.

This patch (of 14):

While playing with the oom detection rework [1] I have noticed that my
heavy order-9 (hugetlb) load close to OOM ended up in an endless loop
where the reclaim hasn't made any progress but did_some_progress didn't
reflect that and compaction_suitable was backing off because no zone is
above low wmark + 1 << order.

It turned out that this is in fact an old standing bug in
compaction_ready which ignores the requested_highidx and did the
watermark check for 0 classzone_idx.  This succeeds for zone DMA most
of the time as the zone is mostly unused because of lowmem protection.
As a result costly high order allocatios always report a successfull
progress even when there was none.  This wasn't a problem so far
because these allocations usually fail quite early or retry only few
times with __GFP_REPEAT but this will change after later patch in this
series so make sure to not lie about the progress and propagate
requested_highidx down to compaction_ready and use it for both the
watermak check and compaction_suitable to fix this issue.

[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
[2] https://lkml.org/lkml/2015/10/12/808
[3] https://lkml.org/lkml/2015/10/13/597

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Rik van Riel
59dc76b0d4 mm: vmscan: reduce size of inactive file list
The inactive file list should still be large enough to contain readahead
windows and freshly written file data, but it no longer is the only
source for detecting multiple accesses to file pages.  The workingset
refault measurement code causes recently evicted file pages that get
accessed again after a shorter interval to be promoted directly to the
active list.

With that mechanism in place, we can afford to (on a larger system)
dedicate more memory to the active file list, so we can actually cache
more of the frequently used file pages in memory, and not have them
pushed out by streaming writes, once-used streaming file reads, etc.

This can help things like database workloads, where only half the page
cache can currently be used to cache the database working set.  This
patch automatically increases that fraction on larger systems, using the
same ratio that has already been used for anonymous memory.

[hannes@cmpxchg.org: cgroup-awareness]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Johannes Weiner
bbddabe2e4 mm: filemap: only do access activations on reads
Andres observed that his database workload is struggling with the
transaction journal creating pressure on frequently read pages.

Access patterns like transaction journals frequently write the same
pages over and over, but in the majority of cases those pages are never
read back.  There are no caching benefits to be had for those pages, so
activating them and having them put pressure on pages that do benefit
from caching is a bad choice.

Leave page activations to read accesses and don't promote pages based on
writes alone.

It could be said that partially written pages do contain cache-worthy
data, because even if *userspace* does not access the unwritten part,
the kernel still has to read it from the filesystem for correctness.
However, a counter argument is that these pages enjoy at least *some*
protection over other inactive file pages through the writeback cache,
in the sense that dirty pages are written back with a delay and cache
reclaim leaves them alone until they have been written back to disk.
Should that turn out to be insufficient and we see increased read IO
from partial writes under memory pressure, we can always go back and
update grab_cache_page_write_begin() to take (pos, len) so that it can
tell partial writes from pages that don't need partial reads.  But for
now, keep it simple.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Rik van Riel
f0281a00fe mm: workingset: only do workingset activations on reads
This is a follow-up to

  http://www.spinics.net/lists/linux-mm/msg101739.html

where Andres reported his database workingset being pushed out by the
minimum size enforcement of the inactive file list - currently 50% of
cache - as well as repeatedly written file pages that are never actually
read.

Two changes fell out of the discussions.  The first change observes that
pages that are only ever written don't benefit from caching beyond what
the writeback cache does for partial page writes, and so we shouldn't
promote them to the active file list where they compete with pages whose
cached data is actually accessed repeatedly.  This change comes in two
patches - one for in-cache write accesses and one for refaults triggered
by writes, neither of which should promote a cache page.

Second, with the refault detection we don't need to set 50% of the cache
aside for used-once cache anymore since we can detect frequently used
pages even when they are evicted between accesses.  We can allow the
active list to be bigger and thus protect a bigger workingset that isn't
challenged by streamers.  Depending on the access patterns, this can
increase major faults during workingset transitions for better
performance during stable phases.

This patch (of 3):

When rewriting a page, the data in that page is replaced with new data.
This means that evicting something else from the active file list, in
order to cache data that will be replaced by something else, is likely
to be a waste of memory.

It is better to save the active list for frequently read pages, because
reads actually use the data that is in the page.

This patch ignores partial writes, because it is unclear whether the
complexity of identifying those is worth any potential performance gain
obtained from better caching pages that see repeated partial writes at
large enough intervals to not get caught by the use-twice promotion code
used for the inactive file list.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Linus Torvalds
6eb59af580 - New Drivers
- Add new driver for MAXIM MAX77620/MAX20024 PMIC
    - Add new driver for Hisilicon HI665X PMIC
  - New Device Support
    - Add support for AXP809 in axp20x-rsb
    - Add support for Power Supply in axp20x
  - New core features
    - devm_mfd_* managed resources
  - Fix-ups
    - Remove unused code; da9063-irq, wm8400-core, tps6105x, smsc-ece1099,
 			 twl4030-power
    - Improve clean-up in error path; intel_quark_i2c_gpio
    - Explicitly include headers; syscon.h
    - Allow building as modules; max77693
    - Use IS_ENABLED() instead of rolling your own; dm355evm_msp, wm8400-core
    - DT adaptions; axp20x, hi655x, arizona, max77620
    - Remove CLK_IS_ROOT flag; intel-lpss, intel_quark
    - Move to gpiochip API; asic3, dm355evm_msp, htc-egpio, htc-i2cpld, sm501,
 				  tc6393xb, tps65010, ucb1x00, vexpress
    - Make use of devm_mfd_* calls; act8945a, as3711, atmel-hlcdc, bcm590xx,
 				   hi6421-pmic-core, lp3943, menf21bmc, mt6397,
 				   rdc321x, rk808, rn5t618, rt5033, sky81452,
 				   stw481x, tps6507x, tps65217, wm8400,
  - Bug Fixes
    - Fix ACPI child matching; mfd-core
    - Fix start-up ordering issues; mt6397-core, arizona-core
    - Fix forgotten register state on resume; intel-lpss
    - Fix Clock related issues; twl6040
    - Fix scheduling whilst atomic; omap-usb-tll
    - Kconfig changes; vexpress
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXPzwAAAoJEFGvii+H/HdhkPEP/iTvgiL+CpAk7UwAhCNolR5j
 l3vb49lOlmqx87zELdToJmySAd/byiZN0YQEmcn+t4BCs/8CeaWeNkb8vltJvuac
 Fmz88bhXfgFYk87nx/6tRMvuM3fKXlk/YYRZkklV7mkBjcPLiqBZSi/MG/SV53a9
 A+vGW56B2/vHiUgTBkYs9UZNqkFCkmhuVYbHjtFwTfL84lwy9u4tNRrktss6g1lx
 Ak9uiDhaUP/vxKe/7/qCTZXgV/IYb2+tcNjMJ+Cztmyht8VTrhGSXbDPH7MyRYUI
 EBBWRXAQelR5qHxOYDSBNIemZe3AniCBp7tjqcwlN9cdE8q9pJxOk+0XStjC+XeW
 Qt1aIwQisk8jfII8BIGr2pAzc8Jh9/TtcK+wKMRQ2o5g2tvcG90hHIJWQlbdy4ST
 SX799w0KvTItdaMhTHThTOfJRj777v/H2cj8DBCCEeoBHOCHnzbJSIuKahPa9PM3
 W0dyZOpsDXoegyksjBUYjdhGoggjEdirt+oXJe4rY7UxeEml+YZS54fseVzgNzNq
 //Nxk1GMNOVXgo3NrlO8JTs2G5gFPc8VOuPW60G1fm8DyNW13RbUG74QPpSd4U7S
 zZM/OZ3D0E4nrPjXf/GCS3QRwM7p1ubiOgSTTZkaLJYGBcHSezGXK8XpFSNReRop
 Un13GPM09Sl9VN9a2Ybi
 =FSmn
 -----END PGP SIGNATURE-----

Merge tag 'mfd-for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
 "New Drivers:
   - Add new driver for MAXIM MAX77620/MAX20024 PMIC
   - Add new driver for Hisilicon HI665X PMIC

  New Device Support:
   - Add support for AXP809 in axp20x-rsb
   - Add support for Power Supply in axp20x

  New core features:
   - devm_mfd_* managed resources

  Fix-ups:
   - Remove unused code (da9063-irq, wm8400-core, tps6105x,
     smsc-ece1099, twl4030-power)
   - Improve clean-up in error path (intel_quark_i2c_gpio)
   - Explicitly include headers (syscon.h)
   - Allow building as modules (max77693)
   - Use IS_ENABLED() instead of rolling your own (dm355evm_msp,
     wm8400-core)
   - DT adaptions (axp20x, hi655x, arizona, max77620)
   - Remove CLK_IS_ROOT flag (intel-lpss, intel_quark)
   - Move to gpiochip API (asic3, dm355evm_msp, htc-egpio, htc-i2cpld,
     sm501, tc6393xb, tps65010, ucb1x00, vexpress)
   - Make use of devm_mfd_* calls (act8945a, as3711, atmel-hlcdc,
     bcm590xx, hi6421-pmic-core, lp3943, menf21bmc, mt6397, rdc321x,
     rk808, rn5t618, rt5033, sky81452, stw481x, tps6507x, tps65217,
     wm8400)

  Bug Fixes"
   - Fix ACPI child matching (mfd-core)
   - Fix start-up ordering issues (mt6397-core, arizona-core)
   - Fix forgotten register state on resume (intel-lpss)
   - Fix Clock related issues (twl6040)
   - Fix scheduling whilst atomic (omap-usb-tll)
   - Kconfig changes (vexpress)"

* tag 'mfd-for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits)
  mfd: hi655x: Add MFD driver for hi655x
  mfd: ab8500-debugfs: Trivial fix of spelling mistake on "between"
  mfd: vexpress: Add !ARCH_USES_GETTIMEOFFSET dependency
  mfd: Add device-tree binding doc for PMIC MAX77620/MAX20024
  mfd: max77620: Add core driver for MAX77620/MAX20024
  mfd: arizona: Add defines for GPSW values that can be used from DT
  mfd: omap-usb-tll: Fix scheduling while atomic BUG
  mfd: wm5110: ARIZONA_CLOCK_CONTROL should be volatile
  mfd: axp20x: Add a cell for the ac power_supply part of the axp20x PMICs
  mfd: intel_soc_pmic_core: Terminate panel control GPIO lookup table correctly
  mfd: wl1273-core: Use devm_mfd_add_devices() for mfd_device registration
  mfd: tps65910: Use devm_mfd_add_devices and devm_regmap_add_irq_chip
  mfd: sec: Use devm_mfd_add_devices and devm_regmap_add_irq_chip
  mfd: rc5t583: Use devm_mfd_add_devices and devm_request_threaded_irq
  mfd: max77686: Use devm_mfd_add_devices and devm_regmap_add_irq_chip
  mfd: as3722: Use devm_mfd_add_devices and devm_regmap_add_irq_chip
  mfd: twl4030-power: Remove driver path in file comment
  MAINTAINERS: Add entry for X-Powers AXP family PMIC drivers
  mfd: smsc-ece1099: Remove unnecessarily remove callback
  mfd: Use IS_ENABLED(CONFIG_FOO) instead of checking FOO || FOO_MODULE
  ...
2016-05-20 11:10:24 -07:00
Linus Torvalds
4d230d4d03 HSI changes for the v4.7 series
* merge omap-ssi and omap-ssi-port modules
  * fix omap-ssi module reloading
  * add DVFS support to omap-ssi
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJXPpJpAAoJENju1/PIO/qaRWQP/ia/qFECRFiObARtny5zUtMS
 NX/QIXz96XozP0BhP58yS/qXP2N+RUZzXuft8NnlIgM7xcOScUu07WL3nPZ7sQO/
 QI7NL3NJtuh2UWKFa6mjEZiHkprM/qvcGPu2cY4SLGGczNbKBESBdnl0wUUHftqg
 vSoRtr8mK1zVPFCpoKOFcZkVq2mhzU6re5CxXwwRCsM55qwtaaQkBk2qzn9ZkQvm
 AVB39zh8QitpIZbJtM45ZHkoVzq12MuqYXvLXO7wUUxrS2A9aZn//J+D7UEc5e4l
 QArC+N71wLCioYqGQXvKDUWsPybjtIT0CLljaXVyRn+3ZHoCrWrbNRVz7bU28HQo
 gXM0BhDuexdZsWE6aWvlmSBr1jFi1Hwp7POyYHLwTKpqUmJTbEd+F/HP9UxkVMpo
 d9j0wiIziTeRYvQ/o0maFkkARQkirJGv+zL1nZr5Q7q4WqdXNd7MDwN6qsi2QAD5
 EkJpsetllH8e5JZqT4drFMtAOypgstSpzuU5EuyL5EO6ES7apzGQhNGW+cWlVtCb
 8J0HYpa3T39ekNPIWbuHcGatKJYdPZVlLnDAZiQRjdhGAahS1i52MwjIlB68AWoc
 hYGAccDdwgarjp+ce5QFu6cGiX2jOt7zgO1PhQAXY8Yzt9q7PJPaR8mxdxUNiCvv
 KZ2DzE1GRlBD1TP+NMQc
 =4zdt
 -----END PGP SIGNATURE-----

Merge tag 'hsi-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi

Pull HSI updates from Sebastian Reichel:

 - merge omap-ssi and omap-ssi-port modules

 - fix omap-ssi module reloading

 - add DVFS support to omap-ssi

* tag 'hsi-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi:
  HSI: omap-ssi: move omap_ssi_port_update_fclk
  HSI: omap-ssi: include pinctrl header files
  HSI: omap-ssi: add COMMON_CLK dependency
  HSI: omap-ssi: add clk change support
  HSI: omap_ssi: built omap_ssi and omap_ssi_port into one module
  HSI: omap_ssi: fix removal of port platform device
  HSI: omap_ssi: make sure probe stays available
  HSI: omap_ssi: fix module unloading
  HSI: omap_ssi_port: switch to gpiod API
2016-05-20 11:05:40 -07:00
Linus Torvalds
410b42978a fbdev changes for 4.7
* imxfb: fix lcd power up
 * small fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXPshLAAoJEPo9qoy8lh71QNIQAKlSAb0C4PUK67NaymL3w1FO
 RhCAsmAeJdODV2iK7RhsKMUSnlFPkCZ3urFJz7tHv7B3k09QkzBBSHQdQpF6mWyu
 +Rna7NehCO44k+NM7zTnd3dbL5dq8H7wWCLKambHHWC8/tcsD+DLcoKHsFwdcZwe
 WeWA38R09uH0z6idPYv2vyjMH+cHcAE86SOaeQ/8jgqDmlX8kkXRVPlHaowhUYio
 dap9gtiy/+OWOIVDRTV7uWKtKZwt3KTIafCpZxRuynqZWET0jurDAyGiUqvXTGvC
 0RCuDsKMAhiysLkXRFRRBAeo6Yj6TGsGCNyLec/SE8VZbOxpfUJx+INQg/oqOdSq
 xg3PvMtAJH7JiGxSHZZGAR7mxYuzmWW8dXMk/Fu7UZgkCCiLbjdDmytAY5T6qTHz
 4Nq5omczxpj38d3UbcljXmNwudmIEes82sIYOalN2nxCEXuycCWyQO0U0swmkaqH
 4ISSJY8Gej10QXO8mivv1q9y9BaOJfUeoqtyNR4zQAiulfjZIbtyLeX3FRKLTjI/
 50K2YCGm06GP9CHGefzVl3PBgeZ49NGegu+Z8sRfYxAac3R0V4QaE+40ygFaOHCY
 Gq5dvEKXgJeq2i3nanlKwZkr7kB0FF/XqaH9RSYO7TaoyOg8GgeZYcj1sGzJwplu
 c4lRcKRlzRShV2HymHFv
 =iHUZ
 -----END PGP SIGNATURE-----

Merge tag 'fbdev-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux

Pull fbdev updates from Tomi Valkeinen:

 - imxfb: fix lcd power up

 - small fixes and cleanups

* tag 'fbdev-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
  fbdev: Use IS_ENABLED() instead of checking for built-in or module
  efifb: Don't show the mapping VA
  video: AMBA CLCD: Remove unncessary include in amba-clcd.c
  fbdev: ssd1307fb: Fix charge pump setting
  Documentation: fb: fix spelling mistakes
  fbdev: fbmem: implement error handling in fbmem_init()
  fbdev: sh_mipi_dsi: remove driver
  video: fbdev: imxfb: add some error handling
  video: fbdev: imxfb: fix semantic of .get_power and .set_power
  video: fbdev: omap2: Remove deprecated regulator_can_change_voltage() usage
2016-05-20 11:01:02 -07:00
Linus Torvalds
e4fba88d00 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "Fix a regression that causes sha-mb to crash"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: sha1-mb - make sha1_x8_avx2() conform to C function ABI
2016-05-20 10:25:16 -07:00
Arnd Bergmann
ffd565e315 irqchip: nps: add 64BIT dependency
The newly added nps irqchip driver causes build warnings on ARM64.

  include/soc/nps/common.h: In function 'nps_host_reg_non_cl':
  include/soc/nps/common.h:148:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

As the driver is only used on ARC, we don't need to see it without
COMPILE_TEST elsewhere, and we can avoid the warnings by only building
on 32-bit architectures even with CONFIG_COMPILE_TEST.

Acked-by: Marc Zyngier <narc.zyngier@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 10:20:47 -07:00
Linus Torvalds
c04a588029 powerpc updates for 4.7
Highlights:
  - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
  - Live patching support for ppc64le (also merged via livepatching.git)
 
 Various cleanups & minor fixes from:
  - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
    Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
    Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
    Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
    Rothberg, Vipin K Parashar.
 
 General:
  - Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
  - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
  - Add support for userspace Power9 copy/paste from Chris Smart
  - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
  - Add mask of possible MMU features from Michael Ellerman
 
 PCI:
  - Enable pass through of NVLink to guests from Alexey Kardashevskiy
  - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
  - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
  - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
  - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
  - Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
 
 selftests:
  - Test cp_abort during context switch from Chris Smart
  - Add several tests for transactional memory support from Rashmica Gupta
 
 perf:
  - Add support for sampling interrupt register state from Anju T
  - Add support for unwinding perf-stackdump from Chandan Kumar
 
 cxl:
  - Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
  - Allow initialization on timebase sync failures from Frederic Barrat
  - Increase timeout for detection of AFU mmio hang from Frederic Barrat
  - Handle num_of_processes larger than can fit in the SPA from Ian Munsie
  - Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
  - Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
  - Check periodically the coherent platform function's state from Christophe Lombard
 
 Freescale:
  - Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
    workaround, and a kconfig dependency fix."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
 lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
 tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
 /AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
 iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
 /ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
 DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
 YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
 B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
 zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
 19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
 udQES+t3F/9gvtxgxtDe
 =YvyQ
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Highlights:
   - Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
   - Live patching support for ppc64le (also merged via livepatching.git)

  Various cleanups & minor fixes from:
   - Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
     Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
     Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
     Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
     Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
     Bauermann, Valentin Rothberg, Vipin K Parashar.

  General:
   - Update LMB associativity index during DLPAR add/remove from Nathan
     Fontenot
   - Fix branching to OOL handlers in relocatable kernel from Hari Bathini
   - Add support for userspace Power9 copy/paste from Chris Smart
   - Always use STRICT_MM_TYPECHECKS from Michael Ellerman
   - Add mask of possible MMU features from Michael Ellerman

  PCI:
   - Enable pass through of NVLink to guests from Alexey Kardashevskiy
   - Cleanups in preparation for powernv PCI hotplug from Gavin Shan
   - Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
   - Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
   - Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
     from Guilherme G Piccoli
   - Remove the dependency on EEH struct in DDW mechanism from Guilherme
     G Piccoli

  selftests:
   - Test cp_abort during context switch from Chris Smart
   - Add several tests for transactional memory support from Rashmica
     Gupta

  perf:
   - Add support for sampling interrupt register state from Anju T
   - Add support for unwinding perf-stackdump from Chandan Kumar

  cxl:
   - Configure the PSL for two CAPI ports on POWER8NVL from Philippe
     Bergheaud
   - Allow initialization on timebase sync failures from Frederic Barrat
   - Increase timeout for detection of AFU mmio hang from Frederic
     Barrat
   - Handle num_of_processes larger than can fit in the SPA from Ian
     Munsie
   - Ensure PSL interrupt is configured for contexts with no AFU IRQs
     from Ian Munsie
   - Add kernel API to allow a context to operate with relocate disabled
     from Ian Munsie
   - Check periodically the coherent platform function's state from
     Christophe Lombard

  Freescale:
   - Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
     an erratum workaround, and a kconfig dependency fix."

* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
  powerpc/86xx: Fix PCI interrupt map definition
  powerpc/86xx: Move pci1 definition to the include file
  powerpc/fsl: Fix build of the dtb embedded kernel images
  powerpc/fsl: Fix rcpm compatible string
  powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
  powerpc/fsl-pci: Add a workaround for PCI 5 errata
  powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
  powerpc/powernv/npu: Add PE to PHB's list
  powerpc/powernv: Fix insufficient memory allocation
  powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
  Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
  powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
  powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
  powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
  powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
  Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
  powerpc/powernv/npu: Enable NVLink pass through
  powerpc/powernv/npu: Rework TCE Kill handling
  powerpc/powernv/npu: Add set/unset window helpers
  powerpc/powernv/ioda2: Export debug helper pe_level_printk()
  ...
2016-05-20 10:12:41 -07:00
Linus Torvalds
a1c28b75a9 Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:
 "Changes included in this pull request:

   - revert pxa2xx-flash back to using ioremap_cached() and switch
     memremap() to use arch_memremap_wb()

   - remove pci=firmware command line argument handling

   - remove unnecessary arm_dma_set_mask() implementation, the generic
     implementation will do for ARM

   - removal of the ARM kallsyms "hack" to work around mode switching
     veneers and vectors located below PAGE_OFFSET

   - tidy up build system output a little

   - add L2 cache power management DT bindings

   - remove duplicated local_irq_disable() in reboot paths

   - handle AMBA primecell devices better at registration time with PM
     domains (needed for Samsung SoCs)

   - ARM specific preparation to support Keystone II kexec"

* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 8567/1: cache-uniphier: activate ways for secondary CPUs
  ARM: 8570/2: Documentation: devicetree: Add PL310 PM bindings
  ARM: 8569/1: pl2x0: Add OF control of cache power management
  ARM: 8568/1: reboot: remove duplicated local_irq_disable()
  ARM: 8566/1: drivers: amba: properly handle devices with power domains
  ARM: provide arm_has_idmap_alias() helper
  ARM: kexec: remove 512MB restriction on kexec crashdump
  ARM: provide improved virt_to_idmap() functionality
  ARM: kexec: fix crashkernel= handling
  ARM: 8557/1: specify install, zinstall, and uinstall as PHONY targets
  ARM: 8562/1: suppress "include/generated/mach-types.h is up to date."
  ARM: 8553/1: kallsyms: remove --page-offset command line option
  ARM: 8552/1: kallsyms: remove special lower address limit for CONFIG_ARM
  ARM: 8555/1: kallsyms: ignore ARM mode switching veneers
  ARM: 8548/1: dma-mapping: remove arm_dma_set_mask()
  ARM: 8554/1: kernel: pci: remove pci=firmware command line parameter handling
  ARM: memremap: implement arch_memremap_wb()
  memremap: add arch specific hook for MEMREMAP_WB mappings
  mtd: pxa2xx-flash: switch back from memremap to ioremap_cached
  ARM: reintroduce ioremap_cached() for creating cached I/O mappings
2016-05-20 10:01:38 -07:00
Linus Torvalds
a05a70db34 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - fsnotify fix

 - poll() timeout fix

 - a few scripts/ tweaks

 - debugobjects updates

 - the (small) ocfs2 queue

 - Minor fixes to kernel/padata.c

 - Maybe half of the MM queue

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
  mm, page_alloc: restore the original nodemask if the fast path allocation failed
  mm, page_alloc: uninline the bad page part of check_new_page()
  mm, page_alloc: don't duplicate code in free_pcp_prepare
  mm, page_alloc: defer debugging checks of pages allocated from the PCP
  mm, page_alloc: defer debugging checks of freed pages until a PCP drain
  cpuset: use static key better and convert to new API
  mm, page_alloc: inline pageblock lookup in page free fast paths
  mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
  mm, page_alloc: pull out side effects from free_pages_check
  mm, page_alloc: un-inline the bad part of free_pages_check
  mm, page_alloc: check multiple page fields with a single branch
  mm, page_alloc: remove field from alloc_context
  mm, page_alloc: avoid looking up the first zone in a zonelist twice
  mm, page_alloc: shortcut watermark checks for order-0 pages
  mm, page_alloc: reduce cost of fair zone allocation policy retry
  mm, page_alloc: shorten the page allocator fast path
  mm, page_alloc: check once if a zone has isolated pageblocks
  mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
  mm, page_alloc: simplify last cpupid reset
  mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
  ...
2016-05-19 20:00:06 -07:00
Mel Gorman
4741526b83 mm, page_alloc: restore the original nodemask if the fast path allocation failed
The page allocator fast path uses either the requested nodemask or
cpuset_current_mems_allowed if cpusets are enabled.  If the allocation
context allows watermarks to be ignored then it can also ignore memory
policies.  However, on entering the allocator slowpath the nodemask may
still be cpuset_current_mems_allowed and the policies are enforced.
This patch resets the nodemask appropriately before entering the
slowpath.

Link: http://lkml.kernel.org/r/20160504143628.GU2858@techsingularity.net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Vlastimil Babka
4e6118016e mm, page_alloc: uninline the bad page part of check_new_page()
Bad pages should be rare so the code handling them doesn't need to be
inline for performance reasons.  Put it to separate function which
returns void.  This also assumes that the initial page_expected_state()
result will match the result of the thorough check, i.e.  the page
doesn't become "good" in the meanwhile.  This matches the same
expectations already in place in free_pages_check().

!DEBUG_VM bloat-o-meter:

  add/remove: 1/0 grow/shrink: 0/1 up/down: 134/-274 (-140)
  function                                     old     new   delta
  check_new_page_bad                             -     134    +134
  get_page_from_freelist                      3468    3194    -274

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
e2769dbdc5 mm, page_alloc: don't duplicate code in free_pcp_prepare
The new free_pcp_prepare() function shares a lot of code with
free_pages_prepare(), which makes this a maintenance risk when some
future patch modifies only one of them.  We should be able to achieve
the same effect (skipping free_pages_check() from !DEBUG_VM configs) by
adding a parameter to free_pages_prepare() and making it inline, so the
checks (and the order != 0 parts) are eliminated from the call from
free_pcp_prepare().

!DEBUG_VM: bloat-o-meter reports no difference, as my gcc was already
inlining free_pages_prepare() and the elimination seems to work as
expected

DEBUG_VM bloat-o-meter:

  add/remove: 0/1 grow/shrink: 2/0 up/down: 1035/-778 (257)
  function                                     old     new   delta
  __free_pages_ok                              297    1060    +763
  free_hot_cold_page                           480     752    +272
  free_pages_prepare                           778       -    -778

Here inlining didn't occur before, and added some code, but it's ok for
a debug option.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
479f854a20 mm, page_alloc: defer debugging checks of pages allocated from the PCP
Every page allocated checks a number of page fields for validity.  This
catches corruption bugs of pages that are already freed but it is
expensive.  This patch weakens the debugging check by checking PCP pages
only when the PCP lists are being refilled.  All compound pages are
checked.  This potentially avoids debugging checks entirely if the PCP
lists are never emptied and refilled so some corruption issues may be
missed.  Full checking requires DEBUG_VM.

With the two deferred debugging patches applied, the impact to a page
allocator microbenchmark is

                                             4.6.0-rc3                  4.6.0-rc3
                                           inline-v3r6            deferalloc-v3r7
  Min      alloc-odr0-1               344.00 (  0.00%)           317.00 (  7.85%)
  Min      alloc-odr0-2               248.00 (  0.00%)           231.00 (  6.85%)
  Min      alloc-odr0-4               209.00 (  0.00%)           192.00 (  8.13%)
  Min      alloc-odr0-8               181.00 (  0.00%)           166.00 (  8.29%)
  Min      alloc-odr0-16              168.00 (  0.00%)           154.00 (  8.33%)
  Min      alloc-odr0-32              161.00 (  0.00%)           148.00 (  8.07%)
  Min      alloc-odr0-64              158.00 (  0.00%)           145.00 (  8.23%)
  Min      alloc-odr0-128             156.00 (  0.00%)           143.00 (  8.33%)
  Min      alloc-odr0-256             168.00 (  0.00%)           154.00 (  8.33%)
  Min      alloc-odr0-512             178.00 (  0.00%)           167.00 (  6.18%)
  Min      alloc-odr0-1024            186.00 (  0.00%)           174.00 (  6.45%)
  Min      alloc-odr0-2048            192.00 (  0.00%)           180.00 (  6.25%)
  Min      alloc-odr0-4096            198.00 (  0.00%)           184.00 (  7.07%)
  Min      alloc-odr0-8192            200.00 (  0.00%)           188.00 (  6.00%)
  Min      alloc-odr0-16384           201.00 (  0.00%)           188.00 (  6.47%)
  Min      free-odr0-1                189.00 (  0.00%)           180.00 (  4.76%)
  Min      free-odr0-2                132.00 (  0.00%)           126.00 (  4.55%)
  Min      free-odr0-4                104.00 (  0.00%)            99.00 (  4.81%)
  Min      free-odr0-8                 90.00 (  0.00%)            85.00 (  5.56%)
  Min      free-odr0-16                84.00 (  0.00%)            80.00 (  4.76%)
  Min      free-odr0-32                80.00 (  0.00%)            76.00 (  5.00%)
  Min      free-odr0-64                78.00 (  0.00%)            74.00 (  5.13%)
  Min      free-odr0-128               77.00 (  0.00%)            73.00 (  5.19%)
  Min      free-odr0-256               94.00 (  0.00%)            91.00 (  3.19%)
  Min      free-odr0-512              108.00 (  0.00%)           112.00 ( -3.70%)
  Min      free-odr0-1024             115.00 (  0.00%)           118.00 ( -2.61%)
  Min      free-odr0-2048             120.00 (  0.00%)           125.00 ( -4.17%)
  Min      free-odr0-4096             123.00 (  0.00%)           129.00 ( -4.88%)
  Min      free-odr0-8192             126.00 (  0.00%)           130.00 ( -3.17%)
  Min      free-odr0-16384            126.00 (  0.00%)           131.00 ( -3.97%)

Note that the free paths for large numbers of pages is impacted as the
debugging cost gets shifted into that path when the page data is no
longer necessarily cache-hot.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
4db7548ccb mm, page_alloc: defer debugging checks of freed pages until a PCP drain
Every page free checks a number of page fields for validity.  This
catches premature frees and corruptions but it is also expensive.  This
patch weakens the debugging check by checking PCP pages at the time they
are drained from the PCP list.  This will trigger the bug but the site
that freed the corrupt page will be lost.  To get the full context, a
kernel rebuild with DEBUG_VM is necessary.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Vlastimil Babka
002f290627 cpuset: use static key better and convert to new API
An important function for cpusets is cpuset_node_allowed(), which
optimizes on the fact if there's a single root CPU set, it must be
trivially allowed.  But the check "nr_cpusets() <= 1" doesn't use the
cpusets_enabled_key static key the right way where static keys eliminate
branching overhead with jump labels.

This patch converts it so that static key is used properly.  It's also
switched to the new static key API and the checking functions are
converted to return bool instead of int.  We also provide a new variant
__cpuset_zone_allowed() which expects that the static key check was
already done and they key was enabled.  This is needed for
get_page_from_freelist() where we want to also avoid the relatively
slower check when ALLOC_CPUSET is not set in alloc_flags.

The impact on the page allocator microbenchmark is less than expected
but the cleanup in itself is worthwhile.

                                             4.6.0-rc2                  4.6.0-rc2
                                       multcheck-v1r20               cpuset-v1r20
  Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
  Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
  Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
  Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
  Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
  Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
  Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
  Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
  Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
  Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
  Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
  Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
  Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
  Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
  Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
0b423ca22f mm, page_alloc: inline pageblock lookup in page free fast paths
The function call overhead of get_pfnblock_flags_mask() is measurable in
the page free paths.  This patch uses an inlined version that is faster.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
e5b31ac2ca mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
The original count is never reused so it can be removed.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
da838d4fcb mm, page_alloc: pull out side effects from free_pages_check
Check without side-effects should be easier to maintain.  It also
removes the duplicated cpupid and flags reset done in !DEBUG_VM variant
of both free_pcp_prepare() and then bulkfree_pcp_prepare().  Finally, it
enables the next patch.

It shouldn't result in new branches, thanks to inlining of the check.

!DEBUG_VM bloat-o-meter:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-27 (-27)
  function                                     old     new   delta
  __free_pages_ok                              748     739      -9
  free_pcppages_bulk                          1403    1385     -18

DEBUG_VM:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-28 (-28)
  function                                     old     new   delta
  free_pages_prepare                           806     778     -28

This is also slightly faster because cpupid information is not set on
tail pages so we can avoid resets there.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
bb552ac6c6 mm, page_alloc: un-inline the bad part of free_pages_check
From: Vlastimil Babka <vbabka@suse.cz>

!DEBUG_VM size and bloat-o-meter:

  add/remove: 1/0 grow/shrink: 0/2 up/down: 124/-370 (-246)
  function                                     old     new   delta
  free_pages_check_bad                           -     124    +124
  free_pcppages_bulk                          1288    1171    -117
  __free_pages_ok                              948     695    -253

DEBUG_VM:

  add/remove: 1/0 grow/shrink: 0/1 up/down: 124/-214 (-90)
  function                                     old     new   delta
  free_pages_check_bad                           -     124    +124
  free_pages_prepare                          1112     898    -214

[akpm@linux-foundation.org: fix whitespace]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
7bfec6f47b mm, page_alloc: check multiple page fields with a single branch
Every page allocated or freed is checked for sanity to avoid corruptions
that are difficult to detect later.  A bad page could be due to a number
of fields.  Instead of using multiple branches, this patch combines
multiple fields into a single branch.  A detailed check is only
necessary if that check fails.

                                             4.6.0-rc2                  4.6.0-rc2
                                        initonce-v1r20            multcheck-v1r20
  Min      alloc-odr0-1               359.00 (  0.00%)           348.00 (  3.06%)
  Min      alloc-odr0-2               260.00 (  0.00%)           254.00 (  2.31%)
  Min      alloc-odr0-4               214.00 (  0.00%)           213.00 (  0.47%)
  Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
  Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
  Min      alloc-odr0-32              165.00 (  0.00%)           166.00 ( -0.61%)
  Min      alloc-odr0-64              162.00 (  0.00%)           162.00 (  0.00%)
  Min      alloc-odr0-128             161.00 (  0.00%)           160.00 (  0.62%)
  Min      alloc-odr0-256             170.00 (  0.00%)           169.00 (  0.59%)
  Min      alloc-odr0-512             181.00 (  0.00%)           180.00 (  0.55%)
  Min      alloc-odr0-1024            190.00 (  0.00%)           188.00 (  1.05%)
  Min      alloc-odr0-2048            196.00 (  0.00%)           194.00 (  1.02%)
  Min      alloc-odr0-4096            202.00 (  0.00%)           199.00 (  1.49%)
  Min      alloc-odr0-8192            205.00 (  0.00%)           202.00 (  1.46%)
  Min      alloc-odr0-16384           205.00 (  0.00%)           203.00 (  0.98%)

Again, the benefit is marginal but avoiding excessive branches is
important.  Ideally the paths would not have to check these conditions
at all but regrettably abandoning the tests would make use-after-free
bugs much harder to detect.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
93ea9964d1 mm, page_alloc: remove field from alloc_context
The classzone_idx can be inferred from preferred_zoneref so remove the
unnecessary field and save stack space.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
c33d6c06f6 mm, page_alloc: avoid looking up the first zone in a zonelist twice
The allocator fast path looks up the first usable zone in a zonelist and
then get_page_from_freelist does the same job in the zonelist iterator.
This patch preserves the necessary information.

                                             4.6.0-rc2                  4.6.0-rc2
                                        fastmark-v1r20             initonce-v1r20
  Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
  Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
  Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
  Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
  Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
  Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
  Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
  Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
  Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
  Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
  Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
  Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
  Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
  Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
  Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)

The benefit is negligible and the results are within the noise but each
cycle counts.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
48ee5f3696 mm, page_alloc: shortcut watermark checks for order-0 pages
Watermarks have to be checked on every allocation including the number
of pages being allocated and whether reserves can be accessed.  The
reserves only matter if memory is limited and the free_pages adjustment
only applies to high-order pages.  This patch adds a shortcut for
order-0 pages that avoids numerous calculations if there is plenty of
free memory yielding the following performance difference in a page
allocator microbenchmark;

                                             4.6.0-rc2                  4.6.0-rc2
                                         optfair-v1r20             fastmark-v1r20
  Min      alloc-odr0-1               380.00 (  0.00%)           364.00 (  4.21%)
  Min      alloc-odr0-2               273.00 (  0.00%)           262.00 (  4.03%)
  Min      alloc-odr0-4               227.00 (  0.00%)           214.00 (  5.73%)
  Min      alloc-odr0-8               196.00 (  0.00%)           186.00 (  5.10%)
  Min      alloc-odr0-16              183.00 (  0.00%)           173.00 (  5.46%)
  Min      alloc-odr0-32              173.00 (  0.00%)           165.00 (  4.62%)
  Min      alloc-odr0-64              169.00 (  0.00%)           161.00 (  4.73%)
  Min      alloc-odr0-128             169.00 (  0.00%)           159.00 (  5.92%)
  Min      alloc-odr0-256             180.00 (  0.00%)           168.00 (  6.67%)
  Min      alloc-odr0-512             190.00 (  0.00%)           180.00 (  5.26%)
  Min      alloc-odr0-1024            198.00 (  0.00%)           190.00 (  4.04%)
  Min      alloc-odr0-2048            204.00 (  0.00%)           196.00 (  3.92%)
  Min      alloc-odr0-4096            209.00 (  0.00%)           202.00 (  3.35%)
  Min      alloc-odr0-8192            213.00 (  0.00%)           206.00 (  3.29%)
  Min      alloc-odr0-16384           214.00 (  0.00%)           206.00 (  3.74%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
305347550b mm, page_alloc: reduce cost of fair zone allocation policy retry
The fair zone allocation policy is not without cost but it can be
reduced slightly.  This patch removes an unnecessary local variable,
checks the likely conditions of the fair zone policy first, uses a bool
instead of a flags check and falls through when a remote node is
encountered instead of doing a full restart.  The benefit is marginal
but it's there

                                             4.6.0-rc2                  4.6.0-rc2
                                         decstat-v1r20              optfair-v1r20
  Min      alloc-odr0-1               377.00 (  0.00%)           380.00 ( -0.80%)
  Min      alloc-odr0-2               273.00 (  0.00%)           273.00 (  0.00%)
  Min      alloc-odr0-4               226.00 (  0.00%)           227.00 ( -0.44%)
  Min      alloc-odr0-8               196.00 (  0.00%)           196.00 (  0.00%)
  Min      alloc-odr0-16              183.00 (  0.00%)           183.00 (  0.00%)
  Min      alloc-odr0-32              175.00 (  0.00%)           173.00 (  1.14%)
  Min      alloc-odr0-64              172.00 (  0.00%)           169.00 (  1.74%)
  Min      alloc-odr0-128             170.00 (  0.00%)           169.00 (  0.59%)
  Min      alloc-odr0-256             183.00 (  0.00%)           180.00 (  1.64%)
  Min      alloc-odr0-512             191.00 (  0.00%)           190.00 (  0.52%)
  Min      alloc-odr0-1024            199.00 (  0.00%)           198.00 (  0.50%)
  Min      alloc-odr0-2048            204.00 (  0.00%)           204.00 (  0.00%)
  Min      alloc-odr0-4096            210.00 (  0.00%)           209.00 (  0.48%)
  Min      alloc-odr0-8192            213.00 (  0.00%)           213.00 (  0.00%)
  Min      alloc-odr0-16384           214.00 (  0.00%)           214.00 (  0.00%)

The benefit is marginal at best but one of the most important benefits,
avoiding a second search when falling back to another node is not
triggered by this particular test so the benefit for some corner cases
is understated.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
4fcb097117 mm, page_alloc: shorten the page allocator fast path
The page allocator fast path checks page multiple times unnecessarily.
This patch avoids all the slowpath checks if the first allocation
attempt succeeds.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
3777999dd4 mm, page_alloc: check once if a zone has isolated pageblocks
When bulk freeing pages from the per-cpu lists the zone is checked for
isolated pageblocks on every release.  This patch checks it once per
drain.

[mgorman@techsingularity.net: fix locking radce, per Vlastimil]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
83d4ca8148 mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
__GFP_HARDWALL only has meaning in the context of cpusets but the fast
path always applies the flag on the first attempt.  Move the
manipulations into the cpuset paths where they will be masked by a
static branch in the common case.

With the other micro-optimisations in this series combined, the impact
on a page allocator microbenchmark is

                                             4.6.0-rc2                  4.6.0-rc2
                                         decstat-v1r20                micro-v1r20
  Min      alloc-odr0-1               381.00 (  0.00%)           377.00 (  1.05%)
  Min      alloc-odr0-2               275.00 (  0.00%)           273.00 (  0.73%)
  Min      alloc-odr0-4               229.00 (  0.00%)           226.00 (  1.31%)
  Min      alloc-odr0-8               199.00 (  0.00%)           196.00 (  1.51%)
  Min      alloc-odr0-16              186.00 (  0.00%)           183.00 (  1.61%)
  Min      alloc-odr0-32              179.00 (  0.00%)           175.00 (  2.23%)
  Min      alloc-odr0-64              174.00 (  0.00%)           172.00 (  1.15%)
  Min      alloc-odr0-128             172.00 (  0.00%)           170.00 (  1.16%)
  Min      alloc-odr0-256             181.00 (  0.00%)           183.00 ( -1.10%)
  Min      alloc-odr0-512             193.00 (  0.00%)           191.00 (  1.04%)
  Min      alloc-odr0-1024            201.00 (  0.00%)           199.00 (  1.00%)
  Min      alloc-odr0-2048            206.00 (  0.00%)           204.00 (  0.97%)
  Min      alloc-odr0-4096            212.00 (  0.00%)           210.00 (  0.94%)
  Min      alloc-odr0-8192            215.00 (  0.00%)           213.00 (  0.93%)
  Min      alloc-odr0-16384           216.00 (  0.00%)           214.00 (  0.93%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
09940a4f1e mm, page_alloc: simplify last cpupid reset
The current reset unnecessarily clears flags and makes pointless
calculations.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
5bb1b16975 mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
page is guaranteed to be set before it is read with or without the
initialisation.

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
be06af002f mm, page_alloc: remove unnecessary initialisation in get_page_from_freelist
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
4dfa6cd8fd mm, page_alloc: remove unnecessary local variable in get_page_from_freelist
zonelist here is a copy of a struct field that is used once.  Ditch it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
fa379b9586 mm, page_alloc: convert nr_fair_skipped to bool
The number of zones skipped to a zone expiring its fair zone allocation
quota is irrelevant.  Convert to bool.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
c603844bdc mm, page_alloc: convert alloc_flags to unsigned
alloc_flags is a bitmask of flags but it is signed which does not
necessarily generate the best code depending on the compiler.  Even
without an impact, it makes more sense that this be unsigned.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
f75fb889d1 mm, page_alloc: avoid unnecessary zone lookups during pageblock operations
Pageblocks have an associated bitmap to store migrate types and whether
the pageblock should be skipped during compaction.  The bitmap may be
associated with a memory section or a zone but the zone is looked up
unconditionally.  The compiler should optimise this away automatically
so this is a cosmetic patch only in many cases.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
754078eb45 mm, page_alloc: use __dec_zone_state for order-0 page allocation
__dec_zone_state is cheaper to use for removing an order-0 page as it
has fewer conditions to check.

The performance difference on a page allocator microbenchmark is;

                                             4.6.0-rc2                  4.6.0-rc2
                                         optiter-v1r20              decstat-v1r20
  Min      alloc-odr0-1               382.00 (  0.00%)           381.00 (  0.26%)
  Min      alloc-odr0-2               282.00 (  0.00%)           275.00 (  2.48%)
  Min      alloc-odr0-4               233.00 (  0.00%)           229.00 (  1.72%)
  Min      alloc-odr0-8               203.00 (  0.00%)           199.00 (  1.97%)
  Min      alloc-odr0-16              188.00 (  0.00%)           186.00 (  1.06%)
  Min      alloc-odr0-32              182.00 (  0.00%)           179.00 (  1.65%)
  Min      alloc-odr0-64              177.00 (  0.00%)           174.00 (  1.69%)
  Min      alloc-odr0-128             175.00 (  0.00%)           172.00 (  1.71%)
  Min      alloc-odr0-256             184.00 (  0.00%)           181.00 (  1.63%)
  Min      alloc-odr0-512             197.00 (  0.00%)           193.00 (  2.03%)
  Min      alloc-odr0-1024            203.00 (  0.00%)           201.00 (  0.99%)
  Min      alloc-odr0-2048            209.00 (  0.00%)           206.00 (  1.44%)
  Min      alloc-odr0-4096            214.00 (  0.00%)           212.00 (  0.93%)
  Min      alloc-odr0-8192            218.00 (  0.00%)           215.00 (  1.38%)
  Min      alloc-odr0-16384           219.00 (  0.00%)           216.00 (  1.37%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
682a3385e7 mm, page_alloc: inline the fast path of the zonelist iterator
The page allocator iterates through a zonelist for zones that match the
addressing limitations and nodemask of the caller but many allocations
will not be restricted.  Despite this, there is always functional call
overhead which builds up.

This patch inlines the optimistic basic case and only calls the iterator
function for the complex case.  A hindrance was the fact that
cpuset_current_mems_allowed is used in the fastpath as the allowed
nodemask even though all nodes are allowed on most systems.  The patch
handles this by only considering cpuset_current_mems_allowed if a cpuset
exists.  As well as being faster in the fast-path, this removes some
junk in the slowpath.

The performance difference on a page allocator microbenchmark is;

                                             4.6.0-rc2                  4.6.0-rc2
                                      statinline-v1r20              optiter-v1r20
  Min      alloc-odr0-1               412.00 (  0.00%)           382.00 (  7.28%)
  Min      alloc-odr0-2               301.00 (  0.00%)           282.00 (  6.31%)
  Min      alloc-odr0-4               247.00 (  0.00%)           233.00 (  5.67%)
  Min      alloc-odr0-8               215.00 (  0.00%)           203.00 (  5.58%)
  Min      alloc-odr0-16              199.00 (  0.00%)           188.00 (  5.53%)
  Min      alloc-odr0-32              191.00 (  0.00%)           182.00 (  4.71%)
  Min      alloc-odr0-64              187.00 (  0.00%)           177.00 (  5.35%)
  Min      alloc-odr0-128             185.00 (  0.00%)           175.00 (  5.41%)
  Min      alloc-odr0-256             193.00 (  0.00%)           184.00 (  4.66%)
  Min      alloc-odr0-512             207.00 (  0.00%)           197.00 (  4.83%)
  Min      alloc-odr0-1024            213.00 (  0.00%)           203.00 (  4.69%)
  Min      alloc-odr0-2048            220.00 (  0.00%)           209.00 (  5.00%)
  Min      alloc-odr0-4096            226.00 (  0.00%)           214.00 (  5.31%)
  Min      alloc-odr0-8192            229.00 (  0.00%)           218.00 (  4.80%)
  Min      alloc-odr0-16384           229.00 (  0.00%)           219.00 (  4.37%)

perf indicated that next_zones_zonelist disappeared in the profile and
__next_zones_zonelist did not appear.  This is expected as the
micro-benchmark would hit the inlined fast-path every time.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
060e74173f mm, page_alloc: inline zone_statistics
zone_statistics has one call-site but it's a public function.  Make it
static and inline.

The performance difference on a page allocator microbenchmark is;

                                             4.6.0-rc2                  4.6.0-rc2
                                      statbranch-v1r20           statinline-v1r20
  Min      alloc-odr0-1               419.00 (  0.00%)           412.00 (  1.67%)
  Min      alloc-odr0-2               305.00 (  0.00%)           301.00 (  1.31%)
  Min      alloc-odr0-4               250.00 (  0.00%)           247.00 (  1.20%)
  Min      alloc-odr0-8               219.00 (  0.00%)           215.00 (  1.83%)
  Min      alloc-odr0-16              203.00 (  0.00%)           199.00 (  1.97%)
  Min      alloc-odr0-32              195.00 (  0.00%)           191.00 (  2.05%)
  Min      alloc-odr0-64              191.00 (  0.00%)           187.00 (  2.09%)
  Min      alloc-odr0-128             189.00 (  0.00%)           185.00 (  2.12%)
  Min      alloc-odr0-256             198.00 (  0.00%)           193.00 (  2.53%)
  Min      alloc-odr0-512             210.00 (  0.00%)           207.00 (  1.43%)
  Min      alloc-odr0-1024            216.00 (  0.00%)           213.00 (  1.39%)
  Min      alloc-odr0-2048            221.00 (  0.00%)           220.00 (  0.45%)
  Min      alloc-odr0-4096            227.00 (  0.00%)           226.00 (  0.44%)
  Min      alloc-odr0-8192            232.00 (  0.00%)           229.00 (  1.29%)
  Min      alloc-odr0-16384           232.00 (  0.00%)           229.00 (  1.29%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
b9f00e147f mm, page_alloc: reduce branches in zone_statistics
zone_statistics has more branches than it really needs to take an
unlikely GFP flag into account.  Reduce the number and annotate the
unlikely flag.

The performance difference on a page allocator microbenchmark is;

                                             4.6.0-rc2                  4.6.0-rc2
                                      nocompound-v1r10           statbranch-v1r10
  Min      alloc-odr0-1               417.00 (  0.00%)           419.00 ( -0.48%)
  Min      alloc-odr0-2               308.00 (  0.00%)           305.00 (  0.97%)
  Min      alloc-odr0-4               253.00 (  0.00%)           250.00 (  1.19%)
  Min      alloc-odr0-8               221.00 (  0.00%)           219.00 (  0.90%)
  Min      alloc-odr0-16              205.00 (  0.00%)           203.00 (  0.98%)
  Min      alloc-odr0-32              199.00 (  0.00%)           195.00 (  2.01%)
  Min      alloc-odr0-64              193.00 (  0.00%)           191.00 (  1.04%)
  Min      alloc-odr0-128             191.00 (  0.00%)           189.00 (  1.05%)
  Min      alloc-odr0-256             200.00 (  0.00%)           198.00 (  1.00%)
  Min      alloc-odr0-512             212.00 (  0.00%)           210.00 (  0.94%)
  Min      alloc-odr0-1024            219.00 (  0.00%)           216.00 (  1.37%)
  Min      alloc-odr0-2048            225.00 (  0.00%)           221.00 (  1.78%)
  Min      alloc-odr0-4096            231.00 (  0.00%)           227.00 (  1.73%)
  Min      alloc-odr0-8192            234.00 (  0.00%)           232.00 (  0.85%)
  Min      alloc-odr0-16384           234.00 (  0.00%)           232.00 (  0.85%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00