For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if memory
is fragmented. The Linux kernel currently does on-demand compaction as we
request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that the kernel is able to restore a highly
fragmented memory state to a fairly compacted state in under 1 second on a
32G system. Such data suggests that a more proactive compaction can help
us allocate a large fraction of memory as hugepages while keeping
allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness', which dictates the bounds
on external fragmentation that kcompactd tries to maintain.
The tunable takes a value in the range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds: low = 100 - proactiveness, and
high = low + 10 (both on the same 0-100 scale as the score). The score
for a node is defined as the weighted mean of per-zone external
fragmentation, where a zone's present_pages determines its weight.
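As a concrete illustration, here is a minimal userspace model of the
scoring and threshold arithmetic (the struct and helper names are
simplified stand-ins, not the patch's actual kernel code):

#include <stdio.h>

/* Simplified stand-in for the per-zone data the kernel consults. */
struct zone_info {
	unsigned long present_pages;	/* weight of this zone */
	unsigned int extfrag;		/* external fragmentation, 0..100 */
};

/* Node score: mean of per-zone extfrag, weighted by present_pages. */
static unsigned int node_score(const struct zone_info *z, int nr)
{
	unsigned long pages = 0, sum = 0;
	int i;

	for (i = 0; i < nr; i++) {
		pages += z[i].present_pages;
		sum += z[i].present_pages * z[i].extfrag;
	}
	return pages ? sum / pages : 0;
}

int main(void)
{
	/* A small lightly-fragmented zone, a large badly-fragmented one. */
	struct zone_info zones[] = {
		{ .present_pages = 1UL << 20, .extfrag = 10 },
		{ .present_pages = 8UL << 20, .extfrag = 85 },
	};
	unsigned int proactiveness = 20;
	unsigned int low = 100 - proactiveness;		/* 80 */
	unsigned int high = low + 10;			/* 90 */
	unsigned int score = node_score(zones, 2);

	printf("score=%u low=%u high=%u -> %s\n", score, low, high,
	       score > high ? "compact" : "leave alone");
	return 0;
}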
To periodically check per-node scores, we reuse the per-node kcompactd
threads, which are woken up every 500 milliseconds to perform this check.
If a node's score exceeds its high threshold (as derived from the
user-provided proactiveness value), proactive compaction is started and
runs until the score drops to the low threshold. By default, proactiveness
is set to 20, which implies threshold values of low=80 and high=90.
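In toy form, this control loop behaves like the following self-contained
simulation (the 5-points-per-pass progress rate is invented purely to make
the loop visible; a real pass does actual page migration):

#include <stdio.h>

#define PROACTIVENESS	20
#define LOW		(100 - PROACTIVENESS)	/* 80 */
#define HIGH		(LOW + 10)		/* 90 */

int main(void)
{
	unsigned int score = 95;	/* node starts badly fragmented */
	int compacting = 0, tick;

	/* One iteration per 500 ms kcompactd wakeup. */
	for (tick = 0; tick < 8; tick++) {
		if (!compacting && score > HIGH)
			compacting = 1;		/* crossed the high threshold */
		if (compacting) {
			score -= 5;		/* stand-in for a compaction pass */
			if (score <= LOW)
				compacting = 0;	/* reached the low threshold */
		}
		printf("t=%4dms score=%u %s\n", tick * 500, score,
		       compacting ? "compacting" : "idle");
	}
	return 0;
}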
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented by a userspace
program that allocates all memory and then, for each 2M-aligned section,
frees 3/4 of the base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
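The fragmenter itself is not part of this patch; a minimal sketch of the
idea (the mapping size here is illustrative, and a real run would size it
to nearly all free RAM):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define MB		(1024UL * 1024)
#define HPAGE_SIZE	(2 * MB)

int main(void)
{
	size_t len = 1024 * MB;		/* illustrative; use ~all free RAM */
	char *buf = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *base;
	size_t off;

	if (buf == MAP_FAILED)
		return 1;
	/* Round up so every section starts on a 2M boundary. */
	base = (char *)(((uintptr_t)buf + HPAGE_SIZE - 1) &
			~(uintptr_t)(HPAGE_SIZE - 1));
	memset(base, 1, len);		/* fault all base pages in */

	/* Free the first 3/4 of the base pages in each 2M section. */
	for (off = 0; off + HPAGE_SIZE <= len; off += HPAGE_SIZE)
		munmap(base + off, 3 * HPAGE_SIZE / 4);

	pause();	/* keep the remaining quarter-sections mapped */
	return 0;
}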
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency. All latency
values below are in microseconds.
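The driver is not included in the patch; a rough sketch of what its
measurement loop could look like (the module and function names are
assumptions; order 9 corresponds to 2M pages on x86_64):

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/ktime.h>

static int __init hpalloc_test_init(void)
{
	for (;;) {
		ktime_t t0 = ktime_get();
		/* order 9 == 2M on x86_64; may enter direct compaction */
		struct page *page = alloc_pages(GFP_KERNEL | __GFP_NOWARN, 9);
		s64 us = ktime_us_delta(ktime_get(), t0);

		if (!page)
			break;		/* no more hugepages available */
		pr_info("2M allocation took %lld us\n", us);
		/* A real driver would record 'us' and free the pages on exit. */
	}
	return 0;
}
module_init(hpalloc_test_init);
MODULE_LICENSE("GPL");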
- With vanilla 5.6.0-rc3
percentile latency
---------- -------
5 7894
10 9496
25 12561
30 15295
40 18244
50 21229
60 27556
75 30147
80 31047
90 32859
95 33799
Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
---------- -------
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. Java heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
/usr/bin/time java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15 as proactiveness is increased further.
Note that proactive compaction happens throughout the runtime of these
workloads. A one-time compaction, sufficient to supply hugepages for the
following allocation stream, can probably happen at more extreme
proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node score of 80 or higher, depending on the delay between the
fragmentation step and the start of the benchmark, which leaves more or
less time for the initial round of compaction. As the benchmark consumes
hugepages, the node's score quickly rises above the high threshold (90)
and proactive compaction starts again, which brings the score back down to
the low threshold (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. A kcompactd thread consumes 100% of one
CPU while it tries to bring its node's score within thresholds.
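For reference, the runs can be counted with a kprobe on the compaction
entry point (assuming the symbol is visible to kprobes on the running
kernel):

sudo bpftrace -e 'kprobe:proactive_compact_node { @runs = count(); }'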
Backoff behavior
================
The above workloads produce a memory state that is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of
  times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries); see the sketch below.
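As a toy model of this backoff (COMPACT_MAX_DEFER_SHIFT = 6 is an
assumption here; skipping 64 of the 500 ms wakeups yields the ~30 second
gap):

#include <stdio.h>

#define CHECK_INTERVAL_MS	500
#define MAX_DEFER_SHIFT		6	/* assumed; 64 skipped wakeups */

int main(void)
{
	unsigned int defer = 0, tick;

	for (tick = 0; tick < 140; tick++) {	/* 70 s worth of wakeups */
		if (defer) {
			defer--;		/* skip this wakeup */
			continue;
		}
		/* Unmovable pages keep the score high: no progress. */
		printf("t=%5ums: compaction attempted, no progress\n",
		       tick * CHECK_INTERVAL_MS);
		defer = 1 << MAX_DEFER_SHIFT;	/* ~32 s backoff */
	}
	return 0;
}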
[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/linux/compaction.h:

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H

/*
 * Determines how hard direct compaction should try to succeed.
 * Lower value means higher priority, analogous to reclaim priority.
 */
enum compact_priority {
	COMPACT_PRIO_SYNC_FULL,
	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
	COMPACT_PRIO_SYNC_LIGHT,
	MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};

/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
	/* For more detailed tracepoint output - internal to compaction */
	COMPACT_NOT_SUITABLE_ZONE,
	/*
	 * compaction didn't start as it was not possible or direct reclaim
	 * was more suitable
	 */
	COMPACT_SKIPPED,
	/* compaction didn't start as it was deferred due to past failures */
	COMPACT_DEFERRED,

	/* compaction not active last round */
	COMPACT_INACTIVE = COMPACT_DEFERRED,

	/* For more detailed tracepoint output - internal to compaction */
	COMPACT_NO_SUITABLE_PAGE,
	/* compaction should continue to another pageblock */
	COMPACT_CONTINUE,

	/*
	 * The full zone was compacted and scanned but wasn't successful in
	 * compacting suitable pages.
	 */
	COMPACT_COMPLETE,
	/*
	 * direct compaction has scanned part of the zone but wasn't successful
	 * in compacting suitable pages.
	 */
	COMPACT_PARTIAL_SKIPPED,

	/* compaction terminated prematurely due to lock contention */
	COMPACT_CONTENDED,

	/*
	 * direct compaction terminated after concluding that the allocation
	 * should now succeed
	 */
	COMPACT_SUCCESS,
};

struct alloc_context; /* in mm/internal.h */

/*
 * Number of free order-0 pages that should be available above the given
 * watermark to make sure compaction has a reasonable chance of not running
 * out of free pages that it needs to isolate as migration targets during
 * its work.
 */
static inline unsigned long compact_gap(unsigned int order)
{
	/*
	 * Although all the isolations for migration are temporary, the
	 * compaction free scanner may have up to 1 << order pages on its
	 * list and then try to split an (order - 1) free page. At that
	 * point, a gap of 1 << order might not be enough, so it's safer to
	 * require twice that amount. Note that the number of pages on the
	 * list is also effectively limited by COMPACT_CLUSTER_MAX, as that's
	 * the maximum the migrate scanner can have isolated on the migrate
	 * list, and the free scanner is only invoked when the number of
	 * isolated free pages is lower than that. But it's not worth
	 * complicating the formula here, as a bigger gap than strictly
	 * necessary for higher orders can also improve the chances of
	 * compaction success.
	 */
	return 2UL << order;
}

#ifdef CONFIG_COMPACTION
extern int sysctl_compact_memory;
extern int sysctl_compaction_proactiveness;
extern int sysctl_compaction_handler(struct ctl_table *table, int write,
		void *buffer, size_t *length, loff_t *ppos);
extern int sysctl_extfrag_threshold;
extern int sysctl_compact_unevictable_allowed;

extern int extfrag_for_order(struct zone *zone, unsigned int order);
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
		unsigned int order, unsigned int alloc_flags,
		const struct alloc_context *ac, enum compact_priority prio,
		struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern enum compact_result compaction_suitable(struct zone *zone, int order,
		unsigned int alloc_flags, int highest_zoneidx);

extern void defer_compaction(struct zone *zone, int order);
extern bool compaction_deferred(struct zone *zone, int order);
extern void compaction_defer_reset(struct zone *zone, int order,
		bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);

/* Compaction has made some progress and retrying makes sense */
static inline bool compaction_made_progress(enum compact_result result)
{
	/*
	 * Even though this might sound confusing, this in fact tells us
	 * that the compaction successfully isolated and migrated some
	 * pageblocks.
	 */
	if (result == COMPACT_SUCCESS)
		return true;

	return false;
}

/* Compaction has failed and it doesn't make much sense to keep retrying. */
static inline bool compaction_failed(enum compact_result result)
{
	/* All zones were scanned completely and still no result. */
	if (result == COMPACT_COMPLETE)
		return true;

	return false;
}

/* Compaction needs reclaim to be performed first, so it can continue. */
static inline bool compaction_needs_reclaim(enum compact_result result)
{
	/*
	 * Compaction backed off due to watermark checks for order-0,
	 * so the regular reclaim has to try harder and reclaim something.
	 */
	if (result == COMPACT_SKIPPED)
		return true;

	return false;
}

/*
 * Compaction has backed off for some reason after doing some work or none
 * at all. It might be throttling or lock contention. Retrying might still
 * be worthwhile, but with a higher priority if allowed.
 */
static inline bool compaction_withdrawn(enum compact_result result)
{
	/*
	 * If compaction is deferred for high-order allocations, it is
	 * because sync compaction recently failed. If this is the case
	 * and the caller requested a THP allocation, we do not want
	 * to heavily disrupt the system, so we fail the allocation
	 * instead of entering direct reclaim.
	 */
	if (result == COMPACT_DEFERRED)
		return true;

	/*
	 * If compaction in async mode encounters contention or blocks a
	 * higher-priority task, we back off early rather than cause stalls.
	 */
	if (result == COMPACT_CONTENDED)
		return true;

	/*
	 * The page scanners have met but we haven't scanned full zones,
	 * so this is in fact a back-off.
	 */
	if (result == COMPACT_PARTIAL_SKIPPED)
		return true;

	return false;
}

bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
		int alloc_flags);

extern int kcompactd_run(int nid);
extern void kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int highest_zoneidx);

#else
static inline void reset_isolation_suitable(pg_data_t *pgdat)
{
}

static inline enum compact_result compaction_suitable(struct zone *zone,
		int order, int alloc_flags, int highest_zoneidx)
{
	return COMPACT_SKIPPED;
}

static inline void defer_compaction(struct zone *zone, int order)
{
}

static inline bool compaction_deferred(struct zone *zone, int order)
{
	return true;
}

static inline bool compaction_made_progress(enum compact_result result)
{
	return false;
}

static inline bool compaction_failed(enum compact_result result)
{
	return false;
}

static inline bool compaction_needs_reclaim(enum compact_result result)
{
	return false;
}

static inline bool compaction_withdrawn(enum compact_result result)
{
	return true;
}

static inline int kcompactd_run(int nid)
{
	return 0;
}

static inline void kcompactd_stop(int nid)
{
}

static inline void wakeup_kcompactd(pg_data_t *pgdat,
		int order, int highest_zoneidx)
{
}

#endif /* CONFIG_COMPACTION */

struct node;
#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
extern int compaction_register_node(struct node *node);
extern void compaction_unregister_node(struct node *node);

#else

static inline int compaction_register_node(struct node *node)
{
	return 0;
}

static inline void compaction_unregister_node(struct node *node)
{
}
#endif /* CONFIG_COMPACTION && CONFIG_SYSFS && CONFIG_NUMA */

#endif /* _LINUX_COMPACTION_H */