forked from Minki/linux
f183324133
From Roman ("percpu: partial chunk depopulation"): In our [Facebook] production experience the percpu memory allocator is sometimes struggling with returning the memory to the system. A typical example is a creation of several thousands memory cgroups (each has several chunks of the percpu data used for vmstats, vmevents, ref counters etc). Deletion and complete releasing of these cgroups doesn't always lead to a shrinkage of the percpu memory, so that sometimes there are several GB's of memory wasted. The underlying problem is the fragmentation: to release an underlying chunk all percpu allocations should be released first. The percpu allocator tends to top up chunks to improve the utilization. It means new small-ish allocations (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, effectively pinning them in memory. This patchset solves this problem by implementing a partial depopulation of percpu chunks: chunks with many empty pages are being asynchronously depopulated and the pages are returned to the system. To illustrate the problem the following script can be used: -- cd /sys/fs/cgroup mkdir percpu_test echo "+memory" > percpu_test/cgroup.subtree_control cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do mkdir percpu_test/cg_"${i}" for j in `seq 1 10`; do mkdir percpu_test/cg_"${i}"_"${j}" done done cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do for j in `seq 1 10`; do rmdir percpu_test/cg_"${i}"_"${j}" done done sleep 10 cat /proc/meminfo | grep Percpu for i in `seq 1 1000`; do rmdir percpu_test/cg_"${i}" done rmdir percpu_test -- It creates 11000 memory cgroups and removes every 10 out of 11. It prints the initial size of the percpu memory, the size after creating all cgroups and the size after deleting most of them. Results: vanilla: ./percpu_test.sh Percpu: 7488 kB Percpu: 481152 kB Percpu: 481152 kB with this patchset applied: ./percpu_test.sh Percpu: 7488 kB Percpu: 481408 kB Percpu: 135552 kB The total size of the percpu memory was reduced by more than 3.5 times. This patch: This patch implements partial depopulation of percpu chunks. As of now, a chunk can be depopulated only as a part of the final destruction, if there are no more outstanding allocations. However to minimize a memory waste it might be useful to depopulate a partially filed chunk, if a small number of outstanding allocations prevents the chunk from being fully reclaimed. This patch implements the following depopulation process: it scans over the chunk pages, looks for a range of empty and populated pages and performs the depopulation. To avoid races with new allocations, the chunk is previously isolated. After the depopulation the chunk is sidelined to a special list or freed. New allocations prefer using active chunks to sidelined chunks. If a sidelined chunk is used, it is reintegrated to the active lists. The depopulation is scheduled on the free path if the chunk is all of the following: 1) has more than 1/4 of total pages free and populated 2) the system has enough free percpu pages aside of this chunk 3) isn't the reserved chunk 4) isn't the first chunk If it's already depopulated but got free populated pages, it's a good target too. The chunk is moved to a special slot, pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work item is scheduled. On isolation, these pages are removed from the pcpu_nr_empty_pop_pages. It is constantly replaced to the to_depopulate_slot when it meets these qualifications. pcpu_reclaim_populated() iterates over the to_depopulate_slot until it becomes empty. The depopulation is performed in the reverse direction to keep populated pages close to the beginning. Depopulated chunks are sidelined to preferentially avoid them for new allocations. When no active chunk can suffice a new allocation, sidelined chunks are first checked before creating a new chunk. Signed-off-by: Roman Gushchin <guro@fb.com> Co-developed-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Tested-by: Pratik Sampat <psampat@linux.ibm.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
126 lines
3.1 KiB
C
126 lines
3.1 KiB
C
// SPDX-License-Identifier: GPL-2.0-only
|
|
/*
|
|
* mm/percpu-km.c - kernel memory based chunk allocation
|
|
*
|
|
* Copyright (C) 2010 SUSE Linux Products GmbH
|
|
* Copyright (C) 2010 Tejun Heo <tj@kernel.org>
|
|
*
|
|
* Chunks are allocated as a contiguous kernel memory using gfp
|
|
* allocation. This is to be used on nommu architectures.
|
|
*
|
|
* To use percpu-km,
|
|
*
|
|
* - define CONFIG_NEED_PER_CPU_KM from the arch Kconfig.
|
|
*
|
|
* - CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK must not be defined. It's
|
|
* not compatible with PER_CPU_KM. EMBED_FIRST_CHUNK should work
|
|
* fine.
|
|
*
|
|
* - NUMA is not supported. When setting up the first chunk,
|
|
* @cpu_distance_fn should be NULL or report all CPUs to be nearer
|
|
* than or at LOCAL_DISTANCE.
|
|
*
|
|
* - It's best if the chunk size is power of two multiple of
|
|
* PAGE_SIZE. Because each chunk is allocated as a contiguous
|
|
* kernel memory block using alloc_pages(), memory will be wasted if
|
|
* chunk size is not aligned. percpu-km code will whine about it.
|
|
*/
|
|
|
|
#if defined(CONFIG_SMP) && defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK)
|
|
#error "contiguous percpu allocation is incompatible with paged first chunk"
|
|
#endif
|
|
|
|
#include <linux/log2.h>
|
|
|
|
static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
|
|
int page_start, int page_end, gfp_t gfp)
|
|
{
|
|
return 0;
|
|
}
|
|
|
|
static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
|
|
int page_start, int page_end)
|
|
{
|
|
/* nada */
|
|
}
|
|
|
|
static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type,
|
|
gfp_t gfp)
|
|
{
|
|
const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
|
|
struct pcpu_chunk *chunk;
|
|
struct page *pages;
|
|
unsigned long flags;
|
|
int i;
|
|
|
|
chunk = pcpu_alloc_chunk(type, gfp);
|
|
if (!chunk)
|
|
return NULL;
|
|
|
|
pages = alloc_pages(gfp, order_base_2(nr_pages));
|
|
if (!pages) {
|
|
pcpu_free_chunk(chunk);
|
|
return NULL;
|
|
}
|
|
|
|
for (i = 0; i < nr_pages; i++)
|
|
pcpu_set_page_chunk(nth_page(pages, i), chunk);
|
|
|
|
chunk->data = pages;
|
|
chunk->base_addr = page_address(pages);
|
|
|
|
spin_lock_irqsave(&pcpu_lock, flags);
|
|
pcpu_chunk_populated(chunk, 0, nr_pages);
|
|
spin_unlock_irqrestore(&pcpu_lock, flags);
|
|
|
|
pcpu_stats_chunk_alloc();
|
|
trace_percpu_create_chunk(chunk->base_addr);
|
|
|
|
return chunk;
|
|
}
|
|
|
|
static void pcpu_destroy_chunk(struct pcpu_chunk *chunk)
|
|
{
|
|
const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
|
|
|
|
if (!chunk)
|
|
return;
|
|
|
|
pcpu_stats_chunk_dealloc();
|
|
trace_percpu_destroy_chunk(chunk->base_addr);
|
|
|
|
if (chunk->data)
|
|
__free_pages(chunk->data, order_base_2(nr_pages));
|
|
pcpu_free_chunk(chunk);
|
|
}
|
|
|
|
static struct page *pcpu_addr_to_page(void *addr)
|
|
{
|
|
return virt_to_page(addr);
|
|
}
|
|
|
|
static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai)
|
|
{
|
|
size_t nr_pages, alloc_pages;
|
|
|
|
/* all units must be in a single group */
|
|
if (ai->nr_groups != 1) {
|
|
pr_crit("can't handle more than one group\n");
|
|
return -EINVAL;
|
|
}
|
|
|
|
nr_pages = (ai->groups[0].nr_units * ai->unit_size) >> PAGE_SHIFT;
|
|
alloc_pages = roundup_pow_of_two(nr_pages);
|
|
|
|
if (alloc_pages > nr_pages)
|
|
pr_warn("wasting %zu pages per chunk\n",
|
|
alloc_pages - nr_pages);
|
|
|
|
return 0;
|
|
}
|
|
|
|
static bool pcpu_should_reclaim_chunk(struct pcpu_chunk *chunk)
|
|
{
|
|
return false;
|
|
}
|