linux

mainlining shenanigans

Muchun Song 6a6b7b77cc mm: list_lru: transpose the array of per-node per-memcg lru lists	2022-03-22 15:57:03 -07:00

Patch series "Optimize list lru memory consumption", v6.

On our server, we found a suspected memory leak. The kmalloc-32 cache
consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
of memory.

After in-depth analysis, we found that list_lru_one allocations are the
cause of the kmalloc-32 slab cache's memory consumption.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumption of each
list_lru can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 numa nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount registers 2 list lrus, one for inodes and one for dentries.
There are 952 super_blocks. So the total memory is 952 * 2 * 3 MB (~5.6GB).
But now the number of memory cgroups is less than 500. So I guess more than
12286 memory cgroups have been created on this machine (I do not know why
there are so many cgroups, it may be a user's bug or the user really wants
to do that). Because memcg_nr_cache_ids has not been reduced to a suitable
value, a lot of memory is wasted. If we want to reduce memcg_nr_cache_ids,
we have to reboot the server. This is not what we want.

In order to reduce memcg_nr_cache_ids, I had posted a patchset [1] to do
this. But this did not fundamentally solve the problem.

We currently allocate scope for every memcg to be able to be tracked on
every superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where memcgs are confined
to just a small subset of the total number of superblocks that are
instantiated at any given point in time.

For these systems with huge container counts, list_lru does not need the
capability of tracking every memcg on every superblock.

What it comes down to is that the list_lru is only needed for a given memcg
if that memcg is instantiating and freeing objects on a given list_lru.

As Dave said, "Which makes me think we should be moving more towards 'add
the memcg to the list_lru at the first insert' model rather than
'instantiate all at memcg init time just in case'."

This patchset aims to optimize the list lru memory consumption from
different aspects.

I did a simple test to show the optimization. I created 10k memory cgroups
and mounted 10k filesystems on the system. We use the free command to show
how much memory the system consumes after this operation (there are 2 numa
nodes in the system).

  +-----------------------+------------------------+
  |       condition       |   memory consumption   |
  +-----------------------+------------------------+
  | without this patchset |        24464 MB        |
  +-----------------------+------------------------+
  |     after patch 1     |        21957 MB        | <--------+
  +-----------------------+------------------------+          |
  |     after patch 10    |         6895 MB        |          |
  +-----------------------+------------------------+          |
  |     after patch 12    |         4367 MB        |          |
  +-----------------------+------------------------+          |
                                                              |
          The more the number of nodes, the more obvious the effect---+

BTW, there was a recent discussion [2] on the same issue.

[1] https://lore.kernel.org/all/20210428094949.43579-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/all/20210405054848.GA1077931@in.ibm.com/

This series not only optimizes the memory usage of list_lru but also
simplifies the code.

This patch (of 16):

The current scheme of maintaining per-node per-memcg lru lists looks like:

  struct list_lru {
    struct list_lru_node *node;           (for each node)
      struct list_lru_memcg *memcg_lrus;
        struct list_lru_one *lru[];       (for each memcg)
  }

By effectively transposing the two-dimension array of list_lru_one's
structures (per-node per-memcg => per-memcg per-node) it's possible to save
some memory and simplify alloc/dealloc paths. The new scheme looks like:

  struct list_lru {
    struct list_lru_memcg *mlrus;
      struct list_lru_per_memcg *mlru[];  (for each memcg)
        struct list_lru_one node[0];      (for each node)
  }

Memory savings come not only from 'struct rcu_head' but also from some
pointer arrays used to store the pointer to 'struct list_lru_one'. The
array is per node and its size is 8 (a pointer) * num_memcgs. So the total
size of the arrays is 8 * num_nodes * memcg_nr_cache_ids. After this patch,
the size becomes 8 * memcg_nr_cache_ids.

Link: https://lkml.kernel.org/r/20220228122126.37293-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220228122126.37293-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch	Fix for the SLS mitigation, which makes a "SETcc/RET" pair grow to	2022-03-20 09:46:52 -07:00
block	block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"	2022-03-22 15:57:01 -07:00
certs	certs: Fix build error when CONFIG_MODULE_SIG_KEY is empty	2022-01-23 00:08:44 +09:00
crypto	crypto: af_alg - get rid of alg_memory_allocated	2022-02-15 14:29:04 +00:00
Documentation	mm/memcg: disable threshold event handlers on PREEMPT_RT	2022-03-22 15:57:02 -07:00
drivers	remove bdi_congested() and wb_congested() and related functions	2022-03-22 15:57:01 -07:00
fs	mm: fs: fix lru_cache_disabled race in bh_lru	2022-03-22 15:57:01 -07:00
include	mm: list_lru: transpose the array of per-node per-memcg lru lists	2022-03-22 15:57:03 -07:00
init	lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()	2022-01-22 08:33:37 +02:00
ipc	ipc/sem: do not sleep with a spin lock held	2022-02-04 09:25:05 -08:00
kernel	configs/debug: restore DEBUG_INFO=y for overriding	2022-03-17 11:02:13 -07:00
lib	ARM further fixes for 5.17-rc:	2022-03-02 16:11:56 -08:00
LICENSES	LICENSES/LGPL-2.1: Add LGPL-2.1-or-later as valid identifiers	2021-12-16 14:33:10 +01:00
mm	mm: list_lru: transpose the array of per-node per-memcg lru lists	2022-03-22 15:57:03 -07:00
net	net: dsa: Add missing of_node_put() in dsa_port_parse_of	2022-03-17 13:13:27 +01:00
samples	samples/seccomp: Adjust sample to also provide kill option	2022-02-10 19:09:12 -08:00
scripts	scripts/spelling.txt: add more spellings to spelling.txt	2022-03-22 15:57:00 -07:00
security	selinux/stable-5.17 PR 20220223	2022-02-23 17:19:55 -08:00
sound	ALSA: intel_hdmi: Fix reference to PCM buffer address	2022-03-02 09:25:37 +01:00
tools	selftests: memcg: test high limit for single entry allocation	2022-03-22 15:57:02 -07:00
usr	kbuild: remove include/linux/cyclades.h from header file check	2022-01-27 08:51:08 +01:00
virt	KVM: Fix lockdep false negative during host resume	2022-02-17 09:52:50 -05:00
.clang-format	genirq/msi: Make interrupt allocation less convoluted	2021-12-16 22:22:20 +01:00
.cocciconfig
.get_maintainer.ignore	Opt out of scripts/get_maintainer.pl	2019-05-16 10:53:40 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	.gitignore: ignore only top-level modules.builtin	2021-05-02 00:43:35 +09:00
.mailmap	MAINTAINERS: Update Jisheng's email address	2022-03-08 17:30:32 +01:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: replace a Microchip AT91 maintainer	2022-02-09 11:30:01 +01:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	Add Paolo Abeni to networking maintainers	2022-03-15 12:16:10 -07:00
Makefile	Linux 5.17	2022-03-20 13:14:17 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.