forked from Minki/linux
6edf2576a6
kmalloc's API family is critical for mm, with one nature that it will round up the request size to a fixed one (mostly power of 2). Say when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes could be allocated, so in worst case, there is around 50% memory space waste. The wastage is not a big issue for requests that get allocated/freed quickly, but may cause problems with objects that have longer life time. We've met a kernel boot OOM panic (v5.10), and from the dumped slab info: [ 26.062145] kmalloc-2k 814056KB 814056KB From debug we found there are huge number of 'struct iova_magazine', whose size is 1032 bytes (1024 + 8), so each allocation will waste 1016 bytes. Though the issue was solved by giving the right (bigger) size of RAM, it is still nice to optimize the size (either use a kmalloc friendly size or create a dedicated slab for it). And from lkml archive, there was another crash kernel OOM case [1] back in 2019, which seems to be related with the similar slab waste situation, as the log is similar: [ 4.332648] iommu: Adding device 0000:20:02.0 to group 16 [ 4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0 ... [ 4.857565] kmalloc-2048 59164KB 59164KB The crash kernel only has 256M memory, and 59M is pretty big here. (Note: the related code has been changed and optimised in recent kernel [2], these logs are just picked to demo the problem, also a patch changing its size to 1024 bytes has been merged) So add an way to track each kmalloc's memory waste info, and leverage the existing SLUB debug framework (specifically SLUB_STORE_USER) to show its call stack of original allocation, so that user can evaluate the waste situation, identify some hot spots and optimize accordingly, for a better utilization of memory. The waste info is integrated into existing interface: '/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of 'kmalloc-4k' after boot is: 126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1 __kmem_cache_alloc_node+0x11f/0x4e0 __kmalloc_node+0x4e/0x140 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe] ixgbe_probe+0x165f/0x1d20 [ixgbe] local_pci_probe+0x78/0xc0 work_for_cpu_fn+0x26/0x40 ... which means in 'kmalloc-4k' slab, there are 126 requests of 2240 bytes which got a 4KB space (wasting 1856 bytes each and 233856 bytes in total), from ixgbe_alloc_q_vector(). And when system starts some real workload like multiple docker instances, there could are more severe waste. [1]. https://lkml.org/lkml/2019/8/12/266 [2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/ [Thanks Hyeonggon for pointing out several bugs about sorting/format] [Thanks Vlastimil for suggesting way to reduce memory usage of orig_size and keep it only for kmalloc objects] Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: John Garry <john.garry@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
---|---|---|
.. | ||
ABI | ||
accounting | ||
admin-guide | ||
arc | ||
arm | ||
arm64 | ||
block | ||
bpf | ||
cdrom | ||
core-api | ||
cpu-freq | ||
crypto | ||
dev-tools | ||
devicetree | ||
doc-guide | ||
driver-api | ||
fault-injection | ||
fb | ||
features | ||
filesystems | ||
firmware_class | ||
firmware-guide | ||
fpga | ||
gpu | ||
hid | ||
hwmon | ||
i2c | ||
ia64 | ||
iio | ||
images | ||
infiniband | ||
input | ||
isdn | ||
kbuild | ||
kernel-hacking | ||
leds | ||
litmus-tests | ||
livepatch | ||
locking | ||
loongarch | ||
m68k | ||
maintainer | ||
mhi | ||
mips | ||
misc-devices | ||
mm | ||
netlabel | ||
networking | ||
nios2 | ||
nvdimm | ||
openrisc | ||
parisc | ||
PCI | ||
pcmcia | ||
peci | ||
power | ||
powerpc | ||
process | ||
RCU | ||
riscv | ||
s390 | ||
scheduler | ||
scsi | ||
security | ||
sh | ||
sound | ||
sparc | ||
sphinx | ||
sphinx-static | ||
spi | ||
staging | ||
target | ||
timers | ||
tools | ||
trace | ||
translations | ||
usb | ||
userspace-api | ||
virt | ||
w1 | ||
watchdog | ||
x86 | ||
xtensa | ||
.gitignore | ||
arch.rst | ||
asm-annotations.rst | ||
atomic_bitops.txt | ||
atomic_t.txt | ||
Changes | ||
CodingStyle | ||
conf.py | ||
docutils.conf | ||
dontdiff | ||
index.rst | ||
Kconfig | ||
Makefile | ||
memory-barriers.txt | ||
SubmittingPatches |