linux/mm/shrinker_debug.c
Qi Zheng c42d50aefd mm: shrinker: add infrastructure for dynamically allocating shrinker
Patch series "use refcount+RCU method to implement lockless slab shrink",
v6.

1. Background
=============

We used to implement the lockless slab shrink with SRCU [1], but then kernel
test robot reported -88.8% regression in stress-ng.ramfs.ops_per_sec test
case [2], so we reverted it [3].

This patch series aims to re-implement the lockless slab shrink using the
refcount+RCU method proposed by Dave Chinner [4].

[1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
[4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/

2. Implementation
=================

Currently, the shrinker instances can be divided into the following three types:

a) global shrinker instance statically defined in the kernel, such as
   workingset_shadow_shrinker.

b) global shrinker instance statically defined in the kernel modules, such as
   mmu_shrinker in x86.

c) shrinker instance embedded in other structures.

For case a, the memory of shrinker instance is never freed. For case b, the
memory of shrinker instance will be freed after synchronize_rcu() when the
module is unloaded. For case c, the memory of shrinker instance will be freed
along with the structure it is embedded in.

In preparation for implementing lockless slab shrink, we need to dynamically
allocate those shrinker instances in case c, then the memory can be dynamically
freed alone by calling kfree_rcu().

This patchset adds the following new APIs for dynamically allocating shrinker,
and add a private_data field to struct shrinker to record and get the original
embedded structure.

1. shrinker_alloc()
2. shrinker_register()
3. shrinker_free()

In order to simplify shrinker-related APIs and make shrinker more independent of
other kernel mechanisms, this patchset uses the above APIs to convert all
shrinkers (including case a and b) to dynamically allocated, and then remove all
existing APIs. This will also have another advantage mentioned by Dave Chinner:

```
The other advantage of this is that it will break all the existing out of tree
code and third party modules using the old API and will no longer work with a
kernel using lockless slab shrinkers. They need to break (both at the source and
binary levels) to stop bad things from happening due to using uncoverted
shrinkers in the new setup.
```

Then we free the shrinker by calling call_rcu(), and use rcu_read_{lock,unlock}()
to ensure that the shrinker instance is valid. And the shrinker::refcount
mechanism ensures that the shrinker instance will not be run again after
unregistration. So the structure that records the pointer of shrinker instance
can be safely freed without waiting for the RCU read-side critical section.

In this way, while we implement the lockless slab shrink, we don't need to be
blocked in unregister_shrinker() to wait RCU read-side critical section.

PATCH 1: introduce new APIs
PATCH 2~38: convert all shrinnkers to use new APIs
PATCH 39: remove old APIs
PATCH 40~41: some cleanups and preparations
PATCH 42-43: implement the lockless slab shrink
PATCH 44~45: convert shrinker_rwsem to mutex

3. Testing
==========

3.1 slab shrink stress test
---------------------------

We can reproduce the down_read_trylock() hotspot through the following script:

```

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run test and touch commands. Then we can use the
following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  33.15%  [kernel]          [k] down_read_trylock
  25.38%  [kernel]          [k] shrink_slab
  21.75%  [kernel]          [k] up_read
   4.45%  [kernel]          [k] _find_next_bit
   2.27%  [kernel]          [k] do_shrink_slab
   1.80%  [kernel]          [k] intel_idle_irq
   1.79%  [kernel]          [k] shrink_lruvec
   0.67%  [kernel]          [k] xas_descend
   0.41%  [kernel]          [k] mem_cgroup_iter
   0.40%  [kernel]          [k] shrink_node
   0.38%  [kernel]          [k] list_lru_count_one

2) After applying this patchset:

  64.56%  [kernel]          [k] shrink_slab
  12.18%  [kernel]          [k] do_shrink_slab
   3.30%  [kernel]          [k] __rcu_read_unlock
   2.61%  [kernel]          [k] shrink_lruvec
   2.49%  [kernel]          [k] __rcu_read_lock
   1.93%  [kernel]          [k] intel_idle_irq
   0.89%  [kernel]          [k] shrink_node
   0.81%  [kernel]          [k] mem_cgroup_iter
   0.77%  [kernel]          [k] mem_cgroup_calculate_protection
   0.66%  [kernel]          [k] list_lru_count_one

We can see that the first perf hotspot becomes shrink_slab, which is what we
expect.

3.2 registration and unregistration stress test
-----------------------------------------------

Run the command below to test:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
ramfs            473062     60.00      8.00    279.13      7884.12        1647.59
for a 60.01s run time:
   1440.34s available CPU time
      7.99s user time   (  0.55%)
    279.13s system time ( 19.38%)
    287.12s total time  ( 19.93%)
load average: 7.12 2.99 1.15
successful run completed in 60.01s (1 min, 0.01 secs)

2) After applying this patchset:

setting to a 60 second run per stressor
dispatching hogs: 9 ramfs
stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
ramfs            477165     60.00      8.13    281.34      7952.55        1648.40
for a 60.01s run time:
   1440.33s available CPU time
      8.12s user time   (  0.56%)
    281.34s system time ( 19.53%)
    289.46s total time  ( 20.10%)
load average: 6.98 3.03 1.19
successful run completed in 60.01s (1 min, 0.01 secs)

We can see that the ops/s has hardly changed.


This patch (of 45):

Currently, the shrinker instances can be divided into the following three
types:

a) global shrinker instance statically defined in the kernel, such as
   workingset_shadow_shrinker.

b) global shrinker instance statically defined in the kernel modules, such
   as mmu_shrinker in x86.

c) shrinker instance embedded in other structures.

For case a, the memory of shrinker instance is never freed. For case b,
the memory of shrinker instance will be freed after synchronize_rcu() when
the module is unloaded. For case c, the memory of shrinker instance will
be freed along with the structure it is embedded in.

In preparation for implementing lockless slab shrink, we need to
dynamically allocate those shrinker instances in case c, then the memory
can be dynamically freed alone by calling kfree_rcu().

So this commit adds the following new APIs for dynamically allocating
shrinker, and add a private_data field to struct shrinker to record and
get the original embedded structure.

1. shrinker_alloc()

Used to allocate shrinker instance itself and related memory, it will
return a pointer to the shrinker instance on success and NULL on failure.

2. shrinker_register()

Used to register the shrinker instance, which is same as the current
register_shrinker_prepared().

3. shrinker_free()

Used to unregister (if needed) and free the shrinker instance.

In order to simplify shrinker-related APIs and make shrinker more
independent of other kernel mechanisms, subsequent submissions will use
the above API to convert all shrinkers (including case a and b) to
dynamically allocated, and then remove all existing APIs.

This will also have another advantage mentioned by Dave Chinner:

```
The other advantage of this is that it will break all the existing
out of tree code and third party modules using the old API and will
no longer work with a kernel using lockless slab shrinkers. They
need to break (both at the source and binary levels) to stop bad
things from happening due to using unconverted shrinkers in the new
setup.
```

[zhengqi.arch@bytedance.com: mm: shrinker: some cleanup]
  Link: https://lkml.kernel.org/r/20230919024607.65463-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20230911094444.68966-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20230911094444.68966-2-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Chuck Lever <cel@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sean Paul <sean@poorly.run>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:23 -07:00

280 lines
6.1 KiB
C

// SPDX-License-Identifier: GPL-2.0
#include <linux/idr.h>
#include <linux/slab.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/shrinker.h>
#include <linux/memcontrol.h>
#include "internal.h"
/* defined in vmscan.c */
extern struct rw_semaphore shrinker_rwsem;
extern struct list_head shrinker_list;
static DEFINE_IDA(shrinker_debugfs_ida);
static struct dentry *shrinker_debugfs_root;
static unsigned long shrinker_count_objects(struct shrinker *shrinker,
struct mem_cgroup *memcg,
unsigned long *count_per_node)
{
unsigned long nr, total = 0;
int nid;
for_each_node(nid) {
if (nid == 0 || (shrinker->flags & SHRINKER_NUMA_AWARE)) {
struct shrink_control sc = {
.gfp_mask = GFP_KERNEL,
.nid = nid,
.memcg = memcg,
};
nr = shrinker->count_objects(shrinker, &sc);
if (nr == SHRINK_EMPTY)
nr = 0;
} else {
nr = 0;
}
count_per_node[nid] = nr;
total += nr;
}
return total;
}
static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
{
struct shrinker *shrinker = m->private;
unsigned long *count_per_node;
struct mem_cgroup *memcg;
unsigned long total;
bool memcg_aware;
int ret = 0, nid;
count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
if (!count_per_node)
return -ENOMEM;
rcu_read_lock();
memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
if (memcg && !mem_cgroup_online(memcg))
continue;
total = shrinker_count_objects(shrinker,
memcg_aware ? memcg : NULL,
count_per_node);
if (total) {
seq_printf(m, "%lu", mem_cgroup_ino(memcg));
for_each_node(nid)
seq_printf(m, " %lu", count_per_node[nid]);
seq_putc(m, '\n');
}
if (!memcg_aware) {
mem_cgroup_iter_break(NULL, memcg);
break;
}
if (signal_pending(current)) {
mem_cgroup_iter_break(NULL, memcg);
ret = -EINTR;
break;
}
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
rcu_read_unlock();
kfree(count_per_node);
return ret;
}
DEFINE_SHOW_ATTRIBUTE(shrinker_debugfs_count);
static int shrinker_debugfs_scan_open(struct inode *inode, struct file *file)
{
file->private_data = inode->i_private;
return nonseekable_open(inode, file);
}
static ssize_t shrinker_debugfs_scan_write(struct file *file,
const char __user *buf,
size_t size, loff_t *pos)
{
struct shrinker *shrinker = file->private_data;
unsigned long nr_to_scan = 0, ino, read_len;
struct shrink_control sc = {
.gfp_mask = GFP_KERNEL,
};
struct mem_cgroup *memcg = NULL;
int nid;
char kbuf[72];
read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1);
if (copy_from_user(kbuf, buf, read_len))
return -EFAULT;
kbuf[read_len] = '\0';
if (sscanf(kbuf, "%lu %d %lu", &ino, &nid, &nr_to_scan) != 3)
return -EINVAL;
if (nid < 0 || nid >= nr_node_ids)
return -EINVAL;
if (nr_to_scan == 0)
return size;
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
memcg = mem_cgroup_get_from_ino(ino);
if (!memcg || IS_ERR(memcg))
return -ENOENT;
if (!mem_cgroup_online(memcg)) {
mem_cgroup_put(memcg);
return -ENOENT;
}
} else if (ino != 0) {
return -EINVAL;
}
sc.nid = nid;
sc.memcg = memcg;
sc.nr_to_scan = nr_to_scan;
sc.nr_scanned = nr_to_scan;
shrinker->scan_objects(shrinker, &sc);
mem_cgroup_put(memcg);
return size;
}
static const struct file_operations shrinker_debugfs_scan_fops = {
.owner = THIS_MODULE,
.open = shrinker_debugfs_scan_open,
.write = shrinker_debugfs_scan_write,
};
int shrinker_debugfs_add(struct shrinker *shrinker)
{
struct dentry *entry;
char buf[128];
int id;
lockdep_assert_held(&shrinker_rwsem);
/* debugfs isn't initialized yet, add debugfs entries later. */
if (!shrinker_debugfs_root)
return 0;
id = ida_alloc(&shrinker_debugfs_ida, GFP_KERNEL);
if (id < 0)
return id;
shrinker->debugfs_id = id;
snprintf(buf, sizeof(buf), "%s-%d", shrinker->name, id);
/* create debugfs entry */
entry = debugfs_create_dir(buf, shrinker_debugfs_root);
if (IS_ERR(entry)) {
ida_free(&shrinker_debugfs_ida, id);
return PTR_ERR(entry);
}
shrinker->debugfs_entry = entry;
debugfs_create_file("count", 0440, entry, shrinker,
&shrinker_debugfs_count_fops);
debugfs_create_file("scan", 0220, entry, shrinker,
&shrinker_debugfs_scan_fops);
return 0;
}
int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
{
struct dentry *entry;
char buf[128];
const char *new, *old;
va_list ap;
int ret = 0;
va_start(ap, fmt);
new = kvasprintf_const(GFP_KERNEL, fmt, ap);
va_end(ap);
if (!new)
return -ENOMEM;
down_write(&shrinker_rwsem);
old = shrinker->name;
shrinker->name = new;
if (shrinker->debugfs_entry) {
snprintf(buf, sizeof(buf), "%s-%d", shrinker->name,
shrinker->debugfs_id);
entry = debugfs_rename(shrinker_debugfs_root,
shrinker->debugfs_entry,
shrinker_debugfs_root, buf);
if (IS_ERR(entry))
ret = PTR_ERR(entry);
else
shrinker->debugfs_entry = entry;
}
up_write(&shrinker_rwsem);
kfree_const(old);
return ret;
}
EXPORT_SYMBOL(shrinker_debugfs_rename);
struct dentry *shrinker_debugfs_detach(struct shrinker *shrinker,
int *debugfs_id)
{
struct dentry *entry = shrinker->debugfs_entry;
lockdep_assert_held(&shrinker_rwsem);
*debugfs_id = entry ? shrinker->debugfs_id : -1;
shrinker->debugfs_entry = NULL;
return entry;
}
void shrinker_debugfs_remove(struct dentry *debugfs_entry, int debugfs_id)
{
debugfs_remove_recursive(debugfs_entry);
ida_free(&shrinker_debugfs_ida, debugfs_id);
}
static int __init shrinker_debugfs_init(void)
{
struct shrinker *shrinker;
struct dentry *dentry;
int ret = 0;
dentry = debugfs_create_dir("shrinker", NULL);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
shrinker_debugfs_root = dentry;
/* Create debugfs entries for shrinkers registered at boot */
down_write(&shrinker_rwsem);
list_for_each_entry(shrinker, &shrinker_list, list)
if (!shrinker->debugfs_entry) {
ret = shrinker_debugfs_add(shrinker);
if (ret)
break;
}
up_write(&shrinker_rwsem);
return ret;
}
late_initcall(shrinker_debugfs_init);