linux/fs
Vladimir Davydov 503c358cf1 list_lru: introduce list_lru_shrink_{count,walk}
Kmem accounting of memcg is unusable now, because it lacks slab shrinker
support.  That means when we hit the limit we will get ENOMEM w/o any
chance to recover.  What we should do then is to call shrink_slab, which
would reclaim old inode/dentry caches from this cgroup.  This is what
this patch set is intended to do.

Basically, it does two things.  First, it introduces the notion of
per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
passed the memory cgroup to scan from in shrink_control->memcg.  For
such shrinkers shrink_slab iterates over the whole cgroup subtree under
the target cgroup and calls the shrinker for each kmem-active memory
cgroup.

Secondly, this patch set makes the list_lru structure per-memcg.  It's
done transparently to list_lru users - everything they have to do is to
tell list_lru_init that they want memcg-aware list_lru.  Then the
list_lru will automatically distribute objects among per-memcg lists
basing on which cgroup the object is accounted to.  This way to make FS
shrinkers (icache, dcache) memcg-aware we only need to make them use
memcg-aware list_lru, and this is what this patch set does.

As before, this patch set only enables per-memcg kmem reclaim when the
pressure goes from memory.limit, not from memory.kmem.limit.  Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
it is still unclear whether we will have this knob in the unified
hierarchy.

This patch (of 9):

NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists.  Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:

        count = list_lru_count_node(lru, sc->nid);
        freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                   isolate_arg, &sc->nr_to_scan);

where sc is an instance of the shrink_control structure passed to it
from vmscan.

To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.

This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:08 -08:00
..
9p mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub 2015-02-10 14:30:30 -08:00
adfs adfs: add __printf verification, fix format/argument mismatches 2014-08-08 15:57:24 -07:00
affs fs/affs/file.c: remove obsolete pagesize check 2014-12-13 12:42:52 -08:00
afs rxrpc: make the users of rxrpc_kernel_send_data() set kvec-backed msg_iter properly 2015-02-04 01:34:14 -05:00
autofs4 assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
befs befs: remove dead code 2014-12-13 12:42:51 -08:00
bfs fs/bfs: use bfs prefix for dump_imap 2014-08-08 15:57:24 -07:00
btrfs page_writeback: put account_page_redirty() after set_page_dirty() 2015-02-11 17:06:04 -08:00
cachefiles assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
ceph Merge branch 'akpm' (patches from Andrew) 2015-02-10 16:45:56 -08:00
cifs Merge branch 'akpm' (patches from Andrew) 2015-02-10 16:45:56 -08:00
coda coda_venus_readdir(): use file_inode() 2014-12-11 16:28:12 -05:00
configfs assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
cramfs fs/cramfs/inode.c: use linux/uaccess.h 2014-08-08 15:57:25 -07:00
debugfs Driver core patches for 3.19-rc1 2014-12-14 16:10:09 -08:00
devpts fs/devpts/inode.c: convert printk to pr_foo() 2014-06-06 16:08:14 -07:00
dlm netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
ecryptfs Fixes for filename decryption and encrypted view plus a cleanup 2014-12-19 18:15:12 -08:00
efivarfs * Move efivarfs from the misc filesystem section to pseudo filesystem, 2015-01-29 19:16:40 +01:00
efs fs/efs/namei.c: return is not a function 2014-08-08 15:57:18 -07:00
exofs Boaz Harrosh - Fix broken email address 2014-10-19 20:22:32 +03:00
exportfs move d_rcu from overlapping d_child to overlapping d_alias 2014-11-03 15:20:29 -05:00
ext2 ext2: Convert to private i_dquot field 2014-11-10 10:06:10 +01:00
ext3 ext3: destroy sbi mutexes in put_super 2015-01-05 11:13:55 +01:00
ext4 Merge branch 'akpm' (patches from Andrew) 2015-02-10 16:45:56 -08:00
f2fs mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub 2015-02-10 14:30:30 -08:00
fat fat: fix data past EOF resulting from fsx testsuite 2014-12-13 12:42:51 -08:00
freevxfs
fscache fs/fscache/object-list.c: use __seq_open_private() 2014-10-13 17:52:21 +01:00
fuse mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub 2015-02-10 14:30:30 -08:00
gfs2 list_lru: introduce list_lru_shrink_{count,walk} 2015-02-12 18:54:08 -08:00
hfs fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp 2014-12-10 17:41:16 -08:00
hfsplus hfsplus: fix longname handling 2014-12-18 19:08:10 -08:00
hostfs hostfs: support rename flags 2014-08-07 14:40:09 -04:00
hpfs fs/hpfs/dnode.c: fix suspect code indent 2014-08-08 15:57:22 -07:00
hppfs vfs: make first argument of dir_context.actor typed 2014-10-31 17:48:54 -04:00
hugetlbfs mm: convert i_mmap_mutex to rwsem 2014-12-13 12:42:45 -08:00
isofs isofs: Fix bug in the way to check if the year is a leap year 2015-01-07 09:51:49 +01:00
jbd jbd: Deletion of an unnecessary check before the function call "iput" 2014-11-18 10:15:29 +01:00
jbd2 Lots of bugs fixes, including Zheng and Jan's extent status shrinker 2014-12-12 09:28:03 -08:00
jffs2 jffs2: Drop bogus if in comment 2014-11-28 18:23:44 -08:00
jfs jfs: Deletion of an unnecessary check before the function call "unload_nls" 2015-02-02 15:02:34 -06:00
kernfs kernfs: Fix kernfs_name_compare 2015-01-09 15:51:08 -08:00
lockd Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linux 2015-02-12 10:39:41 -08:00
logfs fs/logfs/readwrite.c: kernel-doc warning fixes 2014-08-06 18:01:12 -07:00
minix minix zmap block counts calculation fix 2014-08-08 15:57:20 -07:00
ncpfs Merge branch 'akpm' (patchbomb from Andrew) 2014-12-10 18:34:42 -08:00
nfs NFS client updates for Linux 3.20 2015-02-11 17:14:54 -08:00
nfs_common lockd: move lockd's grace period handling into its own module 2014-09-17 16:33:11 -04:00
nfsd nfsd: default NFSv4.2 to on 2015-02-09 14:58:50 -05:00
nilfs2 mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub 2015-02-10 14:30:30 -08:00
nls
notify fanotify: don't set FAN_ONDIR implicitly on a marks ignored mask 2015-02-10 14:30:28 -08:00
ntfs assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
ocfs2 Merge branch 'akpm' (patches from Andrew) 2015-02-10 16:45:56 -08:00
omfs FS/OMFS: block number sanity check during fill_super operation 2014-10-14 02:18:22 +02:00
openpromfs
overlayfs Merge branch 'iov_iter' into for-next 2014-12-08 20:39:29 -05:00
proc mm: /proc/pid/clear_refs: avoid split_huge_page() 2015-02-11 17:06:06 -08:00
pstore pstore: Fix sprintf format specifier in pstore_dump() 2015-01-16 16:01:29 -08:00
qnx4
qnx6 fs/qnx6: update debugging to current functions 2014-08-08 15:57:26 -07:00
quota Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2015-02-10 15:52:38 -08:00
ramfs fs/ramfs/file-nommu.c: replace count*size kzalloc by kcalloc 2014-08-08 15:57:18 -07:00
reiserfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-12-16 15:46:01 -08:00
romfs fs/romfs/super.c: add blank line after declarations 2014-08-08 15:57:25 -07:00
squashfs Squashfs: Add LZ4 compression configuration option 2014-11-27 18:48:44 +00:00
sysfs sysfs/kernfs: make read requests on pre-alloc files use the buffer. 2014-11-07 10:54:38 -08:00
sysv
ubifs mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub 2015-02-10 14:30:30 -08:00
udf udf: remove bool assignment to 0/1 2015-02-05 16:34:25 +01:00
ufs fs/ufs/balloc.c: remove unused variable 2014-10-14 02:18:20 +02:00
xfs list_lru: introduce list_lru_shrink_{count,walk} 2015-02-12 18:54:08 -08:00
aio.c aio: annotate aio_read_event_ring for sleep patterns 2015-02-03 19:29:05 -05:00
anon_inodes.c
attr.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
bad_inode.c bad_inode: add ->rename2() 2014-08-07 14:40:09 -04:00
binfmt_aout.c assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
binfmt_elf_fdpic.c handle suicide on late failure exits in execve() in search_binary_handler() 2014-10-09 02:39:00 -04:00
binfmt_elf.c Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus 2014-12-11 17:56:37 -08:00
binfmt_em86.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
binfmt_flat.c fs/binfmt_flat.c: make old_reloc() static 2014-06-04 16:54:21 -07:00
binfmt_misc.c unfuck binfmt_misc.c (broken by commit e6084d4) 2014-12-17 08:27:14 -05:00
binfmt_script.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
binfmt_som.c
block_dev.c fs: add freeze_super/thaw_super fs hooks 2014-11-17 10:35:17 +00:00
buffer.c fs: clarify rate limit suppressed buffer I/O errors 2014-10-21 13:55:11 -06:00
char_dev.c fs/char_dev.c: remove pointless assignment from __register_chrdev_region() 2014-12-10 17:41:04 -08:00
compat_binfmt_elf.c
compat_ioctl.c Bluetooth: Move HCI socket definitions into its own header file 2014-07-11 13:53:04 +03:00
compat.c vfs: make first argument of dir_context.actor typed 2014-10-31 17:48:54 -04:00
coredump.c coredump: add %i/%I in core_pattern to report the tid of the crashed thread 2014-10-14 02:18:21 +02:00
dcache.c list_lru: introduce list_lru_shrink_{count,walk} 2015-02-12 18:54:08 -08:00
dcookies.c
direct-io.c fuse: honour max_read and max_write in direct_io mode 2014-09-26 21:16:51 -04:00
drop_caches.c mm: vmscan: invoke slab shrinkers from shrink_zone() 2014-12-13 12:42:48 -08:00
eventfd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
eventpoll.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
exec.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
fcntl.c vfs: renumber FMODE_NONOTIFY and add to uniqueness check 2015-01-08 15:10:52 -08:00
fhandle.c
file_table.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-10-13 11:28:42 +02:00
file.c fs/file.c: replace get_unused_fd() with get_unused_fd_flags(0) 2014-12-10 17:41:10 -08:00
filesystems.c
fs_pin.c make fs/{namespace,super}.c forget about acct.h 2014-08-07 14:40:09 -04:00
fs_struct.c
fs-writeback.c writeback: fix a subtle race condition in I_DIRTY clearing 2014-11-04 10:42:23 -07:00
inode.c list_lru: introduce list_lru_shrink_{count,walk} 2015-02-12 18:54:08 -08:00
internal.h list_lru: introduce list_lru_shrink_{count,walk} 2015-02-12 18:54:08 -08:00
ioctl.c fsioctl.c: make generic_block_fiemap() signal-tolerant 2015-02-10 14:30:30 -08:00
Kconfig fs: Make efivarfs a pseudo filesystem, built by default with EFI 2015-01-05 14:15:58 +00:00
Kconfig.binfmt binfmt_elf: allow arch code to examine PT_LOPROC ... PT_HIPROC headers 2014-11-24 07:45:02 +01:00
libfs.c move d_rcu from overlapping d_child to overlapping d_alias 2014-11-03 15:20:29 -05:00
locks.c fs: add FL_LAYOUT lease type 2015-02-02 18:09:38 +01:00
Makefile Merge branch 'nsfs' into for-next 2014-12-10 21:31:59 -05:00
mbcache.c fs/mbcache: replace __builtin_log2() with ilog2() 2014-06-25 22:08:29 -04:00
mount.h common object embedded into various struct ....ns 2014-12-04 14:31:00 -05:00
mpage.c vfs: guard end of device for mpage interface 2014-10-09 22:25:53 -04:00
namei.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
namespace.c mnt: Fix a memory stomp in umount 2014-12-18 11:22:02 -08:00
no-block.c
nsfs.c take the targets of /proc/*/ns/* symlinks to separate fs 2014-12-10 21:30:20 -05:00
open.c Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linux 2014-12-16 15:25:31 -08:00
pipe.c
pnode.c mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers. 2014-12-02 10:46:50 -06:00
pnode.h
posix_acl.c
proc_namespace.c vfs: make mounts and mountstats honor root dir like mountinfo does 2014-12-17 08:27:15 -05:00
read_write.c locks: convert posix locks to file_lock_context 2015-01-16 16:08:16 -05:00
readdir.c vfs: make first argument of dir_context.actor typed 2014-10-31 17:48:54 -04:00
select.c
seq_file.c fs, seq_file: fallback to vmalloc instead of oom kill processes 2014-12-13 12:42:49 -08:00
signalfd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
splice.c vfs: export do_splice_direct() to modules 2014-10-24 00:14:35 +02:00
stack.c fs: fix comment for 'CONFIG_LBADF' 2014-08-26 09:35:56 +02:00
stat.c
statfs.c
super.c list_lru: introduce list_lru_shrink_{count,walk} 2015-02-12 18:54:08 -08:00
sync.c kill f_dentry uses 2014-11-19 13:01:25 -05:00
timerfd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
utimes.c
xattr.c new helper: audit_file() 2014-11-19 13:01:26 -05:00