linux/fs
Vasily Averin 79f6540ba8 memcg: enable accounting for mnt_cache entries
Patch series "memcg accounting from OpenVZ", v7.

OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1.  Now we're
rebasing again, revising our old patches and trying to push them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
list, though I would be very grateful for any comments from maintainersi
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and its Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects itself
- IPC objects: semaphores, message queues and share memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have an incorrect/incomplete/obsoleted accounting for few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and probably will be dropped at all.

Also we're going to add an accounting for nft, however it is not ready
yet.

We have not tested performance on upstream, however, our performance team
compares our current RHEL7-based production kernel and reports that they
are at least not worse as the according original RHEL7 kernel.

This patch (of 10):

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts, and this
can be repeated many times.  Additionally, each mount allocates up to
PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:12 -07:00
..
9p 9p for 5.13-rc1 2021-05-07 11:18:52 -07:00
adfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
affs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
afs afs: Remove redundant assignment to ret 2021-07-21 15:11:22 +01:00
autofs autofs: should_expire() argument is guaranteed to be positive 2021-03-24 14:14:27 -04:00
befs fs/befs: Delete obsolete TODO file 2021-03-30 16:54:49 -07:00
bfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
btrfs for-5.14-rc7-tag 2021-08-26 11:05:11 -07:00
cachefiles fscache, cachefiles: Add alternate API to use kiocb for read/write to cache 2021-04-23 10:14:32 +01:00
ceph ceph: fix possible null-pointer dereference in ceph_mdsmap_decode() 2021-08-25 16:34:11 +02:00
cifs cifs: Call close synchronously during unlink/rename/lease break. 2021-08-12 11:29:58 -05:00
coda coda: fix reference counting in coda_file_mmap error path 2021-04-23 14:42:39 -07:00
configfs configfs: restore the kernel v5.13 text attribute write behavior 2021-08-09 16:56:00 +02:00
cramfs cramfs: use %pD instead of messing with file_dentry()->d_name 2021-01-05 23:02:47 -05:00
crypto fscrypt: fix derivation of SipHash keys on big endian CPUs 2021-06-05 00:52:52 -07:00
debugfs Linux 5.13-rc6 2021-06-14 09:07:45 +02:00
devpts
dlm fs: dlm: invalid buffer access in lookup error 2021-06-11 12:44:47 -05:00
ecryptfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
efivarfs efivars: convert to fileattr 2021-04-12 15:04:29 +02:00
efs
erofs erofs: clean up file headers & footers 2021-06-08 00:41:24 +08:00
exfat Description for this pull request: 2021-07-06 11:06:04 -07:00
exportfs exportfs: Add a function to return the raw output from fh_to_dentry() 2020-12-09 09:39:38 -05:00
ext2 fs/ext2: Avoid page_address on pages returned by ext2_get_page 2021-07-16 12:36:51 +02:00
ext4 ext4: fix potential htree corruption when growing large_dir directories 2021-08-06 13:00:49 -04:00
f2fs f2fs: drop dirty node pages when cp is in error status 2021-07-06 22:05:06 -07:00
fat mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
freevxfs
fscache fscache, cachefiles: Add alternate API to use kiocb for read/write to cache 2021-04-23 10:14:32 +01:00
fuse Merge branch 'for-5.14/dax' into libnvdimm-fixes 2021-08-11 12:04:43 -07:00
gfs2 Various minor gfs2 cleanups and fixes 2021-06-29 20:23:08 -07:00
hfs hfs: add lock nesting notation to hfs_find_init 2021-07-15 10:13:49 -07:00
hfsplus hfsplus: report create_date to kstat.btime 2021-07-01 11:06:06 -07:00
hostfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
hpfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
hugetlbfs hugetlbfs: fix mount mode command line processing 2021-07-23 17:43:28 -07:00
iomap iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor 2021-07-15 09:58:06 -07:00
isofs isofs: remove redundant continue statement 2021-06-17 17:11:42 +02:00
jbd2 ext4: inline jbd2_journal_[un]register_shrinker() 2021-07-08 08:37:31 -04:00
jffs2 This pull request contains changes for JFFS2, UBI and UBIFS 2021-05-04 18:08:40 -07:00
jfs JFS fixes for 5.14 2021-07-02 14:25:17 -07:00
kernfs Driver core changes for 5.14-rc1 2021-07-05 13:51:41 -07:00
lockd lockd: Update the NLMv4 SHARE results encoder to use struct xdr_stream 2021-07-06 20:14:44 -04:00
minix mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
netfs netfs: fix test for whether we can skip read when writing beyond EOF 2021-06-21 21:24:07 +01:00
nfs NFS client updates for Linux 5.14 2021-07-09 09:43:57 -07:00
nfs_common nfs_common: fix doc warning 2021-07-06 20:14:41 -04:00
nfsd block-5.14-2021-07-08 2021-07-09 12:05:33 -07:00
nilfs2 Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
nls
notify ucounts: add missing data type changes 2021-08-09 15:45:02 -05:00
ntfs Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:30:04 -07:00
ocfs2 ocfs2: ocfs2_downconvert_lock failure results in deadlock 2021-09-03 09:58:09 -07:00
omfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
openpromfs openpromfs: don't do unlock_new_inode() until the new inode is set up 2021-03-12 22:15:22 -05:00
orangefs orangefs: fix orangefs df output. 2021-06-28 08:40:08 -04:00
overlayfs ovl: fix uninitialized pointer read in ovl_lookup_real_one() 2021-08-10 10:21:30 +02:00
proc Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:41:14 -07:00
pstore for-5.14/drivers-2021-06-29 2021-06-30 12:21:16 -07:00
qnx4
qnx6
quota quota: remove unnecessary oom message 2021-06-22 10:40:52 +02:00
ramfs fs: move ramfs_aops to libfs 2021-06-29 10:53:48 -07:00
reiserfs reiserfs: check directory items on read from disk 2021-07-16 12:36:51 +02:00
romfs
squashfs squashfs: add option to panic on errors 2021-06-29 10:53:46 -07:00
sysfs sysfs: Support zapping of binary attr mmaps 2021-01-12 14:26:31 +01:00
sysv mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
tracefs tracing: Fix various typos in comments 2021-03-23 14:08:18 -04:00
ubifs ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inode 2021-06-22 09:21:39 +02:00
udf \n 2021-07-01 12:06:39 -07:00
ufs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
unicode .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
vboxsf vboxsf: Add support for the atomic_open directory-inode op 2021-06-23 14:36:52 +02:00
verity fsverity: relax build time dependency on CRYPTO_SHA256 2021-04-22 17:31:32 +10:00
xfs Fixes for 5.14-rc4: 2021-08-01 12:07:23 -07:00
zonefs zonefs: remove redundant null bio check 2021-07-16 13:45:18 +09:00
aio.c Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio" 2021-04-30 11:20:39 -07:00
anon_inodes.c fs: anon_inodes: rephrase to appropriate kernel-doc 2021-01-15 12:17:25 -05:00
attr.c ima: handle idmapped mounts 2021-01-24 14:27:20 +01:00
bad_inode.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
binfmt_aout.c binfmt: remove in-tree usage of MAP_EXECUTABLE 2021-06-29 10:53:50 -07:00
binfmt_elf_fdpic.c Merge branch 'akpm' (patches from Andrew) 2021-06-29 17:29:11 -07:00
binfmt_elf.c Merge branch 'akpm' (patches from Andrew) 2021-06-29 17:29:11 -07:00
binfmt_flat.c binfmt: remove in-tree usage of MAP_EXECUTABLE 2021-06-29 10:53:50 -07:00
binfmt_misc.c binfmt_misc: fix possible deadlock in bm_register_write 2021-03-13 11:27:30 -08:00
binfmt_script.c
block_dev.c block-5.14-2021-07-30 2021-07-30 11:08:12 -07:00
buffer.c mm/writeback: move __set_page_dirty() to core mm 2021-06-29 10:53:48 -07:00
char_dev.c
compat_binfmt_elf.c get rid of COMPAT_ELF_EXEC_PAGESIZE 2021-01-06 08:42:51 -05:00
coredump.c Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:41:14 -07:00
d_path.c getcwd(2): clean up error handling 2021-05-18 20:15:58 -04:00
dax.c Merge branch 'for-5.14/dax' into libnvdimm-fixes 2021-08-11 12:04:43 -07:00
dcache.c useful constants: struct qstr for ".." 2021-04-15 22:36:45 -04:00
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-09 14:54:23 -07:00
drop_caches.c fs: drop_caches: fix skipping over shadow cache inodes 2021-09-03 09:58:10 -07:00
eventfd.c
eventpoll.c fs/epoll: restore waking from ep_done_scan() 2021-05-06 19:24:13 -07:00
exec.c Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
fcntl.c fcntl: Fix unreachable code in do_fcntl() 2021-07-12 11:09:13 -05:00
fhandle.c switch file_open_root() to struct path 2021-04-07 13:56:43 -04:00
file_table.c
file.c Merge branch 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-03 11:05:28 -07:00
filesystems.c
fs_context.c memcg: charge fs_context and legacy_fs_context 2021-09-03 09:58:12 -07:00
fs_parser.c vfs: fs_parser: clean up kernel-doc warnings 2021-04-30 11:20:35 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c writeback: memcg: simplify cgroup_writeback_by_id 2021-09-03 09:58:10 -07:00
fsopen.c
init.c init: handle idmapped mounts 2021-01-24 14:27:19 +01:00
inode.c fs: inode: count invalidated shadow pages in pginodesteal 2021-09-03 09:58:10 -07:00
internal.h cgroup1: fix leaked context root causing sporadic NULL deref in LTP 2021-07-21 06:39:20 -10:00
io_uring.c io_uring: fix xa_alloc_cycle() error return value check 2021-08-20 14:59:58 -06:00
io-wq.c io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker() 2021-08-09 19:59:06 -06:00
io-wq.h io_uring: move creds from io-wq work to io_kiocb 2021-06-18 09:22:02 -06:00
ioctl.c vfs: add fileattr ops 2021-04-12 15:04:23 +02:00
Kconfig mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON 2021-06-30 20:47:26 -07:00
Kconfig.binfmt binfmt: remove support for em86 (alpha only) 2021-07-25 22:33:03 -07:00
kernel_read_file.c switch file_open_root() to struct path 2021-04-07 13:56:43 -04:00
libfs.c fs: remove noop_set_page_dirty() 2021-06-29 10:53:48 -07:00
locks.c Additional fixes and clean-ups for NFSD since tags/nfsd-5.13, 2021-05-05 13:44:19 -07:00
Makefile binfmt: remove support for em86 (alpha only) 2021-07-25 22:33:03 -07:00
mbcache.c
mount.h mount: make {lock,unlock}_mount_hash() static 2021-01-24 14:29:34 +01:00
mpage.c block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
namei.c fs, mm: fix race in unlinking swapfile 2021-09-03 09:58:11 -07:00
namespace.c memcg: enable accounting for mnt_cache entries 2021-09-03 09:58:12 -07:00
no-block.c
nsfs.c
open.c Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:41:14 -07:00
pipe.c mm/gup: remove try_get_page(), call try_get_compound_head() directly 2021-09-03 09:58:11 -07:00
pnode.c
pnode.h mount: fix mounting of detached mounts onto targets that reside on shared mounts 2021-03-08 15:18:43 +01:00
posix_acl.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
proc_namespace.c fs: introduce MOUNT_ATTR_IDMAP 2021-01-24 14:43:45 +01:00
read_write.c teach sendfile(2) to handle send-to-pipe directly 2021-01-25 23:29:36 -05:00
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-17 11:39:49 -07:00
remap_range.c ioctl: handle idmapped mounts 2021-01-24 14:27:19 +01:00
select.c kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data() 2021-03-16 22:13:10 +01:00
seq_file.c seq_file: disallow extremely large seq buffer allocations 2021-07-19 17:18:48 -07:00
signalfd.c signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo 2021-05-18 16:20:54 -05:00
splice.c for-5.12/block-2021-02-17 2021-02-21 11:02:48 -08:00
stack.c
stat.c fs: fix reporting supported extra file attributes for statx() 2021-04-17 23:03:50 -04:00
statfs.c s390,alpha: switch to 64-bit ino_t 2021-02-13 17:17:53 +01:00
super.c block: move bd_mutex to struct gendisk 2021-06-01 07:44:32 -06:00
sync.c
timerfd.c
userfaultfd.c userfaultfd: do not untag user pointers 2021-07-23 17:43:28 -07:00
utimes.c utimes: handle idmapped mounts 2021-01-24 14:27:18 +01:00
xattr.c xattr: fix kernel-doc for mnt_userns and vfs xattr helpers 2021-03-23 11:20:26 +01:00