linux/Documentation/sysctl
Kirill A. Shutemov dc6c9a35b6 mm: account pmd page tables to the process
Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
kernel doesn't account PMD tables to the process, only PTE.

The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		prctl(PR_SET_THP_DISABLE);
		for (i = 0; i < NR_PUD ; i++) {
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			*addr = 'x';
			munmap(addr, PMD_SIZE);
			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
			if (addr == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}

The patch addresses the issue by account PMD tables to the process the
same way we account PTE.

The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:

 - HugeTLB can share PMD page tables. The patch handles by accounting
   the table to all processes who share it.

 - x86 PAE pre-allocates few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
   check on exit(2).

Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
counter value is used to calculate baseline for badness score by
oom-killer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:04 -08:00
..
00-INDEX sysctl: remove obsolete comments 2011-01-13 08:03:18 -08:00
abi.txt
fs.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
kernel.txt livepatch: kernel: add TAINT_LIVEPATCH 2014-12-22 15:40:48 +01:00
net.txt net-timestamp: no-payload only sysctl 2015-02-02 18:46:51 -08:00
README
sunrpc.txt
vm.txt mm: account pmd page tables to the process 2015-02-11 17:06:04 -08:00

Documentation for /proc/sys/		kernel version 2.2.10
	(c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>

'Why', I hear you ask, 'would anyone even _want_ documentation
for them sysctl files? If anybody really needs it, it's all in
the source...'

Well, this documentation is written because some people either
don't know they need to tweak something, or because they don't
have the time or knowledge to read the source code.

Furthermore, the programmers who built sysctl have built it to
be actually used, not just for the fun of programming it :-)

==============================================================

Legal blurb:

As usual, there are two main things to consider:
1. you get what you pay for
2. it's free

The consequences are that I won't guarantee the correctness of
this document, and if you come to me complaining about how you
screwed up your system because of wrong documentation, I won't
feel sorry for you. I might even laugh at you...

But of course, if you _do_ manage to screw up your system using
only the sysctl options used in this file, I'd like to hear of
it. Not only to have a great laugh, but also to make sure that
you're the last RTFMing person to screw up.

In short, e-mail your suggestions, corrections and / or horror
stories to: <riel@nl.linux.org>

Rik van Riel.

==============================================================

Introduction:

Sysctl is a means of configuring certain aspects of the kernel
at run-time, and the /proc/sys/ directory is there so that you
don't even need special tools to do it!
In fact, there are only four things needed to use these config
facilities:
- a running Linux system
- root access
- common sense (this is especially hard to come by these days)
- knowledge of what all those values mean

As a quick 'ls /proc/sys' will show, the directory consists of
several (arch-dependent?) subdirs. Each subdir is mainly about
one part of the kernel, so you can do configuration on a piece
by piece basis, or just some 'thematic frobbing'.

The subdirs are about:
abi/		execution domains & personalities
debug/		<empty>
dev/		device specific information (eg dev/cdrom/info)
fs/		specific filesystems
		filehandle, inode, dentry and quota tuning
		binfmt_misc <Documentation/binfmt_misc.txt>
kernel/		global kernel info / tuning
		miscellaneous stuff
net/		networking stuff, for documentation look in:
		<Documentation/networking/>
proc/		<empty>
sunrpc/		SUN Remote Procedure Call (NFS)
vm/		memory management tuning
		buffer and cache management

These are the subdirs I have on my system. There might be more
or other subdirs in another setup. If you see another dir, I'd
really like to hear about it :-)