linux/arch/x86
Kirill A. Shutemov dc6c9a35b6 mm: account pmd page tables to the process
Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
kernel doesn't account PMD tables to the process, only PTE.

The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		prctl(PR_SET_THP_DISABLE);
		for (i = 0; i < NR_PUD ; i++) {
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			*addr = 'x';
			munmap(addr, PMD_SIZE);
			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
			if (addr == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}

The patch addresses the issue by account PMD tables to the process the
same way we account PTE.

The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:

 - HugeTLB can share PMD page tables. The patch handles by accounting
   the table to all processes who share it.

 - x86 PAE pre-allocates few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
   check on exit(2).

Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
counter value is used to calculate baseline for badness score by
oom-killer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:04 -08:00
..
boot Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 17:50:09 -08:00
configs x86/kconfig/defconfig: Enable CONFIG_FHANDLE=y 2014-12-08 12:04:17 +01:00
crypto crypto: add missing crypto module aliases 2015-01-13 22:29:11 +11:00
ia32 x86: ia32entry.S: fix wrong symbolic constant usage: R11->ARGOFFSET 2015-01-13 14:10:31 -08:00
include mm: make FIRST_USER_ADDRESS unsigned long on all archs 2015-02-11 17:06:03 -08:00
kernel Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching 2015-02-10 18:35:40 -08:00
kvm Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 14:28:42 -08:00
lguest x86: Avoid building unused IRQ entry stubs 2014-12-16 14:08:14 +01:00
lib x86: Fix off-by-one in instruction decoder 2015-01-09 11:12:26 +01:00
math-emu
mm mm: account pmd page tables to the process 2015-02-11 17:06:04 -08:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-12-10 15:48:20 -05:00
oprofile percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t 2014-08-28 08:58:57 -04:00
pci ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
platform Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-19 14:02:02 -08:00
power nosave: consolidate __nosave_{begin,end} in <asm/sections.h> 2014-10-09 22:26:04 -04:00
purgatory Merge branches 'x86-build-for-linus', 'x86-cleanups-for-linus' and 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-10 12:35:46 -08:00
realmode
syscalls x86: hook up execveat system call 2014-12-13 12:42:51 -08:00
tools x86, build: replace Perl script with Shell script 2015-01-26 13:37:18 -08:00
um x86, um: actually mark system call tables readonly 2015-01-04 14:21:25 +01:00
vdso x86, vdso: teach 'make clean' remove vdso64 binaries 2015-01-28 18:44:18 -08:00
video
xen xen: mark grant mapped pages as foreign 2015-01-28 14:03:12 +00:00
.gitignore x86/build: Add arch/x86/purgatory/ make generated files to gitignore 2014-10-09 09:29:46 +02:00
Kbuild kexec: create a new config option CONFIG_KEXEC_FILE for new syscall 2014-08-29 16:28:16 -07:00
Kconfig Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching 2015-02-10 18:35:40 -08:00
Kconfig.cpu
Kconfig.debug
Makefile Merge branch 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-10-13 18:17:33 +02:00
Makefile_32.cpu
Makefile.um