forked from Minki/linux
a4ff8e8620
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.

This started as a follow-up discussion [3][4] resulting from the runtime
failure caused by hardening patch [5], which removes MAP_FIXED from the
elf loader because MAP_FIXED is inherently dangerous as it might silently
clobber an existing underlying mapping (e.g. stack). The reason for the
failure is that some architectures enforce an alignment for the given
address hint when MAP_FIXED is not used (e.g. for shared or file backed
mappings).

One way around this would be excluding those archs which do alignment
tricks from the hardening [6]. The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution. We basically want a non-destructive MAP_FIXED.

The first patch introduces MAP_FIXED_NOREPLACE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one. The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility. We really want a never-clobber semantic even on older
kernels which do not recognize the flag. Unfortunately mmap sucks wrt
flags evaluation because we do not EINVAL on unknown flags. On those
kernels we would simply use the traditional hint based semantic so the
caller can still get a different address (which sucks) but at least not
silently corrupt an existing mapping. I do not see a good way around
that, except not exposing the new semantic to userspace at all.

It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7].

Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints. This means that the kernel
: will map right next to each other, and the offsets between them are
: completely predictable. We would like to change that and supply a random
: address in a window of the address space. If there is a conflict, we do
: not want the kernel to pick a non-random address. Instead, we would try
: again with a random address.

John Hubbard has mentioned a CUDA example:
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space. "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
:     p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
:     IMPROVEMENT: --> if instead, we could call this:
:
:         p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
:     , then we could skip the munmap() call upon failure. This
:     is a small thing, but it is useful here. (Thanks to Piotr
:     Jaroszynski and Mark Hairgrove for helping me get that detail
:     exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
:     q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.

Atomic address range probing in multithreaded programs in general sounds
like an interesting thing to me.

The second patch simply replaces MAP_FIXED use in the elf loader by
MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED
should follow. Actually, real MAP_FIXED usages should be documented
properly and they should be more of an exception.

[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

This patch (of 2):

MAP_FIXED is used quite often to enforce mapping at a particular range.
The main problem of this flag is, however, that it is inherently
dangerous because it unmaps existing mappings covered by the requested
range. This can cause silent memory corruptions, some of them even with
serious security implications. While the current semantic might be
really desirable in many cases, there are others which would want to
enforce the given range but rather see a failure than a silent memory
corruption on a clashing range. Please note that there is no guarantee
that a given range is obeyed by mmap even when it is free - e.g. arch
specific code is allowed to apply an alignment.

Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
behavior. It has the same semantic as MAP_FIXED wrt. the given address
request with a single exception that it fails with EEXIST if the
requested address is already covered by an existing mapping. We still
rely on get_unmapped_area to handle all the arch specific MAP_FIXED
treatment and check for a conflicting vma after it returns.

The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility. We really want a
never-clobber semantic even on older kernels which do not recognize the
flag. Unfortunately mmap sucks wrt. flags evaluation because we do not
EINVAL on unknown flags. On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping. I do not see a good way around that.

[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
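
The userspace side of the semantic described above could look roughly like the following. This is only a sketch under stated assumptions: map_exact() is a hypothetical helper name, not part of the patch or of any libc, and the fallback #define uses the asm-generic value, which differs on some architectures. The point it illustrates is that a caller has to handle both the EEXIST failure on kernels that know the flag and the "different address returned" case on older kernels that silently treat the address as a hint.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Fallback for older libc headers. 0x80000 is the asm-generic value;
 * some architectures define a different one, so prefer the system header.
 */
#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x80000
#endif

/*
 * map_exact() is a hypothetical helper for illustration only.
 *
 * Try to create an anonymous private mapping at exactly addr without
 * clobbering anything that is already mapped there. Returns addr on
 * success, or NULL with errno set. errno is EEXIST when the range is
 * already occupied, both on kernels that implement the flag and on
 * older kernels that ignored it and picked another address.
 */
static void *map_exact(void *addr, size_t len)
{
	void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
		       -1, 0);

	if (p == MAP_FAILED)
		return NULL;	/* errno set by mmap, EEXIST on a clash */

	if (p != addr) {
		/*
		 * The kernel did not honour the request (e.g. an older
		 * kernel used addr as a hint). Undo the mapping and
		 * report the same error a newer kernel would give.
		 */
		munmap(p, len);
		errno = EEXIST;
		return NULL;
	}

	return p;
}
```

A caller that wants the retry behaviour Florian Weimer describes would call such a helper in a loop, choosing a new candidate address each time it fails with EEXIST.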
76 lines
3.0 KiB
C
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef __ASM_GENERIC_MMAN_COMMON_H
#define __ASM_GENERIC_MMAN_COMMON_H

/*
 Author: Michael S. Tsirkin <mst@mellanox.co.il>, Mellanox Technologies Ltd.
 Based on: asm-xxx/mman.h
*/

#define PROT_READ       0x1             /* page can be read */
#define PROT_WRITE      0x2             /* page can be written */
#define PROT_EXEC       0x4             /* page can be executed */
#define PROT_SEM        0x8             /* page may be used for atomic ops */
#define PROT_NONE       0x0             /* page can not be accessed */
#define PROT_GROWSDOWN  0x01000000      /* mprotect flag: extend change to start of growsdown vma */
#define PROT_GROWSUP    0x02000000      /* mprotect flag: extend change to end of growsup vma */

#define MAP_SHARED      0x01            /* Share changes */
#define MAP_PRIVATE     0x02            /* Changes are private */
#define MAP_SHARED_VALIDATE 0x03        /* share + validate extension flags */
#define MAP_TYPE        0x0f            /* Mask for type of mapping */
#define MAP_FIXED       0x10            /* Interpret addr exactly */
#define MAP_ANONYMOUS   0x20            /* don't use a file */
#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
# define MAP_UNINITIALIZED 0x4000000    /* For anonymous mmap, memory could be uninitialized */
#else
# define MAP_UNINITIALIZED 0x0          /* Don't support this flag */
#endif
#define MAP_FIXED_NOREPLACE 0x80000     /* MAP_FIXED which doesn't unmap underlying mapping */

/*
 * Flags for mlock
 */
#define MLOCK_ONFAULT   0x01            /* Lock pages in range after they are faulted in, do not prefault */

#define MS_ASYNC        1               /* sync memory asynchronously */
#define MS_INVALIDATE   2               /* invalidate the caches */
#define MS_SYNC         4               /* synchronous memory sync */

#define MADV_NORMAL     0               /* no further special treatment */
#define MADV_RANDOM     1               /* expect random page references */
#define MADV_SEQUENTIAL 2               /* expect sequential page references */
#define MADV_WILLNEED   3               /* will need these pages */
#define MADV_DONTNEED   4               /* don't need these pages */

/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE       8               /* free pages only if memory pressure */
#define MADV_REMOVE     9               /* remove these pages & resources */
#define MADV_DONTFORK   10              /* don't inherit across fork */
#define MADV_DOFORK     11              /* do inherit across fork */
#define MADV_HWPOISON   100             /* poison a page for testing */
#define MADV_SOFT_OFFLINE 101           /* soft offline page for testing */

#define MADV_MERGEABLE   12             /* KSM may merge identical pages */
#define MADV_UNMERGEABLE 13             /* KSM may not merge identical pages */

#define MADV_HUGEPAGE   14              /* Worth backing with hugepages */
#define MADV_NOHUGEPAGE 15              /* Not worth backing with hugepages */

#define MADV_DONTDUMP   16              /* Explicitly exclude from the core dump,
                                           overrides the coredump filter bits */
#define MADV_DODUMP     17              /* Clear the MADV_DONTDUMP flag */

#define MADV_WIPEONFORK 18              /* Zero memory on fork, child only */
#define MADV_KEEPONFORK 19              /* Undo MADV_WIPEONFORK */

/* compatibility flags */
#define MAP_FILE        0

#define PKEY_DISABLE_ACCESS     0x1
#define PKEY_DISABLE_WRITE      0x2
#define PKEY_ACCESS_MASK        (PKEY_DISABLE_ACCESS |\
                                 PKEY_DISABLE_WRITE)

#endif /* __ASM_GENERIC_MMAN_COMMON_H */