linux/mm
Axel Rasmussen 7677f7fd8b userfaultfd: add minor fault registration mode
Patch series "userfaultfd: add minor fault handling", v9.

Overview
========

This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults.  By "minor" fault, I mean the following
situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory).  One of the mappings is registered with userfaultfd (in minor
mode), and the other is not.  Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents.  The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault.  As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...).  In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.

Interaction with Existing APIs
==============================

Because this is a feature, a registered VMA could potentially receive both
missing and minor faults.  I spent some time thinking through how the
existing API interacts with the new feature:

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated.  This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).

Future Work
===========

This series only supports hugetlbfs.  I have a second series in flight to
support shmem as well, extending the functionality.  This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.

This patch (of 6):

This feature allows userspace to intercept "minor" faults.  By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not.  Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents.  The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault.  As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.

This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered.  In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.

This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].

However, doing it this was requires we allocate a VM_* flag for the new
registration mode.  On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.

[1] https://lore.kernel.org/patchwork/patch/1380226/

[peterx@redhat.com: fix minor fault page leak]
  Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com

Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutn" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:22 -07:00
..
kasan kasan: record task_work_add() call stack 2021-04-30 11:20:42 -07:00
kfence kfence: make compatible with kmemleak 2021-03-25 09:22:55 -07:00
backing-dev.c mm/backing-dev.c: use might_alloc() 2021-02-26 09:41:01 -08:00
balloon_compaction.c
cleancache.c
cma_debug.c mm/cma: change cma mutex to irq safe spinlock 2021-05-05 11:27:21 -07:00
cma.c mm/cma: change cma mutex to irq safe spinlock 2021-05-05 11:27:21 -07:00
cma.h mm/cma: change cma mutex to irq safe spinlock 2021-05-05 11:27:21 -07:00
compaction.c mm: make alloc_contig_range handle in-use hugetlb pages 2021-05-05 11:27:22 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm: HUGE_VMAP arch support cleanup 2021-04-30 11:20:40 -07:00
debug.c mm/debug: improve memcg debugging 2021-02-24 13:38:27 -08:00
dmapool.c mm/dmapool: switch from strlcpy to strscpy 2021-04-30 11:20:39 -07:00
early_ioremap.c mm/early_ioremap.c: use __func__ instead of function name 2021-02-26 09:41:02 -08:00
fadvise.c mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED 2020-10-13 18:38:29 -07:00
failslab.c
filemap.c dax: account DAX entries as nrpages 2021-05-05 11:27:19 -07:00
frontswap.c mm/frontswap: mark various intentional data races 2020-08-14 19:56:56 -07:00
gup_test.c mm/gup_test.c: mark gup_test_init as __init function 2020-12-15 12:13:38 -08:00
gup_test.h selftests/vm: gup_test: introduce the dump_pages() sub-test 2020-12-15 12:13:38 -08:00
gup.c mm: gup: remove FOLL_SPLIT 2021-04-30 11:20:37 -07:00
highmem.c mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP 2021-03-25 09:22:55 -07:00
hmm.c mm: do page fault accounting in handle_mm_fault 2020-08-12 10:58:02 -07:00
huge_memory.c mm: huge_memory: debugfs for file-backed THP split 2021-05-05 11:27:21 -07:00
hugetlb_cgroup.c hugetlb: make free_huge_page irq safe 2021-05-05 11:27:22 -07:00
hugetlb.c userfaultfd: add minor fault registration mode 2021-05-05 11:27:22 -07:00
hwpoison-inject.c mm,hwpoison-inject: don't pin for hwpoison_filter 2020-10-16 11:11:16 -07:00
init-mm.c mm/gup: prevent gup_fast from racing with COW during fork 2020-12-15 12:13:39 -08:00
internal.h mm,compaction: let isolate_migratepages_{range,block} return error codes 2021-05-05 11:27:22 -07:00
interval_tree.c mm/interval_tree: add comments to improve code readability 2021-04-30 11:20:38 -07:00
io-mapping.c mm: add a io_mapping_map_user helper 2021-04-30 11:20:39 -07:00
ioremap.c mm: move vmap_range from mm/ioremap.c to mm/vmalloc.c 2021-04-30 11:20:40 -07:00
Kconfig mm: generalize HUGETLB_PAGE_SIZE_VARIABLE 2021-05-05 11:27:20 -07:00
Kconfig.debug mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO 2020-12-15 12:13:46 -08:00
khugepaged.c khugepaged: remove meaningless !pte_present() check in khugepaged_scan_pmd() 2021-05-05 11:27:21 -07:00
kmemleak.c mm/kmemleak.c: fix a typo 2021-04-30 11:20:36 -07:00
ksm.c mm: cleanup kstrto*() usage 2020-12-15 12:13:47 -08:00
list_lru.c mm/list_lru.c: remove kvfree_rcu_local() 2021-02-24 13:38:30 -08:00
maccess.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
madvise.c mm/madvise: replace ptrace attach requirement for process_madvise 2021-03-13 11:27:30 -08:00
Makefile mm: add a io_mapping_map_user helper 2021-04-30 11:20:39 -07:00
mapping_dirty_helpers.c mm/mapping_dirty_helpers: guard hugepage pud's usage 2021-04-16 16:10:37 -07:00
memblock.c memblock: remove return value of memblock_free_all() 2021-02-22 13:01:23 -08:00
memcontrol.c mm: memcontrol: inline __memcg_kmem_{un}charge() into obj_cgroup_{un}charge_pages() 2021-04-30 11:20:38 -07:00
memfd.c
memory_hotplug.c arm64: mte: Map hotplugged memory as Normal Tagged 2021-03-10 10:56:46 +00:00
memory-failure.c mm/memory-failure: unnecessary amount of unmapping 2021-04-30 11:20:44 -07:00
memory.c mm: apply_to_pte_range warn and fail if a large pte is encountered 2021-04-30 11:20:39 -07:00
mempolicy.c mm/mempolicy: fix mpol_misplaced kernel-doc 2021-04-30 11:20:43 -07:00
mempool.c kasan, mm: integrate page_alloc init with HW_TAGS 2021-04-30 11:20:41 -07:00
memremap.c mm/memremap.c: fix improper SPDX comment style 2021-04-30 11:20:37 -07:00
memtest.c
migrate.c mm/page_alloc: combine __alloc_pages and __alloc_pages_nodemask 2021-04-30 11:20:42 -07:00
mincore.c inode: make init and permission helpers idmapped mount aware 2021-01-24 14:27:16 +01:00
mlock.c mm/mlock: stop counting mlocked pages when none vma is found 2021-02-26 09:41:01 -08:00
mm_init.c include/linux/page-flags-layout.h: cleanups 2021-04-30 11:20:42 -07:00
mmap_lock.c mm: mmap_lock: add tracepoints around lock acquisition 2020-12-15 12:13:41 -08:00
mmap.c Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio" 2021-04-30 11:20:39 -07:00
mmu_gather.c mm: eliminate "expecting prototype" kernel-doc warnings 2021-04-16 16:10:36 -07:00
mmu_notifier.c mm/mmu_notifiers: ensure range_end() is paired with range_start() 2021-03-25 09:22:55 -07:00
mmzone.c mm/lru: replace pgdat lru_lock with lruvec lock 2020-12-15 14:48:04 -08:00
mprotect.c mm/mprotect.c: optimize error detection in do_mprotect_pkey() 2021-02-24 13:38:30 -08:00
mremap.c Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio" 2021-04-30 11:20:39 -07:00
msync.c mm/msync: exit early when the flags is an MS_ASYNC and start < vm_start 2021-04-30 11:20:37 -07:00
nommu.c mm/nommu: Fix return type of filemap_map_pages() 2021-01-28 14:10:31 +00:00
oom_kill.c mm: eliminate "expecting prototype" kernel-doc warnings 2021-04-16 16:10:36 -07:00
page_alloc.c mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig 2021-05-05 11:27:22 -07:00
page_counter.c mm: page_counter: mitigate consequences of a page_counter underflow 2021-04-30 11:20:38 -07:00
page_ext.c mm: fix some spelling mistakes in comments 2020-12-15 22:46:19 -08:00
page_idle.c mm: page_idle_get_page() does not need lru_lock 2020-12-15 14:48:03 -08:00
page_io.c swap: fix swapfile read/write offset 2021-03-02 17:25:46 -07:00
page_isolation.c mm/page_isolation: do not isolate the max order page 2020-12-15 12:13:45 -08:00
page_owner.c mm: page_owner: detect page_owner recursion via task_struct 2021-04-30 11:20:36 -07:00
page_poison.c mm: page_poison: print page info when corruption is caught 2021-04-30 11:20:36 -07:00
page_reporting.c mm/page_reporting: use list_entry_is_head() in page_reporting_cycle() 2021-02-24 13:38:30 -08:00
page_reporting.h
page_vma_mapped.c mm/page_vma_mapped.c: add colon to fix kernel-doc markups error for check_pte 2020-12-15 12:13:41 -08:00
page-writeback.c mm: page-writeback: simplify memcg handling in test_clear_page_writeback() 2021-04-30 11:20:37 -07:00
pagewalk.c
percpu-internal.h percpu: make pcpu_nr_empty_pop_pages per chunk type 2021-04-09 13:58:38 +00:00
percpu-km.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-stats.c percpu: make pcpu_nr_empty_pop_pages per chunk type 2021-04-09 13:58:38 +00:00
percpu-vm.c mm/vmalloc: remove unmap_kernel_range 2021-04-30 11:20:40 -07:00
percpu.c percpu: make pcpu_nr_empty_pop_pages per chunk type 2021-04-09 13:58:38 +00:00
pgalloc-track.h mm: move p?d_alloc_track to separate header file 2020-08-07 11:33:26 -07:00
pgtable-generic.c mm/pgtable-generic.c: optimize the VM_BUG_ON condition in pmdp_huge_clear_flush() 2021-02-24 13:38:30 -08:00
process_vm_access.c mm/process_vm_access.c: include compat.h 2021-01-12 18:12:54 -08:00
ptdump.c mm: ptdump: fix build failure 2021-04-16 16:10:37 -07:00
readahead.c mm: Implement readahead_control pageset expansion 2021-04-23 10:14:29 +01:00
rmap.c mm/rmap: correct obsolete comment of page_get_anon_vma() 2021-02-26 09:41:01 -08:00
rodata_test.c mm/rodata_test.c: fix missing function declaration 2020-08-21 09:52:53 -07:00
shmem.c shmem: allow reporting fanotify events with file handles on tmpfs 2021-04-19 16:03:48 +02:00
shuffle.c mm: eliminate "expecting prototype" kernel-doc warnings 2021-04-16 16:10:36 -07:00
shuffle.h mm/shuffle: remove dynamic reconfiguration 2020-08-07 11:33:29 -07:00
slab_common.c mm/slab_common: provide "slab_merge" option for !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT) builds 2021-04-30 11:20:36 -07:00
slab.c kasan, mm: integrate slab init_on_free with HW_TAGS 2021-04-30 11:20:41 -07:00
slab.h kasan, mm: integrate slab init_on_alloc with HW_TAGS 2021-04-30 11:20:41 -07:00
slob.c mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels 2021-03-08 14:18:46 -08:00
slub.c kasan, mm: integrate slab init_on_free with HW_TAGS 2021-04-30 11:20:41 -07:00
sparse-vmemmap.c mm/sparse: only sub-section aligned range would be populated 2020-08-07 11:33:27 -07:00
sparse.c mm/sparse: add the missing sparse_buffer_fini() in error branch 2021-04-30 11:20:39 -07:00
swap_cgroup.c
swap_slots.c mm/swap_slots.c: remove redundant NULL check 2021-02-24 13:38:28 -08:00
swap_state.c mm: stop accounting shadow entries 2021-05-05 11:27:19 -07:00
swap.c mm: remove pagevec_lookup_entries 2021-02-26 09:40:59 -08:00
swapfile.c swap: fix swapfile read/write offset 2021-03-02 17:25:46 -07:00
truncate.c mm: stop accounting shadow entries 2021-05-05 11:27:19 -07:00
usercopy.c mm/usercopy.c: delete duplicated word 2020-08-12 10:57:58 -07:00
userfaultfd.c hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share() 2021-05-05 11:27:20 -07:00
util.c mm: move page_mapping_file to pagemap.h 2021-04-30 11:20:37 -07:00
vmacache.c
vmalloc.c mm/vmalloc: remove an empty line 2021-04-30 11:20:40 -07:00
vmpressure.c
vmscan.c mm: make alloc_contig_range handle in-use hugetlb pages 2021-05-05 11:27:22 -07:00
vmstat.c mm/vmstat.c: erase latency in vmstat_shepherd 2021-02-26 09:41:00 -08:00
workingset.c mm: stop accounting shadow entries 2021-05-05 11:27:19 -07:00
z3fold.c z3fold: prevent reclaim/free race for headless pages 2021-03-25 09:22:55 -07:00
zbud.c mm: set the sleep_mapped to true for zbud and z3fold 2021-02-26 09:41:01 -08:00
zpool.c mm/zswap: add the flag can_sleep_mapped 2021-02-26 09:41:01 -08:00
zsmalloc.c mm/zsmalloc.c: use page_private() to access page->private 2021-02-26 09:41:01 -08:00
zswap.c mm/zswap: add the flag can_sleep_mapped 2021-02-26 09:41:01 -08:00