linux/mm
Andrea Arcangeli adef440691 userfaultfd: UFFDIO_MOVE uABI
Implement the uABI of UFFDIO_MOVE ioctl.
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [2].

We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread's completion time by using UFFDIO_MOVE vs.  UFFDIO_COPY.  This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads). 
More details of the usecase are explained in [2].  Furthermore,
UFFDIO_MOVE enables moving swapped-out pages without touching them within
the same vma.  Today, it can only be done by mremap, however it forces
splitting the vma.

[1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
[2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/

Update for the ioctl_userfaultfd(2)  manpage:

   UFFDIO_MOVE
       (Since Linux xxx)  Move a continuous memory chunk into the
       userfault registered range and optionally wake up the blocked
       thread. The source and destination addresses and the number of
       bytes to move are specified by the src, dst, and len fields of
       the uffdio_move structure pointed to by argp:

           struct uffdio_move {
               __u64 dst;    /* Destination of move */
               __u64 src;    /* Source of move */
               __u64 len;    /* Number of bytes to move */
               __u64 mode;   /* Flags controlling behavior of move */
               __s64 move;   /* Number of bytes moved, or negated error */
           };

       The following value may be bitwise ORed in mode to change the
       behavior of the UFFDIO_MOVE operation:

       UFFDIO_MOVE_MODE_DONTWAKE
              Do not wake up the thread that waits for page-fault
              resolution

       UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
              Allow holes in the source virtual range that is being moved.
              When not specified, the holes will result in ENOENT error.
              When specified, the holes will be accounted as successfully
              moved memory. This is mostly useful to move hugepage aligned
              virtual regions without knowing if there are transparent
              hugepages in the regions or not, but preventing the risk of
              having to split the hugepage during the operation.

       The move field is used by the kernel to return the number of
       bytes that was actually moved, or an error (a negated errno-
       style value).  If the value returned in move doesn't match the
       value that was specified in len, the operation fails with the
       error EAGAIN.  The move field is output-only; it is not read by
       the UFFDIO_MOVE operation.

       The operation may fail for various reasons. Usually, remapping of
       pages that are not exclusive to the given process fail; once KSM
       might deduplicate pages or fork() COW-shares pages during fork()
       with child processes, they are no longer exclusive. Further, the
       kernel might only perform lightweight checks for detecting whether
       the pages are exclusive, and return -EBUSY in case that check fails.
       To make the operation more likely to succeed, KSM should be
       disabled, fork() should be avoided or MADV_DONTFORK should be
       configured for the source VMA before fork().

       This ioctl(2) operation returns 0 on success.  In this case, the
       entire area was moved.  On error, -1 is returned and errno is
       set to indicate the error.  Possible errors include:

       EAGAIN The number of bytes moved (i.e., the value returned in
              the move field) does not equal the value that was
              specified in the len field.

       EINVAL Either dst or len was not a multiple of the system page
              size, or the range specified by src and len or dst and len
              was invalid.

       EINVAL An invalid bit was specified in the mode field.

       ENOENT
              The source virtual memory range has unmapped holes and
              UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.

       EEXIST
              The destination virtual memory range is fully or partially
              mapped.

       EBUSY
              The pages in the source virtual memory range are either
              pinned or not exclusive to the process. The kernel might
              only perform lightweight checks for detecting whether the
              pages are exclusive. To make the operation more likely to
              succeed, KSM should be disabled, fork() should be avoided
              or MADV_DONTFORK should be configured for the source virtual
              memory area before fork().

       ENOMEM Allocating memory needed for the operation failed.

       ESRCH
              The target process has exited at the time of a UFFDIO_MOVE
              operation.

Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 11:58:24 -08:00
..
damon mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zero 2023-12-20 14:48:13 -08:00
kasan sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
kfence LoongArch changes for v6.6 2023-09-08 12:16:52 -07:00
kmsan kmsan: use stack_depot_save instead of __stack_depot_save 2023-12-10 16:51:46 -08:00
backing-dev.c writeback: remove redundant checks for root memcg 2023-08-21 13:37:48 -07:00
balloon_compaction.c
bootmem_info.c bootmem: use kmemleak_free_part_phys in put_page_bootmem 2023-10-25 16:47:13 -07:00
cma_debug.c
cma_sysfs.c mm: cma: make kobj_type structure constant 2023-03-28 16:20:06 -07:00
cma.c mm: cma: remove unnecessary initialization of ret 2023-12-12 10:57:08 -08:00
cma.h
compaction.c mm: compaction: avoid fast_isolate_freepages blindly choose improper pageblock 2023-12-12 10:57:08 -08:00
debug_page_alloc.c mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm: fix multiple typos in multiple files 2023-10-25 16:47:14 -07:00
debug.c mm: update validate_mm() to use vma iterator 2023-06-09 16:25:31 -07:00
dmapool_test.c dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
dmapool.c dmapool: create/destroy cleanup 2023-06-09 16:25:17 -07:00
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
fadvise.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
fail_page_alloc.c mm: page_alloc: split out FAIL_PAGE_ALLOC 2023-06-09 16:25:23 -07:00
failslab.c mm: fix unexpected changes to {failslab|fail_page_alloc}.attr 2022-11-22 18:50:44 -08:00
filemap.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
folio-compat.c mm: return void from folio_start_writeback() and related functions 2023-12-10 16:51:37 -08:00
gup_test.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
gup_test.h mm/gup_test: start/stop/read functionality for PIN LONGTERM test 2022-11-08 17:37:15 -08:00
gup.c mm/gup: fix follow_devmap_p[mu]d() on page==NULL handling 2023-12-10 16:51:52 -08:00
highmem.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hmm.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
huge_memory.c userfaultfd: UFFDIO_MOVE uABI 2023-12-29 11:58:24 -08:00
hugetlb_cgroup.c mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER 2023-10-18 14:34:17 -07:00
hugetlb_vmemmap.c mm: hugetlb_vmemmap: move mmap lock to vmemmap_remap_range() 2023-12-12 10:57:08 -08:00
hugetlb_vmemmap.h mm: hugetlb_vmemmap: fix reference to nonexistent file 2023-10-25 16:47:14 -07:00
hugetlb.c hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write 2023-12-06 16:12:43 -08:00
hwpoison-inject.c
init-mm.c mm: move dummy_vm_ops out of a header 2023-08-21 13:37:46 -07:00
internal.h mm: use vma_pages() for vma objects 2023-12-12 10:57:08 -08:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed 2023-08-18 10:12:36 -07:00
Kconfig mm/thp: add CONFIG_TRANSPARENT_HUGEPAGE_NEVER option 2023-12-12 10:57:07 -08:00
Kconfig.debug mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-05-29 16:14:28 +01:00
khugepaged.c userfaultfd: UFFDIO_MOVE uABI 2023-12-29 11:58:24 -08:00
kmemleak.c kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers 2023-12-12 10:57:07 -08:00
ksm.c mm: ksm: remove unnecessary try_to_freeze() 2023-12-20 14:48:13 -08:00
list_lru.c mm/list_lru.c: remove unused list_lru_from_kmem() 2023-12-20 14:48:11 -08:00
maccess.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
madvise.c mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range() 2023-12-06 16:12:50 -08:00
Makefile mm: vmscan: move shrinker-related code into a separate file 2023-10-04 10:32:23 -07:00
mapping_dirty_helpers.c mm: fix clean_record_shared_mapping_range kernel-doc 2023-08-24 16:20:30 -07:00
memblock.c NUMA: optimize detection of memory with no node id assigned by firmware 2023-12-10 16:51:34 -08:00
memcontrol.c mm: memcg: restore subtree stats flushing 2023-12-20 14:48:11 -08:00
memfd.c memfd: drop warning for missing exec-related flags 2023-10-04 10:32:22 -07:00
memory_hotplug.c mm/memory_hotplug: split memmap_on_memory requests across memblocks 2023-12-10 16:51:34 -08:00
memory-failure.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
memory-tiers.c dax, kmem: calculate abstract distance with general interface 2023-10-16 15:44:39 -07:00
memory.c mm/memory: replace kmap() with kmap_local_page() 2023-12-20 14:48:14 -08:00
mempolicy.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
mempool.c mm/mempool: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:49 -08:00
memremap.c mm: use vmem_altmap code without CONFIG_ZONE_DEVICE 2023-12-10 16:51:48 -08:00
memtest.c mm: memtest: convert to memtest_report_meminfo() 2023-08-21 13:37:47 -07:00
migrate_device.c Add x86 shadow stack support 2023-08-31 12:20:12 -07:00
migrate.c mm: migrate high-order folios in swap cache correctly 2023-12-20 13:46:19 -08:00
mincore.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
mlock.c mm: mlock: avoid folio_within_range() on KSM pages 2023-10-25 16:47:14 -07:00
mm_init.c mm/mm_init.c: append newline to the unavailable ranges log-message 2023-12-10 16:51:51 -08:00
mm_slot.h
mmap_lock.c
mmap.c mmap: remove the IA64-specific vma expansion implementation 2023-12-10 16:51:39 -08:00
mmu_gather.c mm: fix kernel-doc warning from tlb_flush_rmaps() 2023-08-24 16:20:30 -07:00
mmu_notifier.c mmu_notifiers: rename invalidate_range notifier 2023-08-18 10:12:41 -07:00
mmzone.c zswap: shrink zswap pool based on memory pressure 2023-12-12 10:57:02 -08:00
mprotect.c mm: mprotect: use a folio in change_pte_range() 2023-10-25 16:47:12 -07:00
mremap.c mm: abstract VMA merge and extend into vma_merge_extend() helper 2023-10-18 14:34:18 -07:00
msync.c
nommu.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
oom_kill.c mm, oom:dump_tasks add rss detailed information printing 2023-12-10 16:51:53 -08:00
page_alloc.c mm: page_alloc: simplify __free_pages_ok() 2023-12-20 14:48:14 -08:00
page_counter.c
page_ext.c mm/page_ext: move functions around for minor cleanups to page_ext 2023-08-18 10:12:31 -07:00
page_idle.c mm: page_idle: convert page idle to use a folio 2023-01-18 17:12:52 -08:00
page_io.c mm: memcg: add THP swap out info for anonymous reclaim 2023-10-04 10:32:27 -07:00
page_isolation.c mm/hugetlb: get rid of page_hstate() 2023-08-18 10:12:39 -07:00
page_owner.c mm/page_owner: record and dump free_pid and free_tgid 2023-12-10 16:51:40 -08:00
page_poison.c mm/page_poison: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:50 -08:00
page_reporting.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_reporting.h
page_table_check.c mm: convert page_table_check_pte_set() to page_table_check_ptes_set() 2023-08-24 16:20:18 -07:00
page_vma_mapped.c mm: thp: introduce multi-size THP sysfs interface 2023-12-20 14:48:12 -08:00
page-writeback.c mm: return void from folio_start_writeback() and related functions 2023-12-10 16:51:37 -08:00
pagewalk.c mm: pagewalk: assert write mmap lock only for walking the user page tables 2023-12-10 16:51:53 -08:00
percpu-internal.h percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing 2023-06-19 16:19:29 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
pgalloc-track.h
pgtable-generic.c mm/pgtable: notes on pte_offset_map[_lock]() 2023-08-18 10:12:25 -07:00
process_vm_access.c mm: fix process_vm_rw page counts 2023-12-10 16:51:39 -08:00
ptdump.c mm: ptdump should use ptep_get_lockless() 2023-06-19 16:19:24 -07:00
readahead.c mm/readahead: do not allow order-1 folio 2023-12-12 10:57:06 -08:00
rmap.c userfaultfd: UFFDIO_MOVE uABI 2023-12-29 11:58:24 -08:00
rodata_test.c
secretmem.c mm/secretmem: use a folio in secretmem_fault() 2023-08-21 13:38:02 -07:00
shmem_quota.c shmem: Add default quota limit mount options 2023-08-09 09:15:40 +02:00
shmem.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon changes 2023-12-20 14:47:18 -08:00
show_mem.c mm: refactor si_mem_available() 2023-10-04 10:32:19 -07:00
shrinker_debug.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shrinker.c mm: shrinker: convert shrinker_rwsem to mutex 2023-10-04 10:32:26 -07:00
shuffle.c
shuffle.h mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab_common.c RCU pull request for v6.7 2023-10-30 18:01:41 -10:00
slab.c Randomized slab caches for kmalloc() 2023-07-18 10:07:47 +02:00
slab.h mm: kmem: scoped objcg protection 2023-10-25 16:47:11 -07:00
slub.c slub, kasan: improve interaction of KASAN and slub_debug poisoning 2023-12-10 16:51:48 -08:00
sparse-vmemmap.c mm/vmemmap: allow architectures to override how vmemmap optimization works 2023-08-18 10:12:53 -07:00
sparse.c mm/sparse: remove redundant judgments from macro for_each_present_section_nr 2023-08-18 10:12:14 -07:00
swap_cgroup.c
swap_slots.c
swap_state.c zswap: shrink zswap pool based on memory pressure 2023-12-12 10:57:02 -08:00
swap.c mm: remove references to pagevec 2023-06-23 16:59:30 -07:00
swap.h zswap: make shrinking memcg-aware 2023-12-12 10:57:01 -08:00
swapfile.c mm/swapfile: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:53 -08:00
truncate.c fs: convert error_remove_page to error_remove_folio 2023-12-10 16:51:42 -08:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
userfaultfd.c userfaultfd: UFFDIO_MOVE uABI 2023-12-29 11:58:24 -08:00
util.c mm/util: use kmap_local_page() in memcmp_pages() 2023-12-10 16:51:49 -08:00
vmalloc.c mm/vmalloc: fix the unchecked dereference warning in vread_iter() 2023-11-01 12:38:35 -07:00
vmpressure.c net-memcg: Fix scope of sockmem pressure indicators 2023-08-16 12:21:32 +01:00
vmscan.c mm: memcg: restore subtree stats flushing 2023-12-20 14:48:11 -08:00
vmstat.c mm: memcg: add per-memcg zswap writeback stat 2023-12-12 10:57:02 -08:00
workingset.c mm: memcg: restore subtree stats flushing 2023-12-20 14:48:11 -08:00
z3fold.c mm/z3fold: remove obsolete comment for struct z3fold_pool 2023-08-21 13:37:51 -07:00
zbud.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zpool.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zsmalloc.c zsmalloc: use copy_page for full page copy 2023-10-18 14:34:16 -07:00
zswap.c mm: memcg: restore subtree stats flushing 2023-12-20 14:48:11 -08:00