userfaultfd: update documentation to describe minor fault handling
Reword / reorganize things a little bit into "lists", so new features /
modes / ioctls can sort of just be appended.

Describe how UFFDIO_REGISTER_MODE_MINOR and UFFDIO_CONTINUE can be used to
intercept and resolve minor faults. Make it clear that COPY and ZEROPAGE are
used for MISSING faults, whereas CONTINUE is used for MINOR faults.

Link: https://lkml.kernel.org/r/20210301222728.176417-6-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Adam Ruprecht <ruprecht@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent f619147104
commit b8da5cd4e5
@@ -63,36 +63,36 @@ the generic ioctl available.

The ``uffdio_api.features`` bitmask returned by the ``UFFDIO_API`` ioctl
defines which memory types and registration modes are supported by the
``userfaultfd``, and which events other than page fault notifications may
be generated:

- The ``UFFD_FEATURE_EVENT_*`` flags indicate that events other than page
  faults are supported. These events are described in more detail below in
  the `Non-cooperative userfaultfd`_ section.

- ``UFFD_FEATURE_MISSING_HUGETLBFS`` and ``UFFD_FEATURE_MISSING_SHMEM``
  indicate that the kernel supports ``UFFDIO_REGISTER_MODE_MISSING``
  registrations for hugetlbfs and shared memory (covering all shmem APIs,
  such as tmpfs, ``IPCSHM``, ``/dev/zero``, ``MAP_SHARED`` and
  ``memfd_create``) virtual memory areas, respectively.

- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
  ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
  areas.

The userland application should set the feature flags it intends to use
when invoking the ``UFFDIO_API`` ioctl, to request that those features be
enabled if supported.
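The handshake described above might look like the following minimal C sketch. It assumes Linux uapi headers recent enough (5.13 or later) to define ``UFFD_FEATURE_MINOR_HUGETLBFS``. The helper names ``missing_features`` and ``uffd_open_with_features`` are hypothetical. After the ioctl returns, the sketch also compares the requested bits against the returned ``uffdio_api.features``. This is a cheap extra check that does not depend on how a particular kernel reports an unsupported request.

```c
#define _GNU_SOURCE        /* for syscall() */
#include <fcntl.h>         /* O_CLOEXEC, O_NONBLOCK */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>   /* SYS_userfaultfd */
#include <unistd.h>        /* syscall(), close() */

/* Requested feature bits that the kernel did not report as available. */
static __u64 missing_features(__u64 wanted, __u64 available)
{
	return wanted & ~available;
}

/*
 * Create a userfaultfd and perform the UFFDIO_API handshake, requesting
 * the feature bits in `wanted`. Returns the new fd, or -1 if the syscall
 * or the handshake fails, or if a requested feature is not reported back.
 */
int uffd_open_with_features(__u64 wanted)
{
	struct uffdio_api api = { .api = UFFD_API, .features = wanted };
	int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0)
		return -1;
	if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
	    missing_features(wanted, api.features) != 0) {
		close(uffd);
		return -1;
	}
	/* api.ioctls now lists the ioctls usable on this fd. */
	return uffd;
}
```

For example, a program that depends on minor faults on hugetlbfs could call ``uffd_open_with_features(UFFD_FEATURE_MINOR_HUGETLBFS)``. If that returns -1, the program would fall back to another strategy.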

Once the ``userfaultfd`` API has been enabled, the ``UFFDIO_REGISTER``
ioctl should be invoked (if present in the returned ``uffdio_api.ioctls``
bitmask) to register a memory range in the ``userfaultfd`` by filling in
the ``uffdio_register`` structure accordingly. The ``uffdio_register.mode``
bitmask specifies to the kernel which kinds of faults to track for the
range (for example ``UFFDIO_REGISTER_MODE_MISSING`` or
``UFFDIO_REGISTER_MODE_MINOR``). The ``UFFDIO_REGISTER`` ioctl will return
the ``uffdio_register.ioctls`` bitmask of ioctls that are suitable to
resolve userfaults on the registered range. Not all ioctls will
necessarily be supported for all memory types (e.g. anonymous memory vs.
shmem vs. hugetlbfs) or for all types of intercepted faults.
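The following rough sketch shows how a range might be registered. The helper names are hypothetical, and the code assumes 5.13+ uapi headers for ``UFFDIO_REGISTER_MODE_MINOR`` and ``_UFFDIO_CONTINUE``. The bits of ``uffdio_register.ioctls`` are indexed by the ``_UFFDIO_*`` ioctl numbers, so checking whether a resolving ioctl is offered for the range takes one shift and one mask. The range's start and length are expected to be page aligned (huge-page aligned for hugetlbfs).

```c
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* True if the _UFFDIO_* ioctl number `nr` is set in an `ioctls` bitmask. */
static bool uffd_ioctl_allowed(__u64 ioctls, unsigned int nr)
{
	return (ioctls & ((__u64)1 << nr)) != 0;
}

/* Describe [start, start + len) and the fault modes to track on it. */
static void uffd_prepare_register(struct uffdio_register *reg, void *start,
				  size_t len, __u64 mode)
{
	reg->range.start = (__u64)(uintptr_t)start;
	reg->range.len = len;
	reg->mode = mode;
	reg->ioctls = 0; /* output: filled in by the kernel on success */
}

/*
 * Register a range for MINOR faults. Fails if the kernel rejects the
 * registration or does not offer UFFDIO_CONTINUE for this range.
 */
int uffd_register_minor(int uffd, void *start, size_t len)
{
	struct uffdio_register reg;

	uffd_prepare_register(&reg, start, len, UFFDIO_REGISTER_MODE_MINOR);
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;
	return uffd_ioctl_allowed(reg.ioctls, _UFFDIO_CONTINUE) ? 0 : -1;
}
```

A handler that registers with both ``UFFDIO_REGISTER_MODE_MISSING`` and ``UFFDIO_REGISTER_MODE_MINOR`` would check the same bitmask for ``_UFFDIO_COPY`` or ``_UFFDIO_ZEROPAGE`` before relying on them.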

Userland can use the ``uffdio_register.ioctls`` to manage the virtual
address space in the background (to add or potentially also remove
memory from the ``userfaultfd`` registered range). This means a userfault
could be triggered just before userland maps the user-faulted page in the
background.
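A background population and a fault on the same page can race. When that happens, one of the two ``UFFDIO_COPY`` calls may find the page already populated. The sketch below follows the pattern used in the kernel's userfaultfd selftests: it treats ``-EEXIST`` in ``uffdio_copy.copy`` as a lost race, then issues ``UFFDIO_WAKE`` on the page so that no faulting thread is left waiting. The helper names ``copy_lost_race`` and ``uffd_fill_page`` are hypothetical.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * On failure, UFFDIO_COPY stores the negative errno in the `copy` field.
 * -EEXIST means the page was already populated, e.g. because a background
 * copy and the fault handler raced on the same page.
 */
static bool copy_lost_race(const struct uffdio_copy *copy)
{
	return copy->copy == -EEXIST;
}

/* Populate one page at `dst` from `src`, tolerating a lost race. */
int uffd_fill_page(int uffd, __u64 dst, const void *src, __u64 page_size)
{
	struct uffdio_copy copy = {
		.dst = dst,
		.src = (__u64)(uintptr_t)src,
		.len = page_size,
		.mode = 0, /* wake waiters on success */
	};

	if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
		return 0;
	if (!copy_lost_race(&copy))
		return -1;

	/* The page is mapped already; make sure no faulting thread sleeps on. */
	struct uffdio_range range = { .start = dst, .len = page_size };
	return ioctl(uffd, UFFDIO_WAKE, &range);
}
```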

Resolving Userfaults
--------------------

There are three basic ways to resolve userfaults:

- ``UFFDIO_COPY`` atomically copies existing page contents from a
  userspace buffer.

- ``UFFDIO_ZEROPAGE`` atomically zeroes the new page.

- ``UFFDIO_CONTINUE`` maps an existing, previously populated page.

These operations are atomic in the sense that they guarantee nothing can
see a half-populated page, since readers will keep userfaulting until the
operation has finished.

By default, these ioctls wake up any userfaults blocked on the range in
question. Each supports a ``UFFDIO_*_MODE_DONTWAKE`` ``mode`` flag, which
indicates that waking will be done separately at some later time (for
example with ``UFFDIO_WAKE``).
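The three resolving ioctls take similar argument structures. The sketch below uses hypothetical helper names and assumes 5.13+ uapi headers for ``struct uffdio_continue``. It shows how each structure is filled in, including the ``*_MODE_DONTWAKE`` flag. It also shows how several pages might be resolved without waking anyone, followed by a single ``UFFDIO_WAKE`` over the whole range.

```c
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* UFFDIO_COPY: copy `len` bytes from `src` into the registered range. */
static void prep_copy(struct uffdio_copy *c, __u64 dst, const void *src,
		      __u64 len, bool dontwake)
{
	c->dst = dst;
	c->src = (__u64)(uintptr_t)src;
	c->len = len;
	c->mode = dontwake ? UFFDIO_COPY_MODE_DONTWAKE : 0;
	c->copy = 0; /* output: bytes copied or negative errno */
}

/* UFFDIO_ZEROPAGE: map zero-filled pages over [start, start + len). */
static void prep_zeropage(struct uffdio_zeropage *z, __u64 start, __u64 len,
			  bool dontwake)
{
	z->range.start = start;
	z->range.len = len;
	z->mode = dontwake ? UFFDIO_ZEROPAGE_MODE_DONTWAKE : 0;
	z->zeropage = 0; /* output */
}

/* UFFDIO_CONTINUE: map the pages already present in the page cache. */
static void prep_continue(struct uffdio_continue *c, __u64 start, __u64 len,
			  bool dontwake)
{
	c->range.start = start;
	c->range.len = len;
	c->mode = dontwake ? UFFDIO_CONTINUE_MODE_DONTWAKE : 0;
	c->mapped = 0; /* output */
}

/*
 * Copy `npages` pages one at a time (as if each arrived separately)
 * without waking anyone, then wake all waiters on the range at once.
 */
int copy_pages_then_wake(int uffd, __u64 dst, const char *src,
			 __u64 page_size, unsigned int npages)
{
	struct uffdio_copy c;
	struct uffdio_range range = { .start = dst,
				      .len = page_size * npages };

	for (unsigned int i = 0; i < npages; i++) {
		__u64 off = (__u64)i * page_size;

		prep_copy(&c, dst + off, src + off, page_size, true);
		if (ioctl(uffd, UFFDIO_COPY, &c) < 0)
			return -1;
	}
	return ioctl(uffd, UFFDIO_WAKE, &range);
}
```

With the wake deferred, threads blocked anywhere in the range resume together after the last page is in place. Without it, each thread would resume as soon as its own page was copied.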

Which ioctl to choose depends on the kind of page fault, and what we'd
like to do to resolve it:

- For ``UFFDIO_REGISTER_MODE_MISSING`` faults, the fault needs to be
  resolved by either providing a new page (``UFFDIO_COPY``) or mapping
  the zero page (``UFFDIO_ZEROPAGE``). Without userfaultfd, the kernel
  would resolve a missing fault on anonymous memory with a zero-filled
  page. With userfaultfd, userspace can decide what content to provide
  before the faulting thread continues.

- For ``UFFDIO_REGISTER_MODE_MINOR`` faults, there is an existing page (in
  the page cache). Userspace has the option of modifying the page's
  contents before resolving the fault. Once the contents are correct
  (modified or not), userspace asks the kernel to map the page and let the
  faulting thread continue with ``UFFDIO_CONTINUE``.
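A fault-handling thread can combine the two cases above by inspecting ``uffd_msg.arg.pagefault.flags``. This sketch uses hypothetical names (``choose_action``, ``page_start``, ``handle_one_fault``). It assumes a uffd registered for both MISSING and MINOR faults, and 5.13+ headers for ``UFFD_PAGEFAULT_FLAG_MINOR``. A minor fault is resolved with ``UFFDIO_CONTINUE``, and a missing fault with ``UFFDIO_COPY`` or ``UFFDIO_ZEROPAGE``. The handler reads one message per call, so it assumes the caller has already seen the fd become readable, e.g. via ``poll()``. A real handler should also consult ``uffdio_register.ioctls``, since not every memory type supports every resolving ioctl.

```c
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

enum uffd_action { ACT_IGNORE, ACT_COPY, ACT_ZEROPAGE, ACT_CONTINUE };

/*
 * MINOR faults already have a page in the page cache, so they are
 * resolved with UFFDIO_CONTINUE. MISSING faults need UFFDIO_COPY (when
 * there are contents to supply) or UFFDIO_ZEROPAGE.
 */
static enum uffd_action choose_action(const struct uffd_msg *msg,
				      bool have_contents)
{
	if (msg->event != UFFD_EVENT_PAGEFAULT)
		return ACT_IGNORE;
	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR)
		return ACT_CONTINUE;
	return have_contents ? ACT_COPY : ACT_ZEROPAGE;
}

/* Round an address down to the start of its (power-of-two sized) page. */
static __u64 page_start(__u64 addr, __u64 page_size)
{
	return addr & ~(page_size - 1);
}

/* Read one message and resolve it; `contents` may be NULL. */
int handle_one_fault(int uffd, __u64 page_size, const void *contents)
{
	struct uffd_msg msg;

	if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		return -1;

	__u64 page = page_start(msg.arg.pagefault.address, page_size);

	switch (choose_action(&msg, contents != NULL)) {
	case ACT_CONTINUE: {
		struct uffdio_continue c = {
			.range = { .start = page, .len = page_size } };
		return ioctl(uffd, UFFDIO_CONTINUE, &c);
	}
	case ACT_COPY: {
		struct uffdio_copy c = { .dst = page, .len = page_size,
			.src = (__u64)(uintptr_t)contents };
		return ioctl(uffd, UFFDIO_COPY, &c);
	}
	case ACT_ZEROPAGE: {
		struct uffdio_zeropage z = {
			.range = { .start = page, .len = page_size } };
		return ioctl(uffd, UFFDIO_ZEROPAGE, &z);
	}
	default:
		return 0; /* non-fault events are not handled in this sketch */
	}
}
```

For hugetlbfs ranges, ``page_size`` would be the huge page size rather than the base page size.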

Notes:

- You can tell which kind of fault occurred by examining
  ``pagefault.flags`` within the ``uffd_msg``, checking for the
  ``UFFD_PAGEFAULT_FLAG_*`` flags.

- None of the page-delivering ioctls default to the range that you
  registered with. You must fill in all fields for the appropriate
@@ -122,9 +147,9 @@ Notes:

- You get the address of the access that triggered the page fault event
  from the ``struct uffd_msg`` that you read from the uffd in the
  fault-handling thread. You can supply as many pages as you want with
  these ioctls. Keep in mind that unless you used DONTWAKE, the first of
  any of those ioctls wakes up the faulting thread.

- Be sure to test for all errors including
  (``pollfd[0].revents & POLLERR``). This can happen, e.g. when ranges