=============================
Examining Process Page Tables
=============================

pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in ``/proc``.

There are four components to pagemap:

 * ``/proc/pid/pagemap``.  This file lets a userspace process find out which
   physical frame each virtual page is mapped to.  It contains one 64-bit
   value for each virtual page, containing the following data (from
   ``fs/proc/task_mmu.c``, above pagemap_read):

    * Bits 0-54  page frame number (PFN) if present
    * Bits 0-4   swap type if swapped
    * Bits 5-54  swap offset if swapped
    * Bit  55    pte is soft-dirty (see
      Documentation/admin-guide/mm/soft-dirty.rst)
    * Bit  56    page exclusively mapped (since 4.2)
    * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
      Documentation/admin-guide/mm/userfaultfd.rst)
    * Bits 58-60 zero
    * Bit  61    page is file-page or shared-anon (since 3.5)
    * Bit  62    page swapped
    * Bit  63    page present

   Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
   In 4.0 and 4.1 opens by unprivileged users fail with -EPERM.  Starting from
   4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
   Reason: information about PFNs helps in exploiting the Rowhammer
   vulnerability.

   If the page is not present but in swap, then the PFN contains an
   encoding of the swap file number and the page's offset into the
   swap. Unmapped pages return a null PFN. This allows determining
   precisely which pages are mapped (or in swap) and comparing mapped
   pages between processes.

   Efficient users of this interface will use ``/proc/pid/maps`` to
   determine which areas of memory are actually mapped and llseek to
   skip over unmapped regions.

 * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
   times each page is mapped, indexed by PFN.

The page-types tool in the tools/mm directory can be used to query the
number of times a page is mapped.

 * ``/proc/kpageflags``.  This file contains a 64-bit set of flags for each
   page, indexed by PFN.

   The flags are (from ``fs/proc/page.c``, above kpageflags_read):

    0. LOCKED
    1. ERROR
    2. REFERENCED
    3. UPTODATE
    4. DIRTY
    5. LRU
    6. ACTIVE
    7. SLAB
    8. WRITEBACK
    9. RECLAIM
    10. BUDDY
    11. MMAP
    12. ANON
    13. SWAPCACHE
    14. SWAPBACKED
    15. COMPOUND_HEAD
    16. COMPOUND_TAIL
    17. HUGE
    18. UNEVICTABLE
    19. HWPOISON
    20. NOPAGE
    21. KSM
    22. THP
    23. OFFLINE
    24. ZERO_PAGE
    25. IDLE
    26. PGTABLE

 * ``/proc/kpagecgroup``.  This file contains a 64-bit inode number of the
   memory cgroup each page is charged to, indexed by PFN. Only available when
   CONFIG_MEMCG is set.
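
The bit layout of a ``/proc/pid/pagemap`` entry listed above can be decoded
with plain masks and shifts. The following is a minimal sketch, not kernel
code: the ``EX_PM_*`` macros, ``struct ex_pagemap_entry`` and
``ex_pagemap_decode()`` are hypothetical names introduced for illustration,
and only the bit positions come from the list above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit positions follow the pagemap entry layout. */
#define EX_PM_PFN_MASK        ((UINT64_C(1) << 55) - 1)  /* bits 0-54 */
#define EX_PM_SWAP_TYPE_MASK  UINT64_C(0x1f)             /* bits 0-4  */
#define EX_PM_SOFT_DIRTY      (UINT64_C(1) << 55)
#define EX_PM_EXCLUSIVE       (UINT64_C(1) << 56)
#define EX_PM_UFFD_WP         (UINT64_C(1) << 57)
#define EX_PM_FILE            (UINT64_C(1) << 61)
#define EX_PM_SWAPPED         (UINT64_C(1) << 62)
#define EX_PM_PRESENT         (UINT64_C(1) << 63)

struct ex_pagemap_entry {
	bool present, swapped, file_or_shared_anon;
	bool soft_dirty, exclusive, uffd_wp;
	uint64_t pfn;          /* valid if present (zero without CAP_SYS_ADMIN) */
	uint64_t swap_type;    /* valid if swapped */
	uint64_t swap_offset;  /* valid if swapped */
};

/* Split one raw 64-bit pagemap value into its documented fields. */
struct ex_pagemap_entry ex_pagemap_decode(uint64_t raw)
{
	struct ex_pagemap_entry e = {
		.present = raw & EX_PM_PRESENT,
		.swapped = raw & EX_PM_SWAPPED,
		.file_or_shared_anon = raw & EX_PM_FILE,
		.soft_dirty = raw & EX_PM_SOFT_DIRTY,
		.exclusive = raw & EX_PM_EXCLUSIVE,
		.uffd_wp = raw & EX_PM_UFFD_WP,
	};

	if (e.present) {
		e.pfn = raw & EX_PM_PFN_MASK;
	} else if (e.swapped) {
		/* Bits 0-54 are reused for the swap type and offset. */
		e.swap_type = raw & EX_PM_SWAP_TYPE_MASK;
		e.swap_offset = (raw & EX_PM_PFN_MASK) >> 5;
	}
	return e;
}
```

Keeping the PFN and the swap fields separate in the result avoids
misreading a swap encoding as a frame number.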

Short descriptions of the page flags
====================================

0 - LOCKED
   The page is being locked for exclusive access, e.g. by undergoing read/write
   IO.
7 - SLAB
   The page is managed by the SLAB/SLUB kernel memory allocator.
   When a compound page is used, either allocator sets this flag only on the
   head page.
10 - BUDDY
    A free memory block managed by the buddy system allocator.
    The buddy system organizes free memory in blocks of various orders.
    An order N block has 2^N physically contiguous pages, with the BUDDY flag
    set for and _only_ for the first page.
15 - COMPOUND_HEAD
    A compound page with order N consists of 2^N physically contiguous pages.
    A compound page with order 2 takes the form of "HTTT", where H denotes its
    head page and T denotes its tail page(s).  The major consumers of compound
    pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst),
    the SLUB etc. memory allocators and various device drivers.
    However in this interface, only huge/giga pages are made visible
    to end users.
16 - COMPOUND_TAIL
    A compound page tail (see description above).
17 - HUGE
    This is an integral part of a HugeTLB page.
19 - HWPOISON
    Hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
    No page frame exists at the requested address.
21 - KSM
    Identical memory pages dynamically shared between one or more processes.
22 - THP
    Contiguous pages which construct transparent hugepages.
23 - OFFLINE
    The page is logically offline.
24 - ZERO_PAGE
    Zero page for pfn_zero or huge_zero page.
25 - IDLE
    The page has not been accessed since it was marked idle (see
    Documentation/admin-guide/mm/idle_page_tracking.rst).
    Note that this flag may be stale in case the page was accessed via
    a PTE. To make sure the flag is up-to-date one has to read
    ``/sys/kernel/mm/page_idle/bitmap`` first.
26 - PGTABLE
    The page is in use as a page table.

IO related page flags
---------------------

1 - ERROR
   IO error occurred.
3 - UPTODATE
   The page has up-to-date data.
   i.e. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
   The page has been written to, hence contains new data.
   i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
   The page is being synced to disk.

LRU related page flags
----------------------

5 - LRU
   The page is in one of the LRU lists.
6 - ACTIVE
   The page is in the active LRU list.
18 - UNEVICTABLE
   The page is in the unevictable (non-)LRU list. It is somehow pinned and
   not a candidate for LRU page reclaims, e.g. ramfs pages,
   shmctl(SHM_LOCK) and mlock() memory segments.
2 - REFERENCED
   The page has been referenced since last LRU list enqueue/requeue.
9 - RECLAIM
   The page will be reclaimed soon after its pageout IO completes.
11 - MMAP
   A memory mapped page.
12 - ANON
   A memory mapped page that is not part of a file.
13 - SWAPCACHE
   The page is mapped to swap space, i.e. has an associated swap entry.
14 - SWAPBACKED
   The page is backed by swap/RAM.

The page-types tool in the tools/mm directory can be used to query the
above flags.
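
When page-types is not at hand, a value read from ``/proc/kpageflags`` can be
turned into flag names with a small lookup table. The sketch below is only an
illustration: ``ex_kpf_names`` and ``ex_kpf_describe()`` are hypothetical
names, and the table simply repeats the bit numbers listed in this document.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Flag names indexed by bit number, as listed for /proc/kpageflags. */
static const char *const ex_kpf_names[] = {
	"LOCKED", "ERROR", "REFERENCED", "UPTODATE", "DIRTY", "LRU",
	"ACTIVE", "SLAB", "WRITEBACK", "RECLAIM", "BUDDY", "MMAP",
	"ANON", "SWAPCACHE", "SWAPBACKED", "COMPOUND_HEAD",
	"COMPOUND_TAIL", "HUGE", "UNEVICTABLE", "HWPOISON", "NOPAGE",
	"KSM", "THP", "OFFLINE", "ZERO_PAGE", "IDLE", "PGTABLE",
};

#define EX_KPF_COUNT (sizeof(ex_kpf_names) / sizeof(ex_kpf_names[0]))

/*
 * Write the names of all known flags set in @flags into @buf, separated
 * by '|'. Returns the number of names written; a name that does not fit
 * is dropped and the walk stops there.
 */
int ex_kpf_describe(uint64_t flags, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (size_t bit = 0; bit < EX_KPF_COUNT; bit++) {
		if (!(flags & (UINT64_C(1) << bit)))
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
				 count ? "|" : "", ex_kpf_names[bit]);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the truncated name */
			break;
		}
		used += (size_t)n;
		count++;
	}
	return count;
}
```

For instance, a value with bits 5, 12 and 14 set describes an anonymous,
swap-backed page on an LRU list.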

Using pagemap to do something useful
====================================

The general procedure for using pagemap to find out about a process' memory
usage goes like this:

 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
    mapped to what.
 2. Select the maps you are interested in -- all of them, or a particular
    library, or the stack or the heap, etc.
 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
 4. Read a u64 for each page from pagemap.
 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``.  For each PFN you
    just read, seek to that entry in the file, and read the data you want.

For example, to find the "unique set size" (USS), which is the amount of
memory that a process is using that is not shared with any other process,
you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
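
Steps 3 and 4 of the procedure above can be sketched as follows for the
calling process. This is an illustrative helper, not an existing API:
``ex_read_pagemap_entry()`` is a hypothetical name, and the code assumes
that ``/proc`` is mounted. The entry for a virtual address lives at byte
offset ``(address / page size) * 8`` in the pagemap file.

```c
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64	/* pagemap offsets can exceed 32 bits */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the pagemap entry that covers @vaddr in the
 * calling process. Returns 0 on success and -1 on any failure.
 */
int ex_read_pagemap_entry(const void *vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	if (page_size <= 0)
		return -1;

	int fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;

	/* One 64-bit value per virtual page, indexed by page number. */
	uint64_t page_index = (uint64_t)(uintptr_t)vaddr / (uint64_t)page_size;
	off_t offset = (off_t)(page_index * sizeof(uint64_t));

	ssize_t n = pread(fd, entry, sizeof(*entry), offset);
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

Without CAP_SYS_ADMIN the PFN bits of the returned value are zero, so the
``/proc/kpagecount`` lookup in step 5 only works for privileged callers; the
present, swapped and soft-dirty bits remain usable.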

Exceptions for Shared Memory
============================

Page table entries for shared pages are cleared when the pages are zapped or
swapped out. This makes swapped out pages indistinguishable from never-allocated
ones.

In kernel space, the swap location can still be retrieved from the page cache.
However, values stored only on the normal PTE get lost irretrievably when the
page is swapped out (e.g. SOFT_DIRTY).

In user space, whether the page is present, swapped or none can be deduced with
the help of lseek and/or mincore system calls.

lseek() can differentiate between accessed pages (present or swapped out) and
holes (none/non-allocated) by specifying the SEEK_DATA flag on the file where
the pages are backed. For anonymous shared pages, the file can be found in
``/proc/pid/map_files/``.

mincore() can differentiate between pages in memory (present, including swap
cache) and out of memory (swapped out or none/non-allocated).
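
The mincore() check described above can be sketched like this.
``ex_page_resident()`` is a hypothetical helper name; it queries a single
page and reports only the residency bit, which is bit 0 of each byte that
mincore() fills in.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for mincore() */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the page containing @addr is in
 * memory, 0 if it is not (swapped out or never allocated), and -1 on
 * error, e.g. when the address is not mapped at all.
 */
int ex_page_resident(const void *addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char vec = 0;

	if (page_size <= 0)
		return -1;

	/* mincore() requires a page-aligned start address. */
	uintptr_t page = (uintptr_t)addr & ~((uintptr_t)page_size - 1);

	if (mincore((void *)page, (size_t)page_size, &vec) != 0)
		return -1;
	return vec & 1;
}
```

A result of 0 still does not tell swapped-out pages apart from holes; for
that, lseek() with SEEK_DATA on the backing file is needed.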

Other notes
===========

Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
into the file), or if the size of the read is not a multiple of 8 bytes.

Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 on most architectures). Since Linux 3.11 their meaning changes
after first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.

Pagemap Scan IOCTL
==================

The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get or optionally
clear the info about page table entries. The following operations are supported
in this IOCTL:

- Scan the address range and get the memory ranges matching the provided criteria.
  This is performed when the output buffer is specified.
- Write-protect the pages. The ``PM_SCAN_WP_MATCHING`` flag is used to
  write-protect the pages of interest. The ``PM_SCAN_CHECK_WPASYNC`` flag aborts
  the operation if non-Async Write Protected pages are found.
  ``PM_SCAN_WP_MATCHING`` can be used with or without ``PM_SCAN_CHECK_WPASYNC``.
- Both of those operations can be combined into one atomic operation that gets
  and write-protects the pages.

The following flags about pages are currently supported:

- ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled
- ``PAGE_IS_WRITTEN`` - Page has been written to since it was write protected
- ``PAGE_IS_FILE`` - Page is file backed
- ``PAGE_IS_PRESENT`` - Page is present in memory
- ``PAGE_IS_SWAPPED`` - Page is swapped
- ``PAGE_IS_PFNZERO`` - Page has zero PFN
- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty

The ``struct pm_scan_arg`` is used as the argument of the IOCTL.

 1. The size of the ``struct pm_scan_arg`` must be specified in the ``size``
    field. This field will be helpful in recognizing the structure if extensions
    are done later.
 2. The flags can be specified in the ``flags`` field. ``PM_SCAN_WP_MATCHING``
    and ``PM_SCAN_CHECK_WPASYNC`` are the only flags defined at this time. The
    get operation is performed only if the output buffer is provided.
 3. The range is specified through ``start`` and ``end``.
 4. The walk can abort before visiting the complete range, for example when the
    user buffer becomes full. The address at which the walk ended is returned
    in ``walk_end``.
 5. The output buffer, an array of ``struct page_region``, and its length are
    specified in ``vec`` and ``vec_len``.
 6. The optional maximum number of requested pages is specified in
    ``max_pages``.
 7. The masks are specified in ``category_mask``, ``category_anyof_mask``,
    ``category_inverted`` and ``return_mask``.

Find pages which have been written and write-protect them as well::

   struct pm_scan_arg arg = {
       .size = sizeof(arg),
       .flags = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
       ..
       .category_mask = PAGE_IS_WRITTEN,
       .return_mask = PAGE_IS_WRITTEN,
   };

Find pages which have been written, are file backed, not swapped and either
present or huge::

   struct pm_scan_arg arg = {
       .size = sizeof(arg),
       .flags = 0,
       ..
       .category_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE | PAGE_IS_SWAPPED,
       .category_inverted = PAGE_IS_SWAPPED,
       .category_anyof_mask = PAGE_IS_PRESENT | PAGE_IS_HUGE,
       .return_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE | PAGE_IS_SWAPPED |
                      PAGE_IS_PRESENT | PAGE_IS_HUGE,
   };
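
How the category masks combine can be easier to see in code. The sketch
below models the matching rule in userspace, as understood from the mask
descriptions: bits in ``category_inverted`` flip the sense of a category,
every bit in ``category_mask`` must then match, and at least one bit in
``category_anyof_mask`` must match if that mask is non-zero. The ``EX_``
names and ``ex_page_matches()`` are hypothetical, and the bit values are
assumptions for illustration; real code should use the definitions from
``<linux/fs.h>``.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative category bits (assumed values, see <linux/fs.h>). */
#define EX_PAGE_IS_WRITTEN  (UINT64_C(1) << 1)
#define EX_PAGE_IS_PRESENT  (UINT64_C(1) << 3)
#define EX_PAGE_IS_SWAPPED  (UINT64_C(1) << 4)
#define EX_PAGE_IS_HUGE     (UINT64_C(1) << 6)

/*
 * Model of the selection rule: returns true if a page with the given
 * @categories would be selected by the three masks.
 */
bool ex_page_matches(uint64_t categories, uint64_t category_mask,
		     uint64_t category_anyof_mask,
		     uint64_t category_inverted)
{
	/* An inverted category matches when the page does NOT have it. */
	categories ^= category_inverted;

	/* All categories in category_mask are required. */
	if ((categories & category_mask) != category_mask)
		return false;

	/* If given, at least one category in the any-of mask is required. */
	if (category_anyof_mask && !(categories & category_anyof_mask))
		return false;

	return true;
}
```

With ``category_mask = WRITTEN | SWAPPED``, ``category_inverted = SWAPPED``
and ``category_anyof_mask = PRESENT | HUGE``, a written and present page is
selected, while a written page that is also swapped is not.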

The ``PAGE_IS_WRITTEN`` flag can be considered a better-performing alternative
to the soft-dirty flag. It is not affected by VMA merging in the kernel, so
the user can find the true soft-dirty pages in case of normal pages. (There may
still be extra dirty pages reported for THP or Hugetlb pages.)

The ``PAGE_IS_WRITTEN`` category is used with uffd write protect-enabled ranges
to implement memory dirty tracking in userspace:

 1. The userfaultfd file descriptor is created with the ``userfaultfd`` syscall.
 2. The ``UFFD_FEATURE_WP_UNPOPULATED`` and ``UFFD_FEATURE_WP_ASYNC`` features
    are set by the ``UFFDIO_API`` IOCTL.
 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
    through the ``UFFDIO_REGISTER`` IOCTL.
 4. Then any part of the registered memory or the whole memory region must
    be write protected, using either the ``PAGEMAP_SCAN`` IOCTL with the
    ``PM_SCAN_WP_MATCHING`` flag or the ``UFFDIO_WRITEPROTECT`` IOCTL. Both
    perform the same operation. The former is better in terms of performance.
 5. Now the ``PAGEMAP_SCAN`` IOCTL can be used to find pages which have been
    written to since they were last marked and, optionally, to write protect
    those pages again in the same call.
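
The five steps above can be sketched end to end as follows. This is a
minimal illustration under stated assumptions, not a reference
implementation: ``ex_count_written_pages()`` is a hypothetical helper, the
four-page anonymous mapping is arbitrary, and the code assumes UAPI headers
and a kernel that provide ``PAGEMAP_SCAN`` and ``UFFD_FEATURE_WP_ASYNC``;
otherwise it returns -1. ``UFFD_USER_MODE_ONLY`` is passed so that the
descriptor can be created without privileges.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical demo: dirty one page of a four-page mapping and return how
 * many pages PAGEMAP_SCAN reports as written, or -1 if any step fails.
 */
long ex_count_written_pages(void)
{
	long written = -1;
#if defined(PAGEMAP_SCAN) && defined(UFFD_FEATURE_WP_ASYNC)
	long psz = sysconf(_SC_PAGESIZE);
	if (psz <= 0)
		return -1;
	size_t len = 4 * (size_t)psz;
	char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_UNPOPULATED | UFFD_FEATURE_WP_ASYNC,
	};
	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)mem, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct page_region regions[4];
	struct pm_scan_arg arg = {	/* all masks zero: match every page */
		.size = sizeof(arg),
		.flags = PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC,
		.start = (uintptr_t)mem,
		.end = (uintptr_t)mem + len,
	};
	long n;

	/* Step 1: create the userfaultfd descriptor. */
	int uffd = (int)syscall(SYS_userfaultfd,
				O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	int pm = open("/proc/self/pagemap", O_RDONLY);
	if (uffd < 0 || pm < 0)
		goto out;

	/* Steps 2 and 3: enable async WP and register the range. */
	if (ioctl(uffd, UFFDIO_API, &api) || ioctl(uffd, UFFDIO_REGISTER, &reg))
		goto out;

	/* Step 4: write-protect the whole range (no output buffer). */
	if (ioctl(pm, PAGEMAP_SCAN, &arg) < 0)
		goto out;

	*(volatile char *)(mem + psz) = 1;	/* dirty the second page */

	/* Step 5: collect written pages and write-protect them again. */
	arg.vec = (uintptr_t)regions;
	arg.vec_len = 4;
	arg.category_mask = PAGE_IS_WRITTEN;
	arg.return_mask = PAGE_IS_WRITTEN;
	n = ioctl(pm, PAGEMAP_SCAN, &arg);	/* number of regions filled */
	if (n < 0)
		goto out;

	written = 0;
	for (long i = 0; i < n; i++)
		written += (long)((regions[i].end - regions[i].start) /
				  (uint64_t)psz);
out:
	if (pm >= 0)
		close(pm);
	if (uffd >= 0)
		close(uffd);
	munmap(mem, len);
#endif
	return written;
}
```

Because step 5 passes ``PM_SCAN_WP_MATCHING`` together with an output
buffer, the reported pages are write-protected again in the same call, so a
later scan reports only pages written after this one.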