Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It is apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially converted to maple trees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect use-of-uninitialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěna.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added additional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
Commit 27bc50fc90 by Linus Torvalds, 2022-10-10 17:53:04 -07:00
409 changed files with 65691 additions and 7933 deletions


@ -0,0 +1,25 @@
What: /sys/devices/virtual/memory_tiering/
Date: August 2022
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: A collection of all the memory tiers allocated.
Individual memory tier details are contained in subdirectories
named by the abstract distance of the memory tier.
/sys/devices/virtual/memory_tiering/memory_tierN/
What: /sys/devices/virtual/memory_tiering/memory_tierN/
/sys/devices/virtual/memory_tiering/memory_tierN/nodes
Date: August 2022
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Directory with details of a specific memory tier
This is the directory containing information about a particular
memory tier, memtierN, where N is derived based on abstract distance.
A smaller value of N implies a higher (faster) memory tier in the
hierarchy.
nodes: NUMA nodes that are part of this memory tier.
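For illustration, a userspace program could enumerate the tiers and their node
lists as in the rough sketch below; only the sysfs layout described above is
taken from the ABI text, everything else is an assumption::

    /* Rough sketch: list each memory tier and the NUMA nodes it contains. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            const char *base = "/sys/devices/virtual/memory_tiering";
            struct dirent *de;
            DIR *dir = opendir(base);

            if (!dir)
                    return 1;
            while ((de = readdir(dir)) != NULL) {
                    char path[512], nodes[256];
                    FILE *f;

                    if (strncmp(de->d_name, "memory_tier", 11) != 0)
                            continue;
                    snprintf(path, sizeof(path), "%s/%s/nodes", base, de->d_name);
                    f = fopen(path, "r");
                    if (!f)
                            continue;
                    if (fgets(nodes, sizeof(nodes), f))
                            printf("%s: nodes %s", de->d_name, nodes);
                    fclose(f);
            }
            closedir(dir);
            return 0;
    }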


@ -13,7 +13,7 @@ a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages
d) memory reclaim
e) thrashing page cache
e) thrashing
f) direct compact
g) write-protect copy


@ -299,7 +299,7 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
2.7 Kernel Memory Extension
-----------------------------------------------
With the Kernel memory extension, the Memory Controller is able to limit
@ -386,8 +386,6 @@ U != 0, K >= U:
a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------


@ -1469,6 +1469,14 @@
Permit 'security.evm' to be updated regardless of
current integrity status.
early_page_ext [KNL] Enforces page_ext initialization to earlier
stages so as to cover more early boot allocations.
Please note that as side effect some optimizations
might be disabled to achieve that (e.g. parallelized
memory initialization is disabled) so the boot process
might take longer, especially on systems with a lot of
memory. Available with CONFIG_PAGE_EXTENSION=y.
failslab=
fail_usercopy=
fail_page_alloc=
@ -6041,12 +6049,6 @@
This parameter controls use of the Protected
Execution Facility on pSeries.
swapaccount= [KNL]
Format: [0|1]
Enable accounting of swap in memory resource
controller if no parameter or 1 is given or disable
it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
Format: { <int> [,<int>] | force | noforce }
<int> -- Number of I/O TLB slabs


@ -5,10 +5,10 @@ CMA Debugfs Interface
The CMA debugfs interface is useful to retrieve basic information out of the
different CMA areas and to test allocation/release in each of the areas.
Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
kernel's CMA index. So the first CMA zone would be:
Each CMA area represents a directory under <debugfs>/cma/, represented by
its CMA name like below:
<debugfs>/cma/cma-0
<debugfs>/cma/<cma_name>
The structure of the files created under that directory is as follows:
@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows:
- [RO] bitmap: The bitmap of page states in the zone.
- [WO] alloc: Allocate N pages from that CMA area. For example::
echo 5 > <debugfs>/cma/cma-2/alloc
echo 5 > <debugfs>/cma/<cma_name>/alloc
would try to allocate 5 pages from the cma-2 area.
would try to allocate 5 pages from the 'cma_name' area.
- [WO] free: Free N pages from that CMA area, similar to the above.


@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
========================
Monitoring Data Accesses
========================
==========================
DAMON: Data Access MONitor
==========================
:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring.
Using DAMON, users can analyze the memory access patterns of their systems and


@ -29,16 +29,9 @@ called DAMON Operator (DAMO). It is available at
https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
your ``$PATH``. It's not mandatory, though.
Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
below::
# mount -t debugfs none /sys/kernel/debug/
or append the following line to your ``/etc/fstab`` file so that your system
can automatically mount debugfs upon booting::
debugfs /sys/kernel/debug debugfs defaults 0 0
Because DAMO is using the sysfs interface (refer to :doc:`usage` for the
detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
mounted.
Recording Data Access Patterns


@ -393,6 +393,11 @@ the files as above. Above is only for an example.
debugfs Interface
=================
.. note::
DAMON debugfs interface will be removed after next LTS kernel is released, so
users should move to the :ref:`sysfs interface <sysfs_interface>`.
DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.


@ -32,6 +32,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
multigen_lru
nommu-mmap
numa_memory_policy
numaperf


@ -184,6 +184,42 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
Monitoring KSM profit
=====================
KSM can save memory by merging identical pages, but it can also consume
additional memory, because it needs to generate a number of rmap_items to
save each scanned page's brief rmap information. Some of these pages may
be merged, but some may never be merged even after being checked several
times; the memory consumed for those is an unprofitable cost.
1) How can one determine whether KSM saves or consumes memory system-wide?
Here is a simple approximate calculation for reference::
general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
where all_rmap_items can be easily obtained by summing ``pages_sharing``,
``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
2) The KSM profit within a single process can be obtained by a similar
approximate calculation::
process_profit =~ ksm_merging_pages * sizeof(page) -
ksm_rmap_items * sizeof(rmap_item).
where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.
From the application's perspective, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a poorly chosen madvise policy, so developers or
administrators have to rethink it. For example, a page is usually 4K and an
rmap_item is 32 bytes on 32-bit CPU architectures and 64 bytes on 64-bit CPU
architectures; so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64
on a 64-bit CPU, or 128 on a 32-bit CPU, the app's madvise policy should be
dropped, because the KSM profit is approximately zero or negative.
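The system-wide formula can be evaluated from userspace. Below is a rough
sketch, assuming 4K pages, 64-byte rmap_items (a 64-bit kernel) and the usual
counter files under ``/sys/kernel/mm/ksm/``::

    /* Rough sketch of the system-wide KSM profit estimate described above. */
    #include <stdio.h>

    static long read_ksm_counter(const char *name)
    {
            char path[128];
            long val = 0;
            FILE *f;

            snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
            f = fopen(path, "r");
            if (f) {
                    if (fscanf(f, "%ld", &val) != 1)
                            val = 0;
                    fclose(f);
            }
            return val;
    }

    int main(void)
    {
            long sharing = read_ksm_counter("pages_sharing");
            long all_rmap_items = sharing + read_ksm_counter("pages_shared") +
                                  read_ksm_counter("pages_unshared") +
                                  read_ksm_counter("pages_volatile");
            /* general_profit =~ pages_sharing * sizeof(page) -
             *                   all_rmap_items * sizeof(rmap_item) */
            long profit = sharing * 4096 - all_rmap_items * 64;

            printf("approximate KSM general profit: %ld bytes\n", profit);
            return 0;
    }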
Monitoring KSM events
=====================


@ -0,0 +1,162 @@
.. SPDX-License-Identifier: GPL-2.0
=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.
Quick start
===========
Build the kernel with the following configurations.
* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``
All set!
Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.
Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.
====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
batches, when MMU sets it (e.g., on x86). This behavior can
theoretically worsen lock contention (mmap_lock). If it is
disabled, the multi-gen LRU will suffer a minor performance
degradation for workloads that contiguously map hot pages,
whose accessed bits can be otherwise cleared by fewer larger
batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
well, when MMU sets it (e.g., on x86). This behavior was not
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================
E.g.,
::
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.
Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.
Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.
The default value ``0`` means disabled.
Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with the delimiters ``,`` and ``;``.
``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.
Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilizations.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.
When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::
memcg memcg_id memcg_path
node node_id
min_gen_nr age_in_ms nr_anon_pages nr_file_pages
...
max_gen_nr age_in_ms nr_anon_pages nr_file_pages
Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.
Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:
``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.
A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.
Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.
Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.
``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.
A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of the overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impacts on the
existing jobs.
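A rough sketch of how such a job scheduler might drive both commands is shown
below; the memcg ID, node ID, generation numbers, swappiness and page count are
made-up values::

    /* Sketch: issue the lru_gen aging and eviction commands described above. */
    #include <stdio.h>

    static int lru_gen_cmd(const char *cmd)
    {
            FILE *f = fopen("/sys/kernel/debug/lru_gen", "w");

            if (!f)
                    return -1;
            fprintf(f, "%s\n", cmd);
            return fclose(f);
    }

    int main(void)
    {
            /* Aging: create a new generation for memcg 1, node 0, with
             * can_swap=1 and force_scan=1 (full scan, no overhead-reducing
             * heuristics); 3 stands in for the current max_gen_nr. */
            lru_gen_cmd("+ 1 0 3 1 1");

            /* Eviction: evict generations <= 1 from memcg 1, node 0, with
             * swappiness 20 and at most 1000 pages reclaimed. */
            lru_gen_cmd("- 1 0 1 20 1000");
            return 0;
    }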


@ -191,7 +191,14 @@ allocation failure to throttle the next allocation attempt::
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
The khugepaged progress can be seen in the number of pages collapsed::
The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
one 2M hugepage. Each may happen independently, or together, depending on
the type of memory and the failures that occur. As such, this value should
be interpreted roughly as a sign of progress, and counters in /proc/vmstat
consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
@ -366,10 +373,9 @@ thp_split_pmd
page table entry.
thp_zero_page_alloc
is incremented every time a huge zero page is
successfully allocated. It includes allocations which where
dropped due race with other allocation. Note, it doesn't count
every map of the huge zero page, only its allocation.
is incremented every time a huge zero page used for thp is
successfully allocated. Note, it doesn't count every map of
the huge zero page, only its allocation.
thp_zero_page_alloc_failed
is incremented if kernel fails to allocate


@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
Design
======
Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
Userspace creates a new userfaultfd, initializes it, and registers one or more
regions of virtual memory with it. Then, any page faults which occur within the
region(s) result in a message being delivered to the userfaultfd, notifying
userspace of the fault.
The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing).
Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.
The ``userfaultfd`` once opened by invoking the syscall, can also be
The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``).
API
===
Creating a userfaultfd
----------------------
There are two ways to create a new userfaultfd, each of which provides ways to
restrict access to this functionality (since historically userfaultfds which
handle kernel page faults have been a useful tool for exploiting the kernel).
The first way, supported since userfaultfd was introduced, is the
userfaultfd(2) syscall. Access to this is controlled in several ways:
- Any user can always create a userfaultfd which traps userspace page faults
only. Such a userfaultfd can be created using the userfaultfd(2) syscall
with the flag UFFD_USER_MODE_ONLY.
- In order to also trap kernel page faults for the address space, either the
process needs the CAP_SYS_PTRACE capability, or the system must have
vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
is set to 0.
The second way, added to the kernel more recently, is by opening
/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
yields equivalent userfaultfds to the userfaultfd(2) syscall.
Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
filesystem permissions (user/group/mode), which gives fine grained access to
userfaultfd specifically, without also granting other unrelated privileges at
the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
vm.unprivileged_userfaultfd is not considered.
Initializing a userfaultfd
--------------------------
When first opened, the ``userfaultfd`` must be enabled by invoking the
``UFFDIO_API`` ioctl with a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version), which will specify the ``read/POLLIN`` protocol
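A hedged sketch of the two creation paths and the ``UFFDIO_API`` handshake
described above might look as follows; ``new_uffd()`` is an invented helper and
error handling is kept to the minimum::

    /* Sketch: create a userfaultfd via /dev/userfaultfd if available,
     * otherwise via the syscall (userspace faults only), then enable it. */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    static int new_uffd(void)
    {
            struct uffdio_api api = { .api = UFFD_API };
            int dev, fd;

            /* One path: /dev/userfaultfd, gated by filesystem permissions. */
            dev = open("/dev/userfaultfd", O_RDWR | O_CLOEXEC);
            if (dev >= 0) {
                    /* The ioctl argument carries the open flags for the new fd. */
                    fd = ioctl(dev, USERFAULTFD_IOC_NEW, O_CLOEXEC);
                    close(dev);
            } else {
                    /* Fallback: the syscall, trapping userspace faults only. */
                    fd = syscall(__NR_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
            }
            if (fd < 0)
                    return -1;

            /* Enable the userfaultfd by negotiating the API version. */
            if (ioctl(fd, UFFDIO_API, &api) < 0) {
                    close(fd);
                    return -1;
            }
            return fd;
    }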


@ -635,6 +635,17 @@ different types of memory (represented as different NUMA nodes) to
place the hot pages in the fast memory. This is implemented based on
unmapping and page fault too.
numa_balancing_promote_rate_limit_MBps
======================================
Too high promotion/demotion throughput between different memory types
may hurt application latency. This can be used to rate limit the
promotion throughput. The per-node max promotion throughput in MB/s
will be limited to be no more than the set value.
A rule of thumb is to set this to less than 1/10 of the PMEM node
write bandwidth.
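For example, the limit can be set from a program (or with ``sysctl -w``) by
writing to the corresponding ``/proc/sys/kernel/`` file; the 200 MB/s value
below is an arbitrary example::

    /* Sketch: cap per-node NUMA balancing promotion throughput at 200 MB/s. */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/numa_balancing_promote_rate_limit_MBps", "w");

            if (!f)
                    return 1;
            fprintf(f, "200\n");
            return fclose(f) ? 1 : 0;
    }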
oops_all_cpu_backtrace
======================


@ -926,6 +926,9 @@ calls without any restrictions.
The default value is 0.
Another way to control permissions for userfaultfd is to use
/dev/userfaultfd instead of userfaultfd(2). See
Documentation/admin-guide/mm/userfaultfd.rst.
user_reserve_kbytes
===================


@ -37,6 +37,7 @@ Library functionality that is used throughout the kernel.
kref
assoc_array
xarray
maple_tree
idr
circular-buffers
rbtree


@ -0,0 +1,217 @@
.. SPDX-License-Identifier: GPL-2.0+
==========
Maple Tree
==========
:Author: Liam R. Howlett
Overview
========
The Maple Tree is a B-Tree data type which is optimized for storing
non-overlapping ranges, including ranges of size 1. The tree was designed to
be simple to use and does not require a user written search method. It
supports iterating over a range of entries and going to the previous or next
entry in a cache-efficient manner. The tree can also be put into an RCU-safe
mode of operation which allows reading and writing concurrently. Writers must
synchronize on a lock, which can be the default spinlock, or the user can set
the lock to an external lock of a different type.
The Maple Tree maintains a small memory footprint and was designed to use
modern processor cache efficiently. The majority of the users will be able to
use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex
scenarios. The most important usage of the Maple Tree is the tracking of the
virtual memory areas.
The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple
Tree reserves values with the bottom two bits set to '10' which are below 4096
(i.e. 2, 6, 10 .. 4094) for internal use. If entries might collide with these
reserved values, users can convert the entries using xa_mk_value() and convert
them back by calling xa_to_value(). If a user needs to store a reserved
value, this can be done via the :ref:`maple-tree-advanced-api`, but it is
blocked by the normal API.
The Maple Tree can also be configured to support searching for a gap of a given
size (or larger).
Pre-allocating of nodes is also supported using the
:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a
successful store operation within a given
code segment when allocating cannot be done. Allocations of nodes are
relatively small at around 256 bytes.
.. _maple-tree-normal-api:
Normal API
==========
Start by initialising a maple tree, either with DEFINE_MTREE() for statically
allocated maple trees or mt_init() for dynamically allocated ones. A
freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0``
- ``ULONG_MAX``. There are currently two types of maple trees supported: the
allocation tree and the regular tree. The regular tree has a higher branching
factor for internal nodes. The allocation tree has a lower branching factor
but allows the user to search for a gap of a given size or larger from either
``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by
passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree.
You can then set entries using mtree_store() or mtree_store_range().
mtree_store() will overwrite any entry with the new entry and return 0 on
success or an error code otherwise. mtree_store_range() works in the same way
but takes a range. mtree_load() is used to retrieve the entry stored at a
given index. You can use mtree_erase() to erase an entire range by only
knowing one value within that range, or mtree_store() call with an entry of
NULL may be used to partially erase a range or many ranges at once.
If you want to only store a new entry to a range (or index) if that range is
currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which
return -EEXIST if the range is not empty.
You can search for an entry from an index upwards by using mt_find().
You can walk each entry within a range by calling mt_for_each(). You must
provide a temporary variable to store a cursor. If you want to walk each
element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
the caller is going to hold the lock for the duration of the walk then it is
worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api`
section.
Sometimes it is necessary to ensure the next call to store to a maple tree does
not allocate memory; please see :ref:`maple-tree-advanced-api` for this use case.
Finally, you can remove all entries from a maple tree by calling
mtree_destroy(). If the maple tree entries are pointers, you may wish to free
the entries first.
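A minimal kernel-side sketch of the calls above might look like this
(``demo_tree``, ``demo_data`` and ``demo_tree_use()`` are invented names)::

    /* Minimal sketch of the normal API described above (kernel code). */
    #include <linux/gfp.h>
    #include <linux/limits.h>
    #include <linux/maple_tree.h>
    #include <linux/printk.h>

    static DEFINE_MTREE(demo_tree);         /* statically initialised, regular tree */
    static int demo_data = 42;

    static int demo_tree_use(void)
    {
            unsigned long index = 0;
            void *entry;
            int ret;

            /* Store a pointer over the range [10, 19]. */
            ret = mtree_store_range(&demo_tree, 10, 19, &demo_data, GFP_KERNEL);
            if (ret)
                    return ret;

            /* Load the entry covering index 15 (returns &demo_data). */
            entry = mtree_load(&demo_tree, 15);
            pr_info("load 15 -> %p\n", entry);

            /* Walk every entry in [0, ULONG_MAX]. */
            mt_for_each(&demo_tree, entry, index, ULONG_MAX)
                    pr_info("walked entry %p\n", entry);

            /* Erase the whole range by naming any index inside it. */
            mtree_erase(&demo_tree, 12);

            mtree_destroy(&demo_tree);
            return 0;
    }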
Allocating Nodes
----------------
The allocations are handled by the internal tree code. See
:ref:`maple-tree-advanced-alloc` for other options.
Locking
-------
You do not have to worry about locking. See :ref:`maple-tree-advanced-locks`
for other options.
The Maple Tree uses RCU and an internal spinlock to synchronise access:
Takes RCU read lock:
* mtree_load()
* mt_find()
* mt_for_each()
* mt_next()
* mt_prev()
Takes ma_lock internally:
* mtree_store()
* mtree_store_range()
* mtree_insert()
* mtree_insert_range()
* mtree_erase()
* mtree_destroy()
* mt_set_in_rcu()
* mt_clear_in_rcu()
If you want to take advantage of the internal lock to protect the data
structures that you are storing in the Maple Tree, you can call mtree_lock()
before calling mtree_load(), then take a reference count on the object you
have found before calling mtree_unlock(). This will prevent stores from
removing the object from the tree between looking up the object and
incrementing the refcount. You can also use RCU to avoid dereferencing
freed memory, but an explanation of that is beyond the scope of this
document.
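A sketch of that lookup-plus-refcount pattern, with an invented
``struct demo_obj`` standing in for whatever is stored in the tree::

    /* Take the internal lock around the lookup so the object cannot be
     * removed from the tree before we grab a reference to it. */
    #include <linux/kref.h>
    #include <linux/maple_tree.h>

    struct demo_obj {
            struct kref ref;
            /* ... payload ... */
    };

    static struct demo_obj *demo_obj_get(struct maple_tree *mt, unsigned long index)
    {
            struct demo_obj *obj;

            mtree_lock(mt);
            obj = mtree_load(mt, index);
            if (obj)
                    kref_get(&obj->ref);
            mtree_unlock(mt);

            return obj;
    }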
.. _maple-tree-advanced-api:
Advanced API
============
The advanced API offers more flexibility and better performance at the
cost of an interface which can be harder to use and has fewer safeguards.
You must take care of your own locking while using the advanced API.
You can use the ma_lock, RCU or an external lock for protection.
You can mix advanced and normal operations on the same array, as long
as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented
in terms of the advanced API.
The advanced API is based around the ma_state; this is where the 'mas'
prefix originates. The ma_state struct keeps track of tree operations to make
life easier for both internal and external tree users.
Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`.
Please see above.
The maple state keeps track of the range start and end in mas->index and
mas->last, respectively.
mas_walk() will walk the tree to the location of mas->index and set the
mas->index and mas->last according to the range for the entry.
You can set entries using mas_store(). mas_store() will overwrite any entry
with the new entry and return the first existing entry that is overwritten.
The range is passed in as members of the maple state: index and last.
You can use mas_erase() to erase an entire range by setting index and
last of the maple state to the desired range to erase. This will erase
the first range that is found in that range, set the maple state index
and last as the range that was erased and return the entry that existed
at that location.
You can walk each entry within a range by using mas_for_each(). If you want
to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as
the range. If the lock needs to be periodically dropped, see the locking
section mas_pause().
Using a maple state allows mas_next() and mas_prev() to function as if the
tree was a linked list. With such a high branching factor the amortized
performance penalty is outweighed by cache optimization. mas_next() will
return the next entry which occurs after the entry at index. mas_prev()
will return the previous entry which occurs before the entry at index.
mas_find() will find the first entry which exists at or above index on
the first call, and the next entry on every subsequent call.
mas_find_rev() will find the first entry which exists at or below the last on
the first call, and the previous entry on every subsequent call.
If the user needs to yield the lock during an operation, then the maple state
must be paused using mas_pause().
There are a few extra interfaces provided when using an allocation tree.
If you wish to search for a gap within a range, then mas_empty_area()
or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap
starting at the lowest index given up to the maximum of the range.
mas_empty_area_rev() searches for a gap starting at the highest index given
and continues downward to the lower bound of the range.
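A short sketch of this flow, using mas_store_gfp() (a store variant that takes
gfp flags) under the internal lock; the names are invented and error handling
is omitted::

    /* Sketch of the advanced API flow described above; the caller handles
     * locking, here via the internal ma_lock. */
    #include <linux/gfp.h>
    #include <linux/limits.h>
    #include <linux/maple_tree.h>
    #include <linux/printk.h>

    static int demo_a = 1, demo_b = 2;

    static void demo_mas_use(struct maple_tree *mt)
    {
            MA_STATE(mas, mt, 0, 0);
            void *entry;

            mas_lock(&mas);

            /* The range is carried in mas.index/mas.last. */
            mas_set_range(&mas, 100, 199);
            mas_store_gfp(&mas, &demo_a, GFP_KERNEL);

            mas_set_range(&mas, 200, 299);
            mas_store_gfp(&mas, &demo_b, GFP_KERNEL);

            /* Iterate every entry in [0, ULONG_MAX]. */
            mas_set(&mas, 0);
            mas_for_each(&mas, entry, ULONG_MAX)
                    pr_info("[%lu, %lu] -> %p\n", mas.index, mas.last, entry);

            mas_unlock(&mas);
    }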
.. _maple-tree-advanced-alloc:
Advanced Allocating Nodes
-------------------------
Allocations are usually handled internally to the tree, however if allocations
need to occur before a write occurs then calling mas_expected_entries() will
allocate the worst-case number of needed nodes to insert the provided number of
ranges. This also causes the tree to enter mass insertion mode. Once
insertions are complete calling mas_destroy() on the maple state will free the
unused allocations.
.. _maple-tree-advanced-locks:
Advanced Locking
----------------
The maple tree uses a spinlock by default, but external locks can be used for
tree updates as well. To use an external lock, the tree must be initialized
with the ``MT_FLAGS_LOCK_EXTERN`` flag. This is usually done with the
MTREE_INIT_EXT() #define, which takes an external lock as an argument.
Functions and structures
========================
.. kernel-doc:: include/linux/maple_tree.h
.. kernel-doc:: lib/maple_tree.c


@ -19,9 +19,6 @@ User Space Memory Access
Memory Allocation Controls
==========================
.. kernel-doc:: include/linux/gfp.h
:internal:
.. kernel-doc:: include/linux/gfp_types.h
:doc: Page mobility and placement hints


@ -24,6 +24,7 @@ Documentation/dev-tools/testing-overview.rst
kcov
gcov
kasan
kmsan
ubsan
kmemleak
kcsan


@ -111,9 +111,17 @@ parameter can be used to control panic and reporting behaviour:
report or also panic the kernel (default: ``report``). The panic happens even
if ``kasan_multi_shot`` is enabled.
Hardware Tag-Based KASAN mode (see the section about various modes below) is
intended for use in production as a security mitigation. Therefore, it supports
additional boot parameters that allow disabling KASAN or controlling features:
Software and Hardware Tag-Based KASAN modes (see the section about various
modes below) support altering stack trace collection behavior:
- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
traces collection (default: ``on``).
- ``kasan.stack_ring_size=<number of entries>`` specifies the number of entries
in the stack ring (default: ``32768``).
Hardware Tag-Based KASAN mode is intended for use in production as a security
mitigation. Therefore, it supports additional boot parameters that allow
disabling KASAN altogether or controlling its features:
- ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
@ -132,9 +140,6 @@ additional boot parameters that allow disabling KASAN or controlling features:
- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
allocations (default: ``on``).
- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
traces collection (default: ``on``).
Error reports
~~~~~~~~~~~~~


@ -0,0 +1,427 @@
.. SPDX-License-Identifier: GPL-2.0
.. Copyright (C) 2022, Google LLC.
===================================
The Kernel Memory Sanitizer (KMSAN)
===================================
KMSAN is a dynamic error detector aimed at finding uses of uninitialized
values. It is based on compiler instrumentation, and is quite similar to the
userspace `MemorySanitizer tool`_.
An important note is that KMSAN is not intended for production use, because it
drastically increases kernel memory footprint and slows the whole system down.
Usage
=====
Building the kernel
-------------------
In order to build a kernel with KMSAN you will need a fresh Clang (14.0.6+).
Please refer to `LLVM documentation`_ for the instructions on how to build Clang.
Now configure and build the kernel with CONFIG_KMSAN enabled.
Example report
--------------
Here is an example of a KMSAN report::
=====================================================
BUG: KMSAN: uninit-value in test_uninit_kmsan_check_memory+0x1be/0x380 [kmsan_test]
test_uninit_kmsan_check_memory+0x1be/0x380 mm/kmsan/kmsan_test.c:273
kunit_run_case_internal lib/kunit/test.c:333
kunit_try_run_case+0x206/0x420 lib/kunit/test.c:374
kunit_generic_run_threadfn_adapter+0x6d/0xc0 lib/kunit/try-catch.c:28
kthread+0x721/0x850 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 ??:?
Uninit was stored to memory at:
do_uninit_local_array+0xfa/0x110 mm/kmsan/kmsan_test.c:260
test_uninit_kmsan_check_memory+0x1a2/0x380 mm/kmsan/kmsan_test.c:271
kunit_run_case_internal lib/kunit/test.c:333
kunit_try_run_case+0x206/0x420 lib/kunit/test.c:374
kunit_generic_run_threadfn_adapter+0x6d/0xc0 lib/kunit/try-catch.c:28
kthread+0x721/0x850 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 ??:?
Local variable uninit created at:
do_uninit_local_array+0x4a/0x110 mm/kmsan/kmsan_test.c:256
test_uninit_kmsan_check_memory+0x1a2/0x380 mm/kmsan/kmsan_test.c:271
Bytes 4-7 of 8 are uninitialized
Memory access of size 8 starts at ffff888083fe3da0
CPU: 0 PID: 6731 Comm: kunit_try_catch Tainted: G B E 5.16.0-rc3+ #104
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
=====================================================
The report says that the local variable ``uninit`` was created uninitialized in
``do_uninit_local_array()``. The third stack trace corresponds to the place
where this variable was created.
The first stack trace shows where the uninit value was used (in
``test_uninit_kmsan_check_memory()``). The tool shows the bytes which were left
uninitialized in the local variable, as well as the stack where the value was
copied to another memory location before use.
A use of uninitialized value ``v`` is reported by KMSAN in the following cases:
- in a condition, e.g. ``if (v) { ... }``;
- in an indexing or pointer dereferencing, e.g. ``array[v]`` or ``*v``;
- when it is copied to userspace or hardware, e.g. ``copy_to_user(..., &v, ...)``;
- when it is passed as an argument to a function, and
``CONFIG_KMSAN_CHECK_PARAM_RETVAL`` is enabled (see below).
The mentioned cases (apart from copying data to userspace or hardware, which is
a security issue) are considered undefined behavior from the C11 Standard point
of view.
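A deliberately buggy illustration of these rules (not taken from the kernel)
is shown below; copying the value only propagates its shadow, while branching
on the copy triggers the report::

    /* Deliberately buggy function: `uninit` is never initialized. */
    int kmsan_demo(void)
    {
            int uninit;     /* poisoned stack variable */
            int copy;

            copy = uninit;  /* the uninit value is stored, no report yet */

            if (copy)       /* used in a condition: KMSAN reports here */
                    return 1;

            return 0;
    }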
Disabling the instrumentation
-----------------------------
A function can be marked with ``__no_kmsan_checks``. Doing so makes KMSAN
ignore uninitialized values in that function and mark its output as initialized.
As a result, the user will not get KMSAN reports related to that function.
Another function attribute supported by KMSAN is ``__no_sanitize_memory``.
Applying this attribute to a function will result in KMSAN not instrumenting
it, which can be helpful if we do not want the compiler to interfere with some
low-level code (e.g. that marked with ``noinstr`` which implicitly adds
``__no_sanitize_memory``).
This however comes at a cost: stack allocations from such functions will have
incorrect shadow/origin values, likely leading to false positives. Functions
called from non-instrumented code may also receive incorrect metadata for their
parameters.
As a rule of thumb, avoid using ``__no_sanitize_memory`` explicitly.
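A sketch of how the two attributes are applied; the functions themselves are
invented placeholders::

    #include <linux/compiler.h>
    #include <linux/string.h>
    #include <linux/types.h>

    /* Still instrumented, but KMSAN suppresses reports from this function
     * and marks its outputs as initialized. */
    static void __no_kmsan_checks copy_opaque_blob(void *dst, const void *src,
                                                   size_t len)
    {
            memcpy(dst, src, len);
    }

    /* Not instrumented at all; useful for low-level code, at the cost of
     * imprecise shadow/origin values for its stack and callees. */
    static unsigned long __no_sanitize_memory peek_word(const unsigned long *p)
    {
            return *p;      /* no shadow propagation happens for this load */
    }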
It is also possible to disable KMSAN for a single file (e.g. main.o)::
KMSAN_SANITIZE_main.o := n
or for the whole directory::
KMSAN_SANITIZE := n
in the Makefile. Think of this as applying ``__no_sanitize_memory`` to every
function in the file or directory. Most users won't need KMSAN_SANITIZE, unless
their code gets broken by KMSAN (e.g. runs at early boot time).
Support
=======
In order for KMSAN to work the kernel must be built with Clang, which so far is
the only compiler that has KMSAN support. The kernel instrumentation pass is
based on the userspace `MemorySanitizer tool`_.
The runtime library only supports x86_64 at the moment.
How KMSAN works
===============
KMSAN shadow memory
-------------------
KMSAN associates a metadata byte (also called shadow byte) with every byte of
kernel memory. A bit in the shadow byte is set iff the corresponding bit of the
kernel memory byte is uninitialized. Marking the memory uninitialized (i.e.
setting its shadow bytes to ``0xff``) is called poisoning, marking it
initialized (setting the shadow bytes to ``0x00``) is called unpoisoning.
When a new variable is allocated on the stack, it is poisoned by default by
instrumentation code inserted by the compiler (unless it is a stack variable
that is immediately initialized). Any new heap allocation done without
``__GFP_ZERO`` is also poisoned.
Compiler instrumentation also tracks the shadow values as they are used along
the code. When needed, instrumentation code invokes the runtime library in
``mm/kmsan/`` to persist shadow values.
The shadow value of a basic or compound type is an array of bytes of the same
length. When a constant value is written into memory, that memory is unpoisoned.
When a value is read from memory, its shadow memory is also obtained and
propagated into all the operations which use that value. For every instruction
that takes one or more values the compiler generates code that calculates the
shadow of the result depending on those values and their shadows.
Example::
int a = 0xff; // i.e. 0x000000ff
int b;
int c = a | b;
In this case the shadow of ``a`` is ``0``, shadow of ``b`` is ``0xffffffff``,
shadow of ``c`` is ``0xffffff00``. This means that the upper three bytes of
``c`` are uninitialized, while the lower byte is initialized.
Origin tracking
---------------
Every four bytes of kernel memory also have a so-called origin mapped to them.
This origin describes the point in program execution at which the uninitialized
value was created. Every origin is associated with either the full allocation
stack (for heap-allocated memory), or the function containing the uninitialized
variable (for locals).
When an uninitialized variable is allocated on stack or heap, a new origin
value is created, and that variable's origin is filled with that value. When a
value is read from memory, its origin is also read and kept together with the
shadow. For every instruction that takes one or more values, the origin of the
result is one of the origins corresponding to any of the uninitialized inputs.
If a poisoned value is written into memory, its origin is written to the
corresponding storage as well.
Example 1::
int a = 42;
int b;
int c = a + b;
In this case the origin of ``b`` is generated upon function entry, and is
stored to the origin of ``c`` right before the addition result is written into
memory.
Several variables may share the same origin address, if they are stored in the
same four-byte chunk. In this case every write to either variable updates the
origin for all of them. We have to sacrifice precision in this case, because
storing origins for individual bits (and even bytes) would be too costly.
Example 2::
int combine(short a, short b) {
union ret_t {
int i;
short s[2];
} ret;
ret.s[0] = a;
ret.s[1] = b;
return ret.i;
}
If ``a`` is initialized and ``b`` is not, the shadow of the result would be
0xffff0000, and the origin of the result would be the origin of ``b``.
``ret.s[0]`` would have the same origin, but it will never be used, because
that variable is initialized.
If both function arguments are uninitialized, only the origin of the second
argument is preserved.
Origin chaining
~~~~~~~~~~~~~~~
To ease debugging, KMSAN creates a new origin for every store of an
uninitialized value to memory. The new origin references both its creation stack
and the previous origin the value had. This may cause increased memory
consumption, so we limit the length of origin chains in the runtime.
Clang instrumentation API
-------------------------
Clang's instrumentation pass inserts calls to functions defined in
``mm/kmsan/instrumentation.c`` into the kernel code.
Shadow manipulation
~~~~~~~~~~~~~~~~~~~
For every memory access the compiler emits a call to a function that returns a
pair of pointers to the shadow and origin addresses of the given memory::
typedef struct {
void *shadow, *origin;
} shadow_origin_ptr_t
shadow_origin_ptr_t __msan_metadata_ptr_for_load_{1,2,4,8}(void *addr)
shadow_origin_ptr_t __msan_metadata_ptr_for_store_{1,2,4,8}(void *addr)
shadow_origin_ptr_t __msan_metadata_ptr_for_load_n(void *addr, uintptr_t size)
shadow_origin_ptr_t __msan_metadata_ptr_for_store_n(void *addr, uintptr_t size)
The function name depends on the memory access size.
The compiler makes sure that for every loaded value its shadow and origin
values are read from memory. When a value is stored to memory, its shadow and
origin are also stored using the metadata pointers.
Handling locals
~~~~~~~~~~~~~~~
A special function is used to create a new origin value for a local variable and
set the origin of that variable to that value::
void __msan_poison_alloca(void *addr, uintptr_t size, char *descr)
Access to per-task data
~~~~~~~~~~~~~~~~~~~~~~~
At the beginning of every instrumented function KMSAN inserts a call to
``__msan_get_context_state()``::
kmsan_context_state *__msan_get_context_state(void)
``kmsan_context_state`` is declared in ``include/linux/kmsan.h``::
struct kmsan_context_state {
char param_tls[KMSAN_PARAM_SIZE];
char retval_tls[KMSAN_RETVAL_SIZE];
char va_arg_tls[KMSAN_PARAM_SIZE];
char va_arg_origin_tls[KMSAN_PARAM_SIZE];
u64 va_arg_overflow_size_tls;
char param_origin_tls[KMSAN_PARAM_SIZE];
depot_stack_handle_t retval_origin_tls;
};
This structure is used by KMSAN to pass parameter shadows and origins between
instrumented functions (unless the parameters are checked immediately by
``CONFIG_KMSAN_CHECK_PARAM_RETVAL``).
Passing uninitialized values to functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clang's MemorySanitizer instrumentation has an option,
``-fsanitize-memory-param-retval``, which makes the compiler check function
parameters passed by value, as well as function return values.
The option is controlled by ``CONFIG_KMSAN_CHECK_PARAM_RETVAL``, which is
enabled by default to let KMSAN report uninitialized values earlier.
Please refer to the `LKML discussion`_ for more details.
Because of the way the checks are implemented in LLVM (they are only applied to
parameters marked as ``noundef``), not all parameters are guaranteed to be
checked, so we cannot give up the metadata storage in ``kmsan_context_state``.
String functions
~~~~~~~~~~~~~~~~
The compiler replaces calls to ``memcpy()``/``memmove()``/``memset()`` with the
following functions. These functions are also called when data structures are
initialized or copied, making sure shadow and origin values are copied alongside
with the data::
void *__msan_memcpy(void *dst, void *src, uintptr_t n)
void *__msan_memmove(void *dst, void *src, uintptr_t n)
void *__msan_memset(void *dst, int c, uintptr_t n)
Error reporting
~~~~~~~~~~~~~~~
For each use of a value the compiler emits a shadow check that calls
``__msan_warning()`` in the case that value is poisoned::
void __msan_warning(u32 origin)
``__msan_warning()`` causes KMSAN runtime to print an error report.
Inline assembly instrumentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMSAN instruments every inline assembly output with a call to::
void __msan_instrument_asm_store(void *addr, uintptr_t size)
, which unpoisons the memory region.
This approach may mask certain errors, but it also helps to avoid a lot of
false positives in bitwise operations, atomics etc.
Sometimes the pointers passed into inline assembly do not point to valid memory.
In such cases they are ignored at runtime.
Runtime library
---------------
The code is located in ``mm/kmsan/``.
Per-task KMSAN state
~~~~~~~~~~~~~~~~~~~~
Every task_struct has an associated KMSAN task state that holds the KMSAN
context (see above) and a per-task flag disallowing KMSAN reports::
struct kmsan_context {
...
bool allow_reporting;
struct kmsan_context_state cstate;
...
}
struct task_struct {
...
struct kmsan_context kmsan;
...
}
KMSAN contexts
~~~~~~~~~~~~~~
When running in a kernel task context, KMSAN uses ``current->kmsan.cstate`` to
hold the metadata for function parameters and return values.
But in the case the kernel is running in the interrupt, softirq or NMI context,
where ``current`` is unavailable, KMSAN switches to per-cpu interrupt state::
DEFINE_PER_CPU(struct kmsan_ctx, kmsan_percpu_ctx);
Metadata allocation
~~~~~~~~~~~~~~~~~~~
There are several places in the kernel for which the metadata is stored.
1. Each ``struct page`` instance contains two pointers to its shadow and
origin pages::
struct page {
...
struct page *shadow, *origin;
...
};
At boot-time, the kernel allocates shadow and origin pages for every available
kernel page. This is done quite late, when the kernel address space is already
fragmented, so normal data pages may arbitrarily interleave with the metadata
pages.
This means that in general for two contiguous memory pages their shadow/origin
pages may not be contiguous. Consequently, if a memory access crosses the
boundary of a memory block, accesses to shadow/origin memory may potentially
corrupt other pages or read incorrect values from them.
In practice, contiguous memory pages returned by the same ``alloc_pages()``
call will have contiguous metadata, whereas if these pages belong to two
different allocations their metadata pages can be fragmented.
For the kernel data (``.data``, ``.bss`` etc.) and percpu memory regions
there also are no guarantees on metadata contiguity.
In the case ``__msan_metadata_ptr_for_XXX_YYY()`` hits the border between two
pages with non-contiguous metadata, it returns pointers to fake shadow/origin regions::
char dummy_load_page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
char dummy_store_page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
``dummy_load_page`` is zero-initialized, so reads from it always yield zeroes.
All stores to ``dummy_store_page`` are ignored.
2. For vmalloc memory and modules, there is a direct mapping between the memory
range, its shadow and origin. KMSAN reduces the vmalloc area by 3/4, making only
the first quarter available to ``vmalloc()``. The second quarter of the vmalloc
area contains shadow memory for the first quarter, the third one holds the
origins. A small part of the fourth quarter contains shadow and origins for the
kernel modules. Please refer to ``arch/x86/include/asm/pgtable_64_types.h`` for
more details.
When an array of pages is mapped into a contiguous virtual memory space, their
shadow and origin pages are similarly mapped into contiguous regions.
References
==========
E. Stepanov, K. Serebryany. `MemorySanitizer: fast detector of uninitialized
memory use in C++
<https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43308.pdf>`_.
In Proceedings of CGO 2015.
.. _MemorySanitizer tool: https://clang.llvm.org/docs/MemorySanitizer.html
.. _LLVM documentation: https://llvm.org/docs/GettingStarted.html
.. _LKML discussion: https://lore.kernel.org/all/20220614144853.3693273-1-glider@google.com/


@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose.
ksm
memory-model
mmu_notifier
multigen_lru
numa
overcommit-accounting
page_migration


@ -26,7 +26,7 @@ tree.
If a KSM page is shared between less than ``max_page_sharing`` VMAs,
the node of the stable tree that represents such KSM page points to a
list of struct rmap_item and the ``page->mapping`` of the
list of struct ksm_rmap_item and the ``page->mapping`` of the
KSM page points to the stable tree node.
When the sharing passes this threshold, KSM adds a second dimension to


@ -0,0 +1,159 @@
.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============

The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Design overview
===============

Objectives
----------

The design objectives are:

* Good representation of access recency
* Try to profit from spatial locality
* Fast paths to make obvious choices
* Simple self-correcting heuristics

The representation of access recency is at the core of all LRU
implementations. In the multi-gen LRU, each generation represents a
group of pages with similar access recency. Generations establish a
(time-based) common frame of reference and therefore help make better
choices, e.g., between different memcgs on a computer or different
computers in a data center (for job scheduling).

Exploiting spatial locality improves efficiency when gathering the
accessed bit. A rmap walk targets a single page and does not try to
profit from discovering a young PTE. A page table walk can sweep all
the young PTEs in an address space, but the address space can be too
sparse to make a profit. The key is to optimize both methods and use
them in combination.

Fast paths reduce code complexity and runtime overhead. Unmapped pages
do not require TLB flushes; clean pages do not require writeback.
These facts are only helpful when other conditions, e.g., access
recency, are similar. With generations as a common frame of reference,
additional factors stand out. But obvious choices might not be good
choices; thus self-correction is necessary.

The benefits of simple self-correcting heuristics are self-evident.
Again, with generations as a common frame of reference, this becomes
attainable. Specifically, pages in the same generation can be
categorized based on additional factors, and a feedback loop can
statistically compare the refault percentages across those categories
and infer which of them are better choices.

Assumptions
-----------

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:

* Accesses through page tables
* Accesses through file descriptors

The protection of the former channel is by design stronger because:

1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns:

* Accesses exhibiting temporal locality
* Accesses not exhibiting temporal locality

For the reasons listed above, the former channel is assumed to follow
the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
present, and the latter channel is assumed to follow the latter
pattern unless outlying refaults have been observed.

Workflow overview
=================

Evictable pages are divided into multiple generations for each
``lruvec``. The youngest generation number is stored in
``lrugen->max_seq`` for both anon and file types as they are aged on
an equal footing. The oldest generation numbers are stored in
``lrugen->min_seq[]`` separately for anon and file types as clean file
pages can be evicted regardless of swap constraints. These three
variables are monotonically increasing.

Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
bits in order to fit into the gen counter in ``folio->flags``. Each
truncated generation number is an index to ``lrugen->lists[]``. The
sliding window technique is used to track at least ``MIN_NR_GENS`` and
at most ``MAX_NR_GENS`` generations. The gen counter stores a value
within ``[1, MAX_NR_GENS]`` while a page is on one of
``lrugen->lists[]``; otherwise it stores zero.
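
The arithmetic above can be illustrated with a small sketch (the helper
names and the ``MAX_NR_GENS`` value are assumptions for illustration, not
the kernel's implementation)::

   #define MAX_NR_GENS 4U  /* assumed window size */

   /* Truncated generation number: an index into lrugen->lists[]. */
   static unsigned int gen_from_seq(unsigned long seq)
   {
           return seq % MAX_NR_GENS;
   }

   /* Gen counter kept in folio->flags: zero when the page is not on
    * lrugen->lists[], otherwise a value within [1, MAX_NR_GENS]. */
   static unsigned int gen_counter_from_seq(unsigned long seq)
   {
           return (seq % MAX_NR_GENS) + 1;
   }

With ``MAX_NR_GENS`` equal to 4, the counter fits into
``order_base_2(4+1)`` = 3 bits of ``folio->flags``.
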
Each generation is divided into multiple tiers. A page accessed ``N``
times through file descriptors is in tier ``order_base_2(N)``. Unlike
generations, tiers do not have dedicated ``lrugen->lists[]``. In
contrast to moving across generations, which requires the LRU lock,
moving across tiers only involves atomic operations on
``folio->flags`` and therefore has a negligible cost. A feedback loop
modeled after the PID controller monitors refaults over all the tiers
from anon and file types and decides which tiers from which types to
evict or protect.
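
For example, the tier of a page accessed ``N`` times through file
descriptors could be computed as below (sketch only; the kernel caps the
access count and keeps it in ``folio->flags``, which this ignores)::

   /* Tier index = order_base_2(refs): 0 for one access, 1 for two,
    * 2 for three or four accesses, and so on. */
   static unsigned int tier_from_refs(unsigned int refs)
   {
           if (refs <= 1)
                   return 0;
           /* order_base_2(n) == fls(n - 1) for n > 1 */
           return 32 - (unsigned int)__builtin_clz(refs - 1);
   }

Because only this counter changes, moving a page across tiers needs no
list manipulation and hence no LRU lock.
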
There are two conceptually independent procedures: the aging and the
eviction. They form a closed-loop system, i.e., the page reclaim.

Aging
-----

The aging produces young generations. Given an ``lruvec``, it
increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
``MIN_NR_GENS``. The aging promotes hot pages to the youngest
generation when it finds them accessed through page tables; the
demotion of cold pages happens consequently when it increments
``max_seq``. The aging uses page table walks and rmap walks to find
young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
and calls ``walk_page_range()`` with each ``mm_struct`` on this list
to scan PTEs, and after each iteration, it increments ``max_seq``. For
the latter, when the eviction walks the rmap and finds a young PTE,
the aging scans the adjacent PTEs. For both, on finding a young PTE,
the aging clears the accessed bit and updates the gen counter of the
page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.

Eviction
--------

The eviction consumes old generations. Given an ``lruvec``, it
increments ``min_seq`` when ``lrugen->lists[]`` indexed by
``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
evict from, it first compares ``min_seq[]`` to select the older type.
If both types are equally old, it selects the one whose first tier has
a lower refault percentage. The first tier contains single-use
unmapped clean pages, which are the best bet. The eviction sorts a
page according to its gen counter if the aging has found this page
accessed through page tables and updated its gen counter. It also
moves a page to the next generation, i.e., ``min_seq+1``, if this page
was accessed multiple times through file descriptors and the feedback
loop has detected outlying refaults from the tier this page is in. To
this end, the feedback loop uses the first tier as the baseline, for
the reason stated earlier.
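
The type selection described above can be sketched as follows (the types
and field names are hypothetical; the in-kernel feedback loop also applies
a PID-controller gain, which this omits)::

   enum lru_type { LRU_ANON, LRU_FILE, NR_LRU_TYPES };

   struct type_stats {
           unsigned long min_seq;          /* oldest generation of this type */
           unsigned long tier0_refaulted;  /* refaults seen in the first tier */
           unsigned long tier0_evicted;    /* evictions from the first tier */
   };

   /* Pick the older type; on a tie, the one whose first tier refaults less. */
   static enum lru_type type_to_scan(const struct type_stats s[NR_LRU_TYPES])
   {
           if (s[LRU_ANON].min_seq != s[LRU_FILE].min_seq)
                   return s[LRU_ANON].min_seq < s[LRU_FILE].min_seq ?
                          LRU_ANON : LRU_FILE;
           /* Compare refaulted/evicted ratios without dividing. */
           return s[LRU_ANON].tier0_refaulted * s[LRU_FILE].tier0_evicted <=
                  s[LRU_FILE].tier0_refaulted * s[LRU_ANON].tier0_evicted ?
                  LRU_ANON : LRU_FILE;
   }
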
Summary
-------

The multi-gen LRU can be disassembled into the following parts:

* Generations
* Rmap walks
* Page table walks
* Bloom filters
* PID controller

The aging and the eviction form a producer-consumer model;
specifically, the latter drives the former by the sliding window over
generations. Within the aging, rmap walks drive page table walks by
inserting hot densely populated page tables to the Bloom filters.
Within the eviction, the PID controller uses refaults as the feedback
to select types to evict and tiers to protect.


@ -94,6 +94,11 @@ Usage
Page allocated via order XXX, ...
PFN XXX ...
// Detailed stack
By default, it will do full pfn dump, to start with a given pfn,
page_owner supports fseek.
FILE *fp = fopen("/sys/kernel/debug/page_owner", "r");
fseek(fp, pfn_start, SEEK_SET);
The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows
in buf, uses regexp to extract the page order value, counts the times


@ -11004,7 +11004,6 @@ F: arch/*/include/asm/*kasan.h
F: arch/*/mm/kasan_init*
F: include/linux/kasan*.h
F: lib/Kconfig.kasan
F: lib/test_kasan*.c
F: mm/kasan/
F: scripts/Makefile.kasan
@ -11438,6 +11437,20 @@ F: kernel/kmod.c
F: lib/test_kmod.c
F: tools/testing/selftests/kmod/
KMSAN
M: Alexander Potapenko <glider@google.com>
R: Marco Elver <elver@google.com>
R: Dmitry Vyukov <dvyukov@google.com>
L: kasan-dev@googlegroups.com
S: Maintained
F: Documentation/dev-tools/kmsan.rst
F: arch/*/include/asm/kmsan.h
F: arch/*/mm/kmsan_*
F: include/linux/kmsan*.h
F: lib/Kconfig.kmsan
F: mm/kmsan/
F: scripts/Makefile.kmsan
KPROBES
M: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
M: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
@ -12168,6 +12181,18 @@ L: linux-man@vger.kernel.org
S: Maintained
W: http://www.kernel.org/doc/man-pages
MAPLE TREE
M: Liam R. Howlett <Liam.Howlett@oracle.com>
L: linux-mm@kvack.org
S: Supported
F: Documentation/core-api/maple_tree.rst
F: include/linux/maple_tree.h
F: include/trace/events/maple_tree.h
F: lib/maple_tree.c
F: lib/test_maple_tree.c
F: tools/testing/radix-tree/linux/maple_tree.h
F: tools/testing/radix-tree/maple.c
MARDUK (CREATOR CI40) DEVICE TREE SUPPORT
M: Rahul Bedarkar <rahulbedarkar89@gmail.com>
L: linux-mips@vger.kernel.org


@ -1081,6 +1081,7 @@ include-y := scripts/Makefile.extrawarn
include-$(CONFIG_DEBUG_INFO) += scripts/Makefile.debug
include-$(CONFIG_KASAN) += scripts/Makefile.kasan
include-$(CONFIG_KCSAN) += scripts/Makefile.kcsan
include-$(CONFIG_KMSAN) += scripts/Makefile.kmsan
include-$(CONFIG_UBSAN) += scripts/Makefile.ubsan
include-$(CONFIG_KCOV) += scripts/Makefile.kcov
include-$(CONFIG_RANDSTRUCT) += scripts/Makefile.randstruct


@ -1416,6 +1416,14 @@ config DYNAMIC_SIGFRAME
config HAVE_ARCH_NODE_DEV_GROUP
bool
config ARCH_HAS_NONLEAF_PMD_YOUNG
bool
help
Architectures that select this option are capable of setting the
accessed bit in non-leaf PMD entries when using them as part of linear
address translations. Page table walkers that clear the accessed bit
may use this capability to reduce their search space.
source "kernel/gcov/Kconfig"
source "scripts/gcc-plugins/Kconfig"
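
To illustrate the ARCH_HAS_NONLEAF_PMD_YOUNG help text above: when the
accessed bit of a non-leaf PMD entry is clear, no translation has gone
through that entry since the bit was last cleared, so a page table walker
can skip every PTE underneath it. A toy userspace model of that shortcut
(all names made up; this is not kernel code):

    #include <stdbool.h>
    #include <stddef.h>

    #define NR_PMDS   8     /* toy "PMD" entries */
    #define NR_PTES 512     /* leaf entries covered by each of them */

    struct toy_pte { bool accessed; };
    struct toy_pmd { bool accessed; struct toy_pte ptes[NR_PTES]; };

    /* Count and clear young leaf entries, skipping ranges whose non-leaf
     * accessed bit was never set. */
    static size_t scan_young(struct toy_pmd pmds[NR_PMDS])
    {
            size_t young = 0;

            for (size_t i = 0; i < NR_PMDS; i++) {
                    if (!pmds[i].accessed)
                            continue;  /* not used since last cleared: skip its PTEs */
                    pmds[i].accessed = false;
                    for (size_t j = 0; j < NR_PTES; j++) {
                            if (pmds[i].ptes[j].accessed) {
                                    pmds[i].ptes[j].accessed = false;
                                    young++;
                            }
                    }
            }
            return young;
    }
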


@ -76,6 +76,8 @@
#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
/* compatibility flags */
#define MAP_FILE 0


@ -554,7 +554,7 @@ config ARC_BUILTIN_DTB_NAME
endmenu # "ARC Architecture Configuration"
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "12" if ARC_HUGEPAGE_16M
default "11"


@ -1362,7 +1362,7 @@ config ARM_MODULE_PLTS
Disabling this is usually safe for small single-platform
configurations. If unsure, say y.
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "12" if SOC_AM33XX
default "9" if SA1111


@ -31,7 +31,7 @@ CONFIG_SOC_VF610=y
CONFIG_SMP=y
CONFIG_ARM_PSCI=y
CONFIG_HIGHMEM=y
CONFIG_FORCE_MAX_ZONEORDER=14
CONFIG_ARCH_FORCE_MAX_ORDER=14
CONFIG_CMDLINE="noinitrd console=ttymxc0,115200"
CONFIG_KEXEC=y
CONFIG_CPU_FREQ=y


@ -26,7 +26,7 @@ CONFIG_THUMB2_KERNEL=y
# CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11 is not set
# CONFIG_ARM_PATCH_IDIV is not set
CONFIG_HIGHMEM=y
CONFIG_FORCE_MAX_ZONEORDER=12
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_SECCOMP=y
CONFIG_KEXEC=y
CONFIG_EFI=y


@ -12,7 +12,7 @@ CONFIG_ARCH_OXNAS=y
CONFIG_MACH_OX820=y
CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_FORCE_MAX_ZONEORDER=12
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_SECCOMP=y
CONFIG_ARM_APPENDED_DTB=y
CONFIG_ARM_ATAG_DTB_COMPAT=y


@ -21,7 +21,7 @@ CONFIG_MACH_AKITA=y
CONFIG_MACH_BORZOI=y
CONFIG_PXA_SYSTEMS_CPLDS=y
CONFIG_AEABI=y
CONFIG_FORCE_MAX_ZONEORDER=9
CONFIG_ARCH_FORCE_MAX_ORDER=9
CONFIG_CMDLINE="root=/dev/ram0 ro"
CONFIG_KEXEC=y
CONFIG_CPU_FREQ=y


@ -19,7 +19,7 @@ CONFIG_ATMEL_CLOCKSOURCE_TCB=y
# CONFIG_CACHE_L2X0 is not set
# CONFIG_ARM_PATCH_IDIV is not set
# CONFIG_CPU_SW_DOMAIN_PAN is not set
CONFIG_FORCE_MAX_ZONEORDER=15
CONFIG_ARCH_FORCE_MAX_ORDER=15
CONFIG_UACCESS_WITH_MEMCPY=y
# CONFIG_ATAGS is not set
CONFIG_CMDLINE="console=ttyS0,115200 earlyprintk ignore_loglevel"


@ -17,7 +17,7 @@ CONFIG_ARCH_SUNPLUS=y
# CONFIG_VDSO is not set
CONFIG_SMP=y
CONFIG_THUMB2_KERNEL=y
CONFIG_FORCE_MAX_ZONEORDER=12
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_VFP=y
CONFIG_NEON=y
CONFIG_MODULES=y


@ -1431,7 +1431,7 @@ config XEN
help
Say Y if you want to run Linux in a Virtual Machine on Xen on ARM64.
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int
default "14" if ARM64_64K_PAGES
default "12" if ARM64_16K_PAGES


@ -1082,24 +1082,13 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
* page after fork() + CoW for pfn mappings. We don't always have a
* hardware-managed access flag on arm64.
*/
static inline bool arch_faults_on_old_pte(void)
{
/* The register read below requires a stable CPU to make any sense */
cant_migrate();
return !cpu_has_hw_af();
}
#define arch_faults_on_old_pte arch_faults_on_old_pte
#define arch_has_hw_pte_young cpu_has_hw_af
/*
* Experimentally, it's cheap to set the access flag in hardware and we
* benefit from prefaulting mappings as 'old' to start with.
*/
static inline bool arch_wants_old_prefaulted_pte(void)
{
return !arch_faults_on_old_pte();
}
#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
#define arch_wants_old_prefaulted_pte cpu_has_hw_af
static inline bool pud_sect_supported(void)
{


@ -8,9 +8,9 @@
#include <asm/cpufeature.h>
#include <asm/mte.h>
#define for_each_mte_vma(tsk, vma) \
#define for_each_mte_vma(vmi, vma) \
if (system_supports_mte()) \
for (vma = tsk->mm->mmap; vma; vma = vma->vm_next) \
for_each_vma(vmi, vma) \
if (vma->vm_flags & VM_MTE)
static unsigned long mte_vma_tag_dump_size(struct vm_area_struct *vma)
@ -81,8 +81,9 @@ Elf_Half elf_core_extra_phdrs(void)
{
struct vm_area_struct *vma;
int vma_count = 0;
VMA_ITERATOR(vmi, current->mm, 0);
for_each_mte_vma(current, vma)
for_each_mte_vma(vmi, vma)
vma_count++;
return vma_count;
@ -91,8 +92,9 @@ Elf_Half elf_core_extra_phdrs(void)
int elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
{
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, current->mm, 0);
for_each_mte_vma(current, vma) {
for_each_mte_vma(vmi, vma) {
struct elf_phdr phdr;
phdr.p_type = PT_AARCH64_MEMTAG_MTE;
@ -116,8 +118,9 @@ size_t elf_core_extra_data_size(void)
{
struct vm_area_struct *vma;
size_t data_size = 0;
VMA_ITERATOR(vmi, current->mm, 0);
for_each_mte_vma(current, vma)
for_each_mte_vma(vmi, vma)
data_size += mte_vma_tag_dump_size(vma);
return data_size;
@ -126,8 +129,9 @@ size_t elf_core_extra_data_size(void)
int elf_core_write_extra_data(struct coredump_params *cprm)
{
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, current->mm, 0);
for_each_mte_vma(current, vma) {
for_each_mte_vma(vmi, vma) {
if (vma->vm_flags & VM_DONTDUMP)
continue;


@ -133,10 +133,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
{
struct mm_struct *mm = task->mm;
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
mmap_read_lock(mm);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))


@ -245,7 +245,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
{
VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
return page_folio(pfn_to_page(swp_offset(entry)));
return page_folio(pfn_to_page(swp_offset_pfn(entry)));
}
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,


@ -332,7 +332,7 @@ config HIGHMEM
select KMAP_LOCAL
default y
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "11"


@ -200,7 +200,7 @@ config IA64_CYCLONE
Say Y here to enable support for IBM EXA Cyclone time source.
If you're unsure, answer N.
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "MAX_ORDER (11 - 17)" if !HUGETLB_PAGE
range 11 17 if !HUGETLB_PAGE
default "17" if HUGETLB_PAGE


@ -11,10 +11,10 @@
#define SECTION_SIZE_BITS (30)
#define MAX_PHYSMEM_BITS (50)
#ifdef CONFIG_FORCE_MAX_ZONEORDER
#if ((CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
#ifdef CONFIG_ARCH_FORCE_MAX_ORDER
#if ((CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
#undef SECTION_SIZE_BITS
#define SECTION_SIZE_BITS (CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT)
#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT)
#endif
#endif


@ -377,7 +377,7 @@ config NODES_SHIFT
default "6"
depends on NUMA
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 14 64 if PAGE_SIZE_64KB
default "14" if PAGE_SIZE_64KB


@ -397,7 +397,7 @@ config SINGLE_MEMORY_CHUNK
order" to save memory that could be wasted for unused memory map.
Say N if not sure.
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order" if ADVANCED
depends on !SINGLE_MEMORY_CHUNK
default "11"


@ -2140,7 +2140,7 @@ config PAGE_SIZE_64KB
endchoice
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 14 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
default "14" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB


@ -9,7 +9,6 @@ CONFIG_HIGH_RES_TIMERS=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_CGROUPS=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y


@ -3,7 +3,6 @@ CONFIG_NO_HZ_IDLE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_BLK_CGROUP=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y


@ -103,6 +103,8 @@
#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
/* compatibility flags */
#define MAP_FILE 0


@ -44,7 +44,7 @@ menu "Kernel features"
source "kernel/Kconfig.hz"
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 9 20
default "11"


@ -70,6 +70,8 @@
#define MADV_WIPEONFORK 71 /* Zero memory on fork, child only */
#define MADV_KEEPONFORK 72 /* Undo MADV_WIPEONFORK */
#define MADV_COLLAPSE 73 /* Synchronous hugepage collapse */
#define MADV_HWPOISON 100 /* poison a page for testing */
#define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */


@ -657,15 +657,20 @@ static inline unsigned long mm_total_size(struct mm_struct *mm)
{
struct vm_area_struct *vma;
unsigned long usize = 0;
VMA_ITERATOR(vmi, mm, 0);
for (vma = mm->mmap; vma && usize < parisc_cache_flush_threshold; vma = vma->vm_next)
for_each_vma(vmi, vma) {
if (usize >= parisc_cache_flush_threshold)
break;
usize += vma->vm_end - vma->vm_start;
}
return usize;
}
void flush_cache_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
/*
* Flushing the whole cache on each cpu takes forever on
@ -685,7 +690,7 @@ void flush_cache_mm(struct mm_struct *mm)
}
/* Flush mm */
for (vma = mm->mmap; vma; vma = vma->vm_next)
for_each_vma(vmi, vma)
flush_cache_pages(vma, vma->vm_start, vma->vm_end);
}


@ -846,7 +846,7 @@ config DATA_SHIFT
in that case. If PIN_TLB is selected, it must be aligned to 8M as
8M pages will be pinned.
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 8 9 if PPC64 && PPC_64K_PAGES
default "9" if PPC64 && PPC_64K_PAGES


@ -30,7 +30,7 @@ CONFIG_PREEMPT=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_BINFMT_MISC=m
CONFIG_MATH_EMULATION=y
CONFIG_FORCE_MAX_ZONEORDER=17
CONFIG_ARCH_FORCE_MAX_ORDER=17
CONFIG_PCI=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCI_MSI=y


@ -41,7 +41,7 @@ CONFIG_FIXED_PHY=y
CONFIG_FONT_8x16=y
CONFIG_FONT_8x8=y
CONFIG_FONTS=y
CONFIG_FORCE_MAX_ZONEORDER=13
CONFIG_ARCH_FORCE_MAX_ORDER=13
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAME_WARN=1024
CONFIG_FTL=y


@ -17,7 +17,6 @@ CONFIG_LOG_CPU_MAX_BUF_SHIFT=13
CONFIG_NUMA_BALANCING=y
CONFIG_CGROUPS=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y


@ -18,7 +18,6 @@ CONFIG_LOG_CPU_MAX_BUF_SHIFT=13
CONFIG_NUMA_BALANCING=y
CONFIG_CGROUPS=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y


@ -115,18 +115,18 @@ struct vdso_data *arch_get_vdso_data(void *vvar_page)
int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
{
struct mm_struct *mm = task->mm;
VMA_ITERATOR(vmi, mm, 0);
struct vm_area_struct *vma;
mmap_read_lock(mm);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (vma_is_special_mapping(vma, &vvar_spec))
zap_page_range(vma, vma->vm_start, size);
}
mmap_read_unlock(mm);
return 0;
}


@ -81,14 +81,15 @@ EXPORT_SYMBOL(hash__flush_range);
void hash__flush_tlb_mm(struct mm_struct *mm)
{
struct vm_area_struct *mp;
VMA_ITERATOR(vmi, mm, 0);
/*
* It is safe to go down the mm's list of vmas when called
* from dup_mmap, holding mmap_lock. It would also be safe from
* unmap_region or exit_mmap, but not from vmtruncate on SMP -
* but it seems dup_mmap is the only SMP case which gets here.
* It is safe to iterate the vmas when called from dup_mmap,
* holding mmap_lock. It would also be safe from unmap_region
* or exit_mmap, but not from vmtruncate on SMP - but it seems
* dup_mmap is the only SMP case which gets here.
*/
for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
for_each_vma(vmi, mp)
hash__flush_range(mp->vm_mm, mp->vm_start, mp->vm_end);
}
EXPORT_SYMBOL(hash__flush_tlb_mm);


@ -149,24 +149,15 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
unsigned long len)
{
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, addr);
/*
* We don't try too hard, we just mark all the vma in that range
* VM_NOHUGEPAGE and split them.
*/
vma = find_vma(mm, addr);
/*
* If the range is in unmapped range, just return
*/
if (vma && ((addr + len) <= vma->vm_start))
return;
while (vma) {
if (vma->vm_start >= (addr + len))
break;
for_each_vma_range(vmi, vma, addr + len) {
vma->vm_flags |= VM_NOHUGEPAGE;
walk_page_vma(vma, &subpage_walk_ops, NULL);
vma = vma->vm_next;
}
}
#else


@ -114,11 +114,12 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
{
struct mm_struct *mm = task->mm;
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
struct __vdso_info *vdso_info = mm->context.vdso_info;
mmap_read_lock(mm);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (vma_is_special_mapping(vma, vdso_info->dm))


@ -69,10 +69,11 @@ static struct page *find_timens_vvar_page(struct vm_area_struct *vma)
int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
{
struct mm_struct *mm = task->mm;
VMA_ITERATOR(vmi, mm, 0);
struct vm_area_struct *vma;
mmap_read_lock(mm);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (!vma_is_special_mapping(vma, &vvar_mapping))


@ -81,8 +81,9 @@ unsigned long _copy_from_user_key(void *to, const void __user *from,
might_fault();
if (!should_fail_usercopy()) {
instrument_copy_from_user(to, from, n);
instrument_copy_from_user_before(to, from, n);
res = raw_copy_from_user_key(to, from, n, key);
instrument_copy_from_user_after(to, from, n, res);
}
if (unlikely(res))
memset(to + (n - res), 0, res);


@ -2515,8 +2515,9 @@ static const struct mm_walk_ops thp_split_walk_ops = {
static inline void thp_split_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
for_each_vma(vmi, vma) {
vma->vm_flags &= ~VM_HUGEPAGE;
vma->vm_flags |= VM_NOHUGEPAGE;
walk_page_vma(vma, &thp_split_walk_ops, NULL);
@ -2584,8 +2585,9 @@ int gmap_mark_unmergeable(void)
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int ret;
VMA_ITERATOR(vmi, mm, 0);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
for_each_vma(vmi, vma) {
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
MADV_UNMERGEABLE, &vma->vm_flags);
if (ret)


@ -237,16 +237,6 @@ int pud_huge(pud_t pud)
return pud_large(pud);
}
struct page *
follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags)
{
if (flags & FOLL_GET)
return NULL;
return pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
}
bool __init arch_hugetlb_valid_size(unsigned long size)
{
if (MACHINE_HAS_EDAT1 && size == PMD_SIZE)


@ -8,7 +8,7 @@ CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_CPU_SUBTYPE_SH7724=y
CONFIG_FORCE_MAX_ZONEORDER=12
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_MEMORY_SIZE=0x10000000
CONFIG_FLATMEM_MANUAL=y
CONFIG_SH_ECOVEC=y


@ -16,7 +16,6 @@ CONFIG_CPUSETS=y
# CONFIG_PROC_PID_CPUSET is not set
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_MEMCG=y
CONFIG_CGROUP_MEMCG_SWAP=y
CONFIG_CGROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y


@ -14,7 +14,6 @@ CONFIG_CPUSETS=y
# CONFIG_PROC_PID_CPUSET is not set
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_MEMCG=y
CONFIG_CGROUP_MEMCG_SWAP=y
CONFIG_CGROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_DEV_INITRD=y


@ -18,7 +18,7 @@ config PAGE_OFFSET
default "0x80000000" if MMU
default "0x00000000"
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 9 64 if PAGE_SIZE_16KB
default "9" if PAGE_SIZE_16KB


@ -269,7 +269,7 @@ config ARCH_SPARSEMEM_ENABLE
config ARCH_SPARSEMEM_DEFAULT
def_bool y if SPARC64
config FORCE_MAX_ZONEORDER
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "13"
help


@ -584,21 +584,19 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
void flush_tlb_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma = mm->mmap;
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
while (vma != NULL) {
for_each_vma(vmi, vma)
fix_range(mm, vma->vm_start, vma->vm_end, 0);
vma = vma->vm_next;
}
}
void force_flush_all(void)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = mm->mmap;
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
while (vma != NULL) {
for_each_vma(vmi, vma)
fix_range(mm, vma->vm_start, vma->vm_end, 1);
vma = vma->vm_next;
}
}


@ -85,6 +85,7 @@ config X86
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
select ARCH_HAS_COPY_MC if X86_64
select ARCH_HAS_SET_MEMORY
@ -130,7 +131,9 @@ config X86
select CLKEVT_I8253
select CLOCKSOURCE_VALIDATE_LAST_CYCLE
select CLOCKSOURCE_WATCHDOG
select DCACHE_WORD_ACCESS
# Word-size accesses may read uninitialized data past the trailing \0
# in strings and cause false KMSAN reports.
select DCACHE_WORD_ACCESS if !KMSAN
select DYNAMIC_SIGFRAME
select EDAC_ATOMIC_SCRUB
select EDAC_SUPPORT
@ -168,6 +171,7 @@ config X86
select HAVE_ARCH_KASAN if X86_64
select HAVE_ARCH_KASAN_VMALLOC if X86_64
select HAVE_ARCH_KFENCE
select HAVE_ARCH_KMSAN if X86_64
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
@ -328,6 +332,10 @@ config GENERIC_ISA_DMA
def_bool y
depends on ISA_DMA_API
config GENERIC_CSUM
bool
default y if KMSAN || KASAN
config GENERIC_BUG
def_bool y
depends on BUG


@ -12,6 +12,7 @@
# Sanitizer runtimes are unavailable and cannot be linked for early boot code.
KASAN_SANITIZE := n
KCSAN_SANITIZE := n
KMSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y
# Kernel does not boot with kcov instrumentation here.


@ -20,6 +20,7 @@
# Sanitizer runtimes are unavailable and cannot be linked for early boot code.
KASAN_SANITIZE := n
KCSAN_SANITIZE := n
KMSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y
# Prevents link failures: __sanitizer_cov_trace_pc() is not linked in.


@ -11,6 +11,9 @@ include $(srctree)/lib/vdso/Makefile
# Sanitizer runtimes are unavailable and cannot be linked here.
KASAN_SANITIZE := n
KMSAN_SANITIZE_vclock_gettime.o := n
KMSAN_SANITIZE_vgetcpu.o := n
UBSAN_SANITIZE := n
KCSAN_SANITIZE := n
OBJECT_FILES_NON_STANDARD := y


@ -127,17 +127,17 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
{
struct mm_struct *mm = task->mm;
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
mmap_read_lock(mm);
for (vma = mm->mmap; vma; vma = vma->vm_next) {
for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (vma_is_special_mapping(vma, &vvar_mapping))
zap_page_range(vma, vma->vm_start, size);
}
mmap_read_unlock(mm);
return 0;
}
#else
@ -354,6 +354,7 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);
mmap_write_lock(mm);
/*
@ -363,7 +364,7 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr)
* We could search vma near context.vdso, but it's a slowpath,
* so let's explicitly check all VMAs to be completely sure.
*/
for (vma = mm->mmap; vma; vma = vma->vm_next) {
for_each_vma(vmi, vma) {
if (vma_is_special_mapping(vma, &vdso_mapping) ||
vma_is_special_mapping(vma, &vvar_mapping)) {
mmap_write_unlock(mm);


@ -1,9 +1,13 @@
/* SPDX-License-Identifier: GPL-2.0 */
#define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
#define HAVE_CSUM_COPY_USER
#define _HAVE_ARCH_CSUM_AND_COPY
#ifdef CONFIG_X86_32
# include <asm/checksum_32.h>
#ifdef CONFIG_GENERIC_CSUM
# include <asm-generic/checksum.h>
#else
# define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
# define HAVE_CSUM_COPY_USER
# define _HAVE_ARCH_CSUM_AND_COPY
# ifdef CONFIG_X86_32
# include <asm/checksum_32.h>
# else
# include <asm/checksum_64.h>
# endif
#endif


@ -0,0 +1,87 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
* x86 KMSAN support.
*
* Copyright (C) 2022, Google LLC
* Author: Alexander Potapenko <glider@google.com>
*/
#ifndef _ASM_X86_KMSAN_H
#define _ASM_X86_KMSAN_H
#ifndef MODULE
#include <asm/cpu_entry_area.h>
#include <asm/processor.h>
#include <linux/mmzone.h>
DECLARE_PER_CPU(char[CPU_ENTRY_AREA_SIZE], cpu_entry_area_shadow);
DECLARE_PER_CPU(char[CPU_ENTRY_AREA_SIZE], cpu_entry_area_origin);
/*
* Functions below are declared in the header to make sure they are inlined.
* They all are called from kmsan_get_metadata() for every memory access in
* the kernel, so speed is important here.
*/
/*
* Compute metadata addresses for the CPU entry area on x86.
*/
static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin)
{
unsigned long addr64 = (unsigned long)addr;
char *metadata_array;
unsigned long off;
int cpu;
if ((addr64 < CPU_ENTRY_AREA_BASE) ||
(addr64 >= (CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE)))
return NULL;
cpu = (addr64 - CPU_ENTRY_AREA_BASE) / CPU_ENTRY_AREA_SIZE;
off = addr64 - (unsigned long)get_cpu_entry_area(cpu);
if ((off < 0) || (off >= CPU_ENTRY_AREA_SIZE))
return NULL;
metadata_array = is_origin ? cpu_entry_area_origin :
cpu_entry_area_shadow;
return &per_cpu(metadata_array[off], cpu);
}
/*
* Taken from arch/x86/mm/physaddr.h to avoid using an instrumented version.
*/
static inline bool kmsan_phys_addr_valid(unsigned long addr)
{
if (IS_ENABLED(CONFIG_PHYS_ADDR_T_64BIT))
return !(addr >> boot_cpu_data.x86_phys_bits);
else
return true;
}
/*
* Taken from arch/x86/mm/physaddr.c to avoid using an instrumented version.
*/
static inline bool kmsan_virt_addr_valid(void *addr)
{
unsigned long x = (unsigned long)addr;
unsigned long y = x - __START_KERNEL_map;
/* use the carry flag to determine if x was < __START_KERNEL_map */
if (unlikely(x > y)) {
x = y + phys_base;
if (y >= KERNEL_IMAGE_SIZE)
return false;
} else {
x = y + (__START_KERNEL_map - PAGE_OFFSET);
/* carry flag will be set if starting x was >= PAGE_OFFSET */
if ((x > y) || !kmsan_phys_addr_valid(x))
return false;
}
return pfn_valid(x >> PAGE_SHIFT);
}
#endif /* !MODULE */
#endif /* _ASM_X86_KMSAN_H */


@ -8,6 +8,8 @@
#include <asm/cpufeatures.h>
#include <asm/alternative.h>
#include <linux/kmsan-checks.h>
/* duplicated to the one in bootmem.h */
extern unsigned long max_pfn;
extern unsigned long phys_base;
@ -47,6 +49,11 @@ void clear_page_erms(void *page);
static inline void clear_page(void *page)
{
/*
* Clean up KMSAN metadata for the page being cleared. The assembly call
* below clobbers @page, so we perform unpoisoning before it.
*/
kmsan_unpoison_memory(page, PAGE_SIZE);
alternative_call_2(clear_page_orig,
clear_page_rep, X86_FEATURE_REP_GOOD,
clear_page_erms, X86_FEATURE_ERMS,


@ -256,10 +256,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
/* We always extract/encode the offset by shifting it all the way up, and then down again */
#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
#define __swp_type(x) (((x).val) & 0x1f)
#define __swp_offset(x) ((x).val >> 5)
#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
#define __swp_type(x) (((x).val) & ((1UL << SWP_TYPE_BITS) - 1))
#define __swp_offset(x) ((x).val >> SWP_TYPE_BITS)
#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << SWP_TYPE_BITS})
/*
* Normally, __swp_entry() converts from arch-independent swp_entry_t to


@ -815,7 +815,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
static inline int pmd_bad(pmd_t pmd)
{
return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
(_KERNPG_TABLE & ~_PAGE_ACCESSED);
}
static inline unsigned long pages_to_mb(unsigned long npg)
@ -1431,10 +1432,10 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
}
#define arch_faults_on_old_pte arch_faults_on_old_pte
static inline bool arch_faults_on_old_pte(void)
#define arch_has_hw_pte_young arch_has_hw_pte_young
static inline bool arch_has_hw_pte_young(void)
{
return false;
return true;
}
#ifdef CONFIG_PAGE_TABLE_CHECK


@ -139,7 +139,52 @@ extern unsigned int ptrs_per_p4d;
# define VMEMMAP_START __VMEMMAP_BASE_L4
#endif /* CONFIG_DYNAMIC_MEMORY_LAYOUT */
#define VMALLOC_END (VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)
/*
* End of the region for which vmalloc page tables are pre-allocated.
* For non-KMSAN builds, this is the same as VMALLOC_END.
* For KMSAN builds, VMALLOC_START..VMEMORY_END is 4 times bigger than
* VMALLOC_START..VMALLOC_END (see below).
*/
#define VMEMORY_END (VMALLOC_START + (VMALLOC_SIZE_TB << 40) - 1)
#ifndef CONFIG_KMSAN
#define VMALLOC_END VMEMORY_END
#else
/*
* In KMSAN builds vmalloc area is four times smaller, and the remaining 3/4
* are used to keep the metadata for virtual pages. The memory formerly
* belonging to vmalloc area is now laid out as follows:
*
* 1st quarter: VMALLOC_START to VMALLOC_END - new vmalloc area
* 2nd quarter: KMSAN_VMALLOC_SHADOW_START to
* VMALLOC_END+KMSAN_VMALLOC_SHADOW_OFFSET - vmalloc area shadow
* 3rd quarter: KMSAN_VMALLOC_ORIGIN_START to
* VMALLOC_END+KMSAN_VMALLOC_ORIGIN_OFFSET - vmalloc area origins
* 4th quarter: KMSAN_MODULES_SHADOW_START to KMSAN_MODULES_ORIGIN_START
* - shadow for modules,
* KMSAN_MODULES_ORIGIN_START to
* KMSAN_MODULES_ORIGIN_START + MODULES_LEN - origins for modules.
*/
#define VMALLOC_QUARTER_SIZE ((VMALLOC_SIZE_TB << 40) >> 2)
#define VMALLOC_END (VMALLOC_START + VMALLOC_QUARTER_SIZE - 1)
/*
* vmalloc metadata addresses are calculated by adding shadow/origin offsets
* to vmalloc address.
*/
#define KMSAN_VMALLOC_SHADOW_OFFSET VMALLOC_QUARTER_SIZE
#define KMSAN_VMALLOC_ORIGIN_OFFSET (VMALLOC_QUARTER_SIZE << 1)
#define KMSAN_VMALLOC_SHADOW_START (VMALLOC_START + KMSAN_VMALLOC_SHADOW_OFFSET)
#define KMSAN_VMALLOC_ORIGIN_START (VMALLOC_START + KMSAN_VMALLOC_ORIGIN_OFFSET)
/*
* The shadow/origin for modules are placed one by one in the last 1/4 of
* vmalloc space.
*/
#define KMSAN_MODULES_SHADOW_START (VMALLOC_END + KMSAN_VMALLOC_ORIGIN_OFFSET + 1)
#define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN)
#endif /* CONFIG_KMSAN */
#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
/* The module sections ends with the start of the fixmap */


@ -2,6 +2,8 @@
#ifndef _ASM_X86_SPARSEMEM_H
#define _ASM_X86_SPARSEMEM_H
#include <linux/types.h>
#ifdef CONFIG_SPARSEMEM
/*
* generic non-linear memory support:


@ -11,11 +11,23 @@
function. */
#define __HAVE_ARCH_MEMCPY 1
#if defined(__SANITIZE_MEMORY__)
#undef memcpy
void *__msan_memcpy(void *dst, const void *src, size_t size);
#define memcpy __msan_memcpy
#else
extern void *memcpy(void *to, const void *from, size_t len);
#endif
extern void *__memcpy(void *to, const void *from, size_t len);
#define __HAVE_ARCH_MEMSET
#if defined(__SANITIZE_MEMORY__)
extern void *__msan_memset(void *s, int c, size_t n);
#undef memset
#define memset __msan_memset
#else
void *memset(void *s, int c, size_t n);
#endif
void *__memset(void *s, int c, size_t n);
#define __HAVE_ARCH_MEMSET16
@ -55,7 +67,13 @@ static inline void *memset64(uint64_t *s, uint64_t v, size_t n)
}
#define __HAVE_ARCH_MEMMOVE
#if defined(__SANITIZE_MEMORY__)
#undef memmove
void *__msan_memmove(void *dest, const void *src, size_t len);
#define memmove __msan_memmove
#else
void *memmove(void *dest, const void *src, size_t count);
#endif
void *__memmove(void *dest, const void *src, size_t count);
int memcmp(const void *cs, const void *ct, size_t count);
@ -64,8 +82,7 @@ char *strcpy(char *dest, const char *src);
char *strcat(char *dest, const char *src);
int strcmp(const char *cs, const char *ct);
#if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
#if (defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__))
/*
* For files that not instrumented (e.g. mm/slub.c) we
* should use not instrumented version of mem* functions.
@ -73,7 +90,9 @@ int strcmp(const char *cs, const char *ct);
#undef memcpy
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#undef memmove
#define memmove(dst, src, len) __memmove(dst, src, len)
#undef memset
#define memset(s, c, n) __memset(s, c, n)
#ifndef __NO_FORTIFY


@ -5,6 +5,7 @@
* User space memory access functions
*/
#include <linux/compiler.h>
#include <linux/instrumented.h>
#include <linux/kasan-checks.h>
#include <linux/string.h>
#include <asm/asm.h>
@ -103,6 +104,7 @@ extern int __get_user_bad(void);
: "=a" (__ret_gu), "=r" (__val_gu), \
ASM_CALL_CONSTRAINT \
: "0" (ptr), "i" (sizeof(*(ptr)))); \
instrument_get_user(__val_gu); \
(x) = (__force __typeof__(*(ptr))) __val_gu; \
__builtin_expect(__ret_gu, 0); \
})
@ -192,9 +194,11 @@ extern void __put_user_nocheck_8(void);
int __ret_pu; \
void __user *__ptr_pu; \
register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX); \
__chk_user_ptr(ptr); \
__ptr_pu = (ptr); \
__val_pu = (x); \
__typeof__(*(ptr)) __x = (x); /* eval x once */ \
__typeof__(ptr) __ptr = (ptr); /* eval ptr once */ \
__chk_user_ptr(__ptr); \
__ptr_pu = __ptr; \
__val_pu = __x; \
asm volatile("call __" #fn "_%P[size]" \
: "=c" (__ret_pu), \
ASM_CALL_CONSTRAINT \
@ -202,6 +206,7 @@ extern void __put_user_nocheck_8(void);
"r" (__val_pu), \
[size] "i" (sizeof(*(ptr))) \
:"ebx"); \
instrument_put_user(__x, __ptr, sizeof(*(ptr))); \
__builtin_expect(__ret_pu, 0); \
})
@ -248,23 +253,25 @@ extern void __put_user_nocheck_8(void);
#define __put_user_size(x, ptr, size, label) \
do { \
__typeof__(*(ptr)) __x = (x); /* eval x once */ \
__chk_user_ptr(ptr); \
switch (size) { \
case 1: \
__put_user_goto(x, ptr, "b", "iq", label); \
__put_user_goto(__x, ptr, "b", "iq", label); \
break; \
case 2: \
__put_user_goto(x, ptr, "w", "ir", label); \
__put_user_goto(__x, ptr, "w", "ir", label); \
break; \
case 4: \
__put_user_goto(x, ptr, "l", "ir", label); \
__put_user_goto(__x, ptr, "l", "ir", label); \
break; \
case 8: \
__put_user_goto_u64(x, ptr, label); \
__put_user_goto_u64(__x, ptr, label); \
break; \
default: \
__put_user_bad(); \
} \
instrument_put_user(__x, ptr, size); \
} while (0)
#ifdef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
@ -305,6 +312,7 @@ do { \
default: \
(x) = __get_user_bad(); \
} \
instrument_get_user(x); \
} while (0)
#define __get_user_asm(x, addr, itype, ltype, label) \


@ -29,6 +29,8 @@ KASAN_SANITIZE_sev.o := n
# With some compiler versions the generated code results in boot hangs, caused
# by several compilation units. To be safe, disable all instrumentation.
KCSAN_SANITIZE := n
KMSAN_SANITIZE_head$(BITS).o := n
KMSAN_SANITIZE_nmi.o := n
# If instrumentation of this dir is enabled, boot hangs during first second.
# Probably could be more selective here, but note that files related to irqs,


@ -12,6 +12,7 @@ endif
# If these files are instrumented, boot hangs during the first second.
KCOV_INSTRUMENT_common.o := n
KCOV_INSTRUMENT_perf_event.o := n
KMSAN_SANITIZE_common.o := n
# As above, instrumenting secondary CPU boot code causes boot hangs.
KCSAN_SANITIZE_common.o := n


@ -177,6 +177,12 @@ static void show_regs_if_on_stack(struct stack_info *info, struct pt_regs *regs,
}
}
/*
* This function reads pointers from the stack and dereferences them. The
* pointers may not have their KMSAN shadow set up properly, which may result
* in false positive reports. Disable instrumentation to avoid those.
*/
__no_kmsan_checks
static void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
unsigned long *stack, const char *log_lvl)
{


@ -553,6 +553,7 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp, bool x32)
* Kprobes not supported here. Set the probe on schedule instead.
* Function graph tracer not supported too.
*/
__no_kmsan_checks
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{


@ -95,7 +95,7 @@ void __init tboot_probe(void)
static pgd_t *tboot_pg_dir;
static struct mm_struct tboot_mm = {
.mm_rb = RB_ROOT,
.mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, tboot_mm.mmap_lock),
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),


@ -183,6 +183,16 @@ static struct pt_regs *decode_frame_pointer(unsigned long *bp)
}
#endif
/*
* While walking the stack, KMSAN may stomp on stale locals from other
* functions that were marked as uninitialized upon function exit, and
* now hold the call frame information for the current function (e.g. the frame
* pointer). Because KMSAN does not specifically mark call frames as
* initialized, false positive reports are possible. To prevent such reports,
* we mark the functions scanning the stack (here and below) with
* __no_kmsan_checks.
*/
__no_kmsan_checks
static bool update_stack_state(struct unwind_state *state,
unsigned long *next_bp)
{
@ -250,6 +260,7 @@ static bool update_stack_state(struct unwind_state *state,
return true;
}
__no_kmsan_checks
bool unwind_next_frame(struct unwind_state *state)
{
struct pt_regs *regs;


@ -65,7 +65,9 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
endif
else
obj-y += iomap_copy_64.o
ifneq ($(CONFIG_GENERIC_CSUM),y)
lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
endif
lib-y += clear_page_64.o copy_page_64.o
lib-y += memmove_64.o memset_64.o
lib-y += copy_user_64.o


@ -1,6 +1,7 @@
#include <linux/string.h>
#include <linux/module.h>
#include <linux/io.h>
#include <linux/kmsan-checks.h>
#define movs(type,to,from) \
asm volatile("movs" type:"=&D" (to), "=&S" (from):"0" (to), "1" (from):"memory")
@ -37,6 +38,8 @@ static void string_memcpy_fromio(void *to, const volatile void __iomem *from, si
n-=2;
}
rep_movs(to, (const void *)from, n);
/* KMSAN must treat values read from devices as initialized. */
kmsan_unpoison_memory(to, n);
}
static void string_memcpy_toio(volatile void __iomem *to, const void *from, size_t n)
@ -44,6 +47,8 @@ static void string_memcpy_toio(volatile void __iomem *to, const void *from, size
if (unlikely(!n))
return;
/* Make sure uninitialized memory isn't copied to devices. */
kmsan_check_memory(from, n);
/* Align any unaligned destination IO */
if (unlikely(1 & (unsigned long)to)) {
movs("b", to, from);


@ -14,6 +14,8 @@ KASAN_SANITIZE_pgprot.o := n
# Disable KCSAN entirely, because otherwise we get warnings that some functions
# reference __initdata sections.
KCSAN_SANITIZE := n
# Avoid recursion by not calling KMSAN hooks for CEA code.
KMSAN_SANITIZE_cpu_entry_area.o := n
ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_mem_encrypt.o = -pg
@ -44,6 +46,9 @@ obj-$(CONFIG_HIGHMEM) += highmem_32.o
KASAN_SANITIZE_kasan_init_$(BITS).o := n
obj-$(CONFIG_KASAN) += kasan_init_$(BITS).o
KMSAN_SANITIZE_kmsan_shadow.o := n
obj-$(CONFIG_KMSAN) += kmsan_shadow.o
obj-$(CONFIG_MMIOTRACE) += mmiotrace.o
mmiotrace-y := kmmio.o pf_in.o mmio-mod.o
obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o


@ -260,7 +260,7 @@ static noinline int vmalloc_fault(unsigned long address)
}
NOKPROBE_SYMBOL(vmalloc_fault);
void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
static void __arch_sync_kernel_mappings(unsigned long start, unsigned long end)
{
unsigned long addr;
@ -284,6 +284,27 @@ void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
}
}
void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
{
__arch_sync_kernel_mappings(start, end);
#ifdef CONFIG_KMSAN
/*
* KMSAN maintains two additional metadata page mappings for the
* [VMALLOC_START, VMALLOC_END) range. These mappings start at
* KMSAN_VMALLOC_SHADOW_START and KMSAN_VMALLOC_ORIGIN_START and
* have to be synced together with the vmalloc memory mapping.
*/
if (start >= VMALLOC_START && end < VMALLOC_END) {
__arch_sync_kernel_mappings(
start - VMALLOC_START + KMSAN_VMALLOC_SHADOW_START,
end - VMALLOC_START + KMSAN_VMALLOC_SHADOW_START);
__arch_sync_kernel_mappings(
start - VMALLOC_START + KMSAN_VMALLOC_ORIGIN_START,
end - VMALLOC_START + KMSAN_VMALLOC_ORIGIN_START);
}
#endif
}
static bool low_pfn(unsigned long pfn)
{
return pfn < max_low_pfn;


@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
}
#ifdef CONFIG_SWAP
unsigned long max_swapfile_size(void)
unsigned long arch_max_swapfile_size(void)
{
unsigned long pages;


@ -1288,7 +1288,7 @@ static void __init preallocate_vmalloc_pages(void)
unsigned long addr;
const char *lvl;
for (addr = VMALLOC_START; addr <= VMALLOC_END; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
for (addr = VMALLOC_START; addr <= VMEMORY_END; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
pgd_t *pgd = pgd_offset_k(addr);
p4d_t *p4d;
pud_t *pud;


@ -17,6 +17,7 @@
#include <linux/cc_platform.h>
#include <linux/efi.h>
#include <linux/pgtable.h>
#include <linux/kmsan.h>
#include <asm/set_memory.h>
#include <asm/e820/api.h>
@ -479,6 +480,8 @@ void iounmap(volatile void __iomem *addr)
return;
}
kmsan_iounmap_page_range((unsigned long)addr,
(unsigned long)addr + get_vm_area_size(p));
memtype_free(p->phys_addr, p->phys_addr + get_vm_area_size(p));
/* Finally remove it */

Some files were not shown because too many files have changed in this diff.