Mirror of https://github.com/torvalds/linux.git, synced 2024-11-24 21:21:41 +00:00
Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which
  are included in this merge do the following:

   - Peng Zhang has done some mapletree maintenance work in the series
     'maple_tree: add mt_free_one() and mt_attr() helpers' and 'Some
     cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem',
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few
     fixes) in the patch series 'Add folio_zero_tail() and
     folio_fill_tail()', 'Make folio_start_writeback return void', 'Fix
     fault handler's handling of poisoned tail pages', 'Convert
     aops->error_remove_page to ->error_remove_folio', 'Finish two
     folio conversions' and 'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series
     'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the
     series 'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces', Andrey
     Konovalov permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve calculations'.

   - Dmitry Rokosov has added to the samples/ directory some sample
     code for a userspace memcg event listener application. See the
     series 'samples: introduce cgroup events listeners'.

   - Some mapletree maintenance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series 'mm/damon: let users feed and tame/auto-tune DAMOS',
     'selftests/damon: add Python-written DAMON functionality tests'
     and 'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory', Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against the buffer_head code in the series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY's alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series
     'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's
     scanning aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the
     writeback code, both code and within filesystems. The series is
     'Clean up the writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups,
     more pte batching, folio conversions and more. See the series
     'mm/rmap: interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code
     cleanups in the series 'Remove some lruvec page accounting
     functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
commit fb46e22a9e
@@ -25,12 +25,14 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
stops, respectively. Reading the file returns the keywords
based on the current status. Writing 'commit' to this file
makes the kdamond reads the user inputs in the sysfs files
except 'state' again. Writing 'update_schemes_stats' to the
file updates contents of schemes stats files of the kdamond.
Writing 'update_schemes_tried_regions' to the file updates
contents of 'tried_regions' directory of every scheme directory
of this kdamond. Writing 'update_schemes_tried_bytes' to the
file updates only '.../tried_regions/total_bytes' files of this
except 'state' again. Writing 'commit_schemes_quota_goals' to
this file makes the kdamond reads the quota goal files again.
Writing 'update_schemes_stats' to the file updates contents of
schemes stats files of the kdamond. Writing
'update_schemes_tried_regions' to the file updates contents of
'tried_regions' directory of every scheme directory of this
kdamond. Writing 'update_schemes_tried_bytes' to the file
updates only '.../tried_regions/total_bytes' files of this
kdamond. Writing 'clear_schemes_tried_regions' to the file
removes contents of the 'tried_regions' directory.
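For illustration, the keywords this hunk documents are all written to the same per-kdamond ``state`` file; assuming a kdamond numbered 0 has already been configured, a session could look like::

    cd /sys/kernel/mm/damon/admin/kdamonds/0
    echo commit > state                      # re-read every input file except 'state'
    echo commit_schemes_quota_goals > state  # re-read only the quota goal files
    echo update_schemes_stats > state        # refresh the per-scheme stats files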
@@ -212,6 +214,25 @@ Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the quotas
charge reset interval of the scheme in milliseconds.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/nr_goals
Date: Nov 2023
Contact: SeongJae Park <sj@kernel.org>
Description: Writing a number 'N' to this file creates the number of
directories for setting automatic tuning of the scheme's
aggressiveness named '0' to 'N-1' under the goals/ directory.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_value
Date: Nov 2023
Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the target
value of the goal metric.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/current_value
Date: Nov 2023
Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the current
value of the goal metric.

What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil
Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
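As a quick, hypothetical walk-through of the new files (kdamond, context, scheme and goal numbers are only examples)::

    G=/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas/goals
    echo 1 > $G/nr_goals            # creates goal directory '0'
    echo 10000 > $G/0/target_value  # desired value of the chosen metric
    echo 4000 > $G/0/current_value  # most recently measured value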
@@ -328,7 +328,7 @@ as idle::

From now on, any pages on zram are idle pages. The idle mark
will be removed until someone requests access of the block.
IOW, unless there is access request, those pages are still idle pages.
Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled pages can be
Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be
marked as idle based on how long (in seconds) it's been since they were
last accessed::
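A sketch of that usage, assuming the zram device's ``idle`` attribute accepts an age in seconds as described above (device name illustrative)::

    # mark as idle every page not accessed within the last 300 seconds;
    # after this change the feature depends on CONFIG_ZRAM_TRACK_ENTRY_ACTIME
    echo 300 > /sys/block/zram0/idle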
@@ -1693,6 +1693,21 @@ PAGE_SIZE multiple when read back.
limit, it will refuse to take any more stores before existing
entries fault back in or are written out to disk.

memory.zswap.writeback
A read-write single value file. The default value is "1". The
initial value of the root cgroup is 1, and when a new cgroup is
created, it inherits the current value of its parent.

When this is set to 0, all swapping attempts to swapping devices
are disabled. This included both zswap writebacks, and swapping due
to zswap store failures. If the zswap store failures are recurring
(for e.g if the pages are incompressible), users can observe
reclaim inefficiency after disabling writeback (because the same
pages might be rejected again and again).

Note that this is subtly different from setting memory.swap.max to
0, as it still allows for pages to be written to the zswap pool.

memory.pressure
A read-only nested-keyed file.
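For example (the cgroup name is illustrative; the same knob is shown again in the zswap documentation hunk later in this diff)::

    # disable zswap writeback and the swap fallback on store failure for one cgroup
    echo 0 > /sys/fs/cgroup/workload/memory.zswap.writeback
    cat /sys/fs/cgroup/workload/memory.zswap.writeback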
@@ -172,7 +172,7 @@ variables.

Offset of the free_list's member. This value is used to compute the number
of free pages.

Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
Each zone has a free_area structure array called free_area[NR_PAGE_ORDERS].
The free_list represents a linked list of free page blocks.

(list_head, next|prev)

@@ -189,11 +189,11 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
information. Makedumpfile gets the start address of the vmalloc region
from this.

(zone.free_area, MAX_ORDER + 1)
-------------------------------
(zone.free_area, NR_PAGE_ORDERS)
--------------------------------

Free areas descriptor. User-space tools use this value to iterate the
free_area ranges. MAX_ORDER is used by the zone buddy allocator.
free_area ranges. NR_PAGE_ORDERS is used by the zone buddy allocator.

prb
---
@@ -970,17 +970,17 @@
buddy allocator. Bigger value increase the probability
of catching random memory corruption, but reduce the
amount of memory for normal system use. The maximum
possible value is MAX_ORDER/2. Setting this parameter
to 1 or 2 should be enough to identify most random
memory corruption problems caused by bugs in kernel or
driver code when a CPU writes to (or reads from) a
random memory location. Note that there exists a class
of memory corruptions problems caused by buggy H/W or
F/W or by drivers badly programming DMA (basically when
memory is written at bus level and the CPU MMU is
bypassed) which are not detectable by
CONFIG_DEBUG_PAGEALLOC, hence this option will not help
tracking down these problems.
possible value is MAX_PAGE_ORDER/2. Setting this
parameter to 1 or 2 should be enough to identify most
random memory corruption problems caused by bugs in
kernel or driver code when a CPU writes to (or reads
from) a random memory location. Note that there exists
a class of memory corruptions problems caused by buggy
H/W or F/W or by drivers badly programming DMA
(basically when memory is written at bus level and the
CPU MMU is bypassed) which are not detectable by
CONFIG_DEBUG_PAGEALLOC, hence this option will not
help tracking down these problems.

debug_pagealloc=
[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter

@@ -4136,7 +4136,7 @@
[KNL] Minimal page reporting order
Format: <integer>
Adjust the minimal page reporting order. The page
reporting is disabled when it exceeds MAX_ORDER.
reporting is disabled when it exceeds MAX_PAGE_ORDER.

panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
@@ -59,41 +59,47 @@ Files Hierarchy

The files hierarchy of DAMON sysfs interface is shown below. In the below
figure, parents-children relations are represented with indentations, each
directory is having ``/`` suffix, and files in each directory are separated by
comma (","). ::
comma (",").

/sys/kernel/mm/damon/admin
│ kdamonds/nr_kdamonds
│ │ 0/state,pid
│ │ │ contexts/nr_contexts
│ │ │ │ 0/avail_operations,operations
│ │ │ │ │ monitoring_attrs/
.. parsed-literal::

:ref:`/sys/kernel/mm/damon <sysfs_root>`/admin
│ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds
│ │ :ref:`0 <sysfs_kdamond>`/state,pid
│ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts
│ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
│ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
│ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
│ │ │ │ │ │ nr_regions/min,max
│ │ │ │ │ targets/nr_targets
│ │ │ │ │ │ 0/pid_target
│ │ │ │ │ │ │ regions/nr_regions
│ │ │ │ │ │ │ │ 0/start,end
│ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
│ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
│ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions
│ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ │ schemes/nr_schemes
│ │ │ │ │ │ 0/action,apply_interval_us
│ │ │ │ │ │ │ access_pattern/
│ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
│ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
│ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
│ │ │ │ │ │ │ │ age/min,max
│ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
│ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ filters/nr_filters
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
│ │ │ │ │ │ │ │ │ 0/target_value,current_value
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
│ │ │ │ │ │ │ │ 0/type,matching,memcg_id
│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
│ │ │ │ │ │ │ tried_regions/total_bytes
│ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
│ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ ...
│ │ ...

.. _sysfs_root:

Root
----

@@ -102,6 +108,8 @@ has one directory named ``admin``. The directory contains the files for
privileged user space programs' control of DAMON. User space tools or daemons
having the root permission could use this directory.

.. _sysfs_kdamonds:

kdamonds/
---------

@@ -113,6 +121,8 @@ details) exists. In the beginning, this directory has only one file,
child directories named ``0`` to ``N-1``. Each directory represents each
kdamond.

.. _sysfs_kdamond:

kdamonds/<N>/
-------------
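To make the layout above concrete, a minimal sketch (root privileges assumed)::

    echo 1 > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
    ls /sys/kernel/mm/damon/admin/kdamonds/0
    # expected entries per the text above: contexts  pid  state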
@@ -120,29 +130,37 @@ In each kdamond directory, two files (``state`` and ``pid``) and one directory
(``contexts``) exist.

Reading ``state`` returns ``on`` if the kdamond is currently running, or
``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be
in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
user inputs in the sysfs files except ``state`` file again. Writing
``update_schemes_stats`` to ``state`` file updates the contents of stats files
for each DAMON-based operation scheme of the kdamond. For details of the
stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
``off`` if it is not running.

Writing ``update_schemes_tried_regions`` to ``state`` file updates the
DAMON-based operation scheme action tried regions directory for each
DAMON-based operation scheme of the kdamond. Writing
``update_schemes_tried_bytes`` to ``state`` file updates only
``.../tried_regions/total_bytes`` files. Writing
``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
operating scheme action tried regions directory for each DAMON-based operation
scheme of the kdamond. For details of the DAMON-based operation scheme action
tried regions directory, please refer to :ref:`tried_regions section
<sysfs_schemes_tried_regions>`.
Users can write below commands for the kdamond to the ``state`` file.

- ``on``: Start running.
- ``off``: Stop running.
- ``commit``: Read the user inputs in the sysfs files except ``state`` file
again.
- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
:ref:`quota goals <sysfs_schemes_quota_goals>`.
- ``update_schemes_stats``: Update the contents of stats files for each
DAMON-based operation scheme of the kdamond. For details of the stats,
please refer to :ref:`stats section <sysfs_schemes_stats>`.
- ``update_schemes_tried_regions``: Update the DAMON-based operation scheme
action tried regions directory for each DAMON-based operation scheme of the
kdamond. For details of the DAMON-based operation scheme action tried
regions directory, please refer to
:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
- ``update_schemes_tried_bytes``: Update only ``.../tried_regions/total_bytes``
files.
- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme
action tried regions directory for each DAMON-based operation scheme of the
kdamond.

If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.

``contexts`` directory contains files for controlling the monitoring contexts
that this kdamond will execute.

.. _sysfs_contexts:

kdamonds/<N>/contexts/
----------------------

@@ -153,7 +171,7 @@ number (``N``) to the file creates the number of child directories named as
details). At the moment, only one context per kdamond is supported, so only
``0`` or ``1`` can be written to the file.

.. _sysfs_contexts:
.. _sysfs_context:

contexts/<N>/
-------------

@@ -203,6 +221,8 @@ writing to and rading from the files.

For more details about the intervals and monitoring regions range, please refer
to the Design document (:doc:`/mm/damon/design`).

.. _sysfs_targets:

contexts/<N>/targets/
---------------------

@@ -210,6 +230,8 @@ In the beginning, this directory has only one file, ``nr_targets``. Writing a
number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each monitoring target.

.. _sysfs_target:

targets/<N>/
------------

@@ -244,6 +266,8 @@ In the beginning, this directory has only one file, ``nr_regions``. Writing a
number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each initial monitoring target region.

.. _sysfs_region:

regions/<N>/
------------

@@ -254,6 +278,8 @@ region by writing to and reading from the files, respectively.

Each region should not overlap with others. ``end`` of directory ``N`` should
be equal or smaller than ``start`` of directory ``N+1``.

.. _sysfs_schemes:

contexts/<N>/schemes/
---------------------

@@ -265,6 +291,8 @@ In the beginning, this directory has only one file, ``nr_schemes``. Writing a
number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each DAMON-based operation scheme.

.. _sysfs_scheme:

schemes/<N>/
------------

@@ -277,7 +305,7 @@ The ``action`` file is for setting and getting the scheme's :ref:`action
from the file and their meaning are as below.

Note that support of each action depends on the running DAMON operations set
:ref:`implementation <sysfs_contexts>`.
:ref:`implementation <sysfs_context>`.

- ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
Supported by ``vaddr`` and ``fvaddr`` operations set.

@@ -299,6 +327,8 @@ Note that support of each action depends on the running DAMON operations set
The ``apply_interval_us`` file is for setting and getting the scheme's
:ref:`apply_interval <damon_design_damos>` in microseconds.

.. _sysfs_access_pattern:

schemes/<N>/access_pattern/
---------------------------

@@ -312,6 +342,8 @@ to and reading from the ``min`` and ``max`` files under ``sz``,
``nr_accesses``, and ``age`` directories, respectively. Note that the ``min``
and the ``max`` form a closed interval.

.. _sysfs_quotas:

schemes/<N>/quotas/
-------------------

@@ -319,8 +351,7 @@ The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
DAMON-based operation scheme.

Under ``quotas`` directory, three files (``ms``, ``bytes``,
``reset_interval_ms``) and one directory (``weights``) having three files
(``sz_permil``, ``nr_accesses_permil``, and ``age_permil``) in it exist.
``reset_interval_ms``) and two directores (``weights`` and ``goals``) exist.

You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
``reset interval`` in milliseconds by writing the values to the three files,

@@ -330,11 +361,37 @@ apply the action to only up to ``bytes`` bytes of memory regions within the
``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the
quota limits.

You can also set the :ref:`prioritization weights
Under ``weights`` directory, three files (``sz_permil``,
``nr_accesses_permil``, and ``age_permil``) exist.
You can set the :ref:`prioritization weights
<damon_design_damos_quotas_prioritization>` for size, access frequency, and age
in per-thousand unit by writing the values to the three files under the
``weights`` directory.

.. _sysfs_schemes_quota_goals:

schemes/<N>/quotas/goals/
-------------------------

The directory for the :ref:`automatic quota tuning goals
<damon_design_damos_quotas_auto_tuning>` of the given DAMON-based operation
scheme.

In the beginning, this directory has only one file, ``nr_goals``. Writing a
number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each goal and current achievement.
Among the multiple feedback, the best one is used.

Each goal directory contains two files, namely ``target_value`` and
``current_value``. Users can set and get any number to those files to set the
feedback. User space main workload's latency or throughput, system metrics
like free memory ratio or memory pressure stall time (PSI) could be example
metrics for the values. Note that users should write
``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
directory <sysfs_kdamond>` to pass the feedback to DAMON.
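Putting the goals files and the ``commit_schemes_quota_goals`` keyword together, a hypothetical feedback loop could look like the following (paths, values and the ``measure_my_metric`` helper are illustrative only)::

    S=/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0
    echo 1 > $S/quotas/goals/nr_goals
    echo 100 > $S/quotas/goals/0/target_value
    while true; do
        # measure_my_metric stands in for however the workload's metric is obtained
        echo "$(measure_my_metric)" > $S/quotas/goals/0/current_value
        echo commit_schemes_quota_goals > /sys/kernel/mm/damon/admin/kdamonds/0/state
        sleep 60
    done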
.. _sysfs_watermarks:

schemes/<N>/watermarks/
-----------------------

@@ -354,6 +411,8 @@ as below.

The ``interval`` should written in microseconds unit.

.. _sysfs_filters:

schemes/<N>/filters/
--------------------

@@ -394,7 +453,7 @@ pages of all memory cgroups except ``/having_care_already``.::

echo N > 1/matching

Note that ``anon`` and ``memcg`` filters are currently supported only when
``paddr`` :ref:`implementation <sysfs_contexts>` is being used.
``paddr`` :ref:`implementation <sysfs_context>` is being used.

Also, memory regions that are filtered out by ``addr`` or ``target`` filters
are not counted as the scheme has tried to those, while regions that filtered

@@ -449,6 +508,8 @@ and query-like efficient data access monitoring results retrievals. For the
latter use case, in particular, users can set the ``action`` as ``stat`` and
set the ``access pattern`` as their interested pattern that they want to query.

.. _sysfs_schemes_tried_region:

tried_regions/<N>/
------------------

@@ -80,6 +80,9 @@ pages_to_scan
how many pages to scan before ksmd goes to sleep
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.

The pages_to_scan value cannot be changed if ``advisor_mode`` has
been set to scan-time.

Default: 100 (chosen for demonstration purposes)

sleep_millisecs

@@ -164,6 +167,29 @@ smart_scan
optimization is enabled. The ``pages_skipped`` metric shows how
effective the setting is.

advisor_mode
The ``advisor_mode`` selects the current advisor. Two modes are
supported: none and scan-time. The default is none. By setting
``advisor_mode`` to scan-time, the scan time advisor is enabled.
The section about ``advisor`` explains in detail how the scan time
advisor works.

adivsor_max_cpu
specifies the upper limit of the cpu percent usage of the ksmd
background thread. The default is 70.

advisor_target_scan_time
specifies the target scan time in seconds to scan all the candidate
pages. The default value is 200 seconds.

advisor_min_pages_to_scan
specifies the lower limit of the ``pages_to_scan`` parameter of the
scan time advisor. The default is 500.

adivsor_max_pages_to_scan
specifies the upper limit of the ``pages_to_scan`` parameter of the
scan time advisor. The default is 30000.

The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

general_profit

@@ -263,6 +289,35 @@ ksm_swpin_copy
note that KSM page might be copied when swapping in because do_swap_page()
cannot do all the locking needed to reconstitute a cross-anon_vma KSM page.

Advisor
=======

The number of candidate pages for KSM is dynamic. It can be often observed
that during the startup of an application more candidate pages need to be
processed. Without an advisor the ``pages_to_scan`` parameter needs to be
sized for the maximum number of candidate pages. The scan time advisor can
changes the ``pages_to_scan`` parameter based on demand.

The advisor can be enabled, so KSM can automatically adapt to changes in the
number of candidate pages to scan. Two advisors are implemented: none and
scan-time. With none, no advisor is enabled. The default is none.

The scan time advisor changes the ``pages_to_scan`` parameter based on the
observed scan times. The possible values for the ``pages_to_scan`` parameter is
limited by the ``advisor_max_cpu`` parameter. In addition there is also the
``advisor_target_scan_time`` parameter. This parameter sets the target time to
scan all the KSM candidate pages. The parameter ``advisor_target_scan_time``
decides how aggressive the scan time advisor scans candidate pages. Lower
values make the scan time advisor to scan more aggresively. This is the most
important parameter for the configuration of the scan time advisor.

The initial value and the maximum value can be changed with
``advisor_min_pages_to_scan`` and ``advisor_max_pages_to_scan``. The default
values are sufficient for most workloads and use cases.

The ``pages_to_scan`` parameter is re-calculated after a scan has been completed.

--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
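A minimal sketch of enabling the advisor described above, using only knobs named in these hunks (the target value is just an example)::

    cd /sys/kernel/mm/ksm
    echo scan-time > advisor_mode        # enable the scan time advisor
    echo 100 > advisor_target_scan_time  # aim to scan all candidate pages in ~100s
    echo none > advisor_mode             # back to the default (advisor disabled)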
@@ -253,6 +253,7 @@ Following flags about pages are currently supported:
- ``PAGE_IS_SWAPPED`` - Page is in swapped
- ``PAGE_IS_PFNZERO`` - Page has zero PFN
- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty

The ``struct pm_scan_arg`` is used as the argument of the IOCTL.

@@ -45,10 +45,25 @@ components:
the two is using hugepages just because of the fact the TLB miss is
going to run faster.

Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc). These THPs continue to be
PTE-mapped, but in many cases can still provide similar benefits to
those outlined above: Page faults are significantly reduced (by a
factor of e.g. 4, 8, 16, etc), but latency spikes are much less
prominent because the size of each page isn't as huge as the PMD-sized
variant and there is less memory to clear in each page fault. Some
architectures also employ TLB compression mechanisms to squeeze more
entries in when a set of PTEs are virtually and physically contiguous
and approporiately aligned. In this case, TLB misses will occur less
often.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.
collapses sequences of basic pages into PMD-sized huge pages.

The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.

@@ -95,12 +110,40 @@ Global THP controls
Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of::
system wide. This can be achieved per-supported-THP-size with one of::

echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

where <size> is the hugepage size being addressed, the available sizes
for which vary by system.

For example::

echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

Alternatively it is possible to specify that a given hugepage size
will inherit the top-level "enabled" value::

echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

For example::

echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

The top-level setting (for use with "inherit") can be set by issuing
one of the following commands::

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled

By default, PMD-sized hugepages have enabled="inherit" and all other
hugepage sizes have enabled="never". If enabling multiple hugepage
sizes, the kernel will select the most appropriate enabled size for a
given allocation.

It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fallback to regular
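As an aside, the set of per-size directories (and thus which sizes a given system offers) can be listed directly; the exact output depends on the architecture and base page size::

    ls -d /sys/kernel/mm/transparent_hugepage/hugepages-*kB
    cat /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
    # the currently selected value is shown in brackets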
@@ -146,25 +189,34 @@ madvise
never
should be self-explanatory.

By default kernel tries to use huge zero page on read page fault to
anonymous mapping. It's possible to disable huge zero page by writing 0
or enable it back by writing 1::
By default kernel tries to use huge, PMD-mappable zero page on read
page fault to anonymous mapping. It's possible to disable huge zero
page by writing 0 or enable it back by writing 1::

echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory allocation
library) may want to know the size (in bytes) of a transparent hugepage::
Some userspace (such as a test program, or an optimized memory
allocation library) may want to know the size (in bytes) of a
PMD-mappable transparent hugepage::

cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise, and it'll
be automatically shutdown if it's set to "never".
khugepaged will be automatically started when one or more hugepage
sizes are enabled (either by directly setting "always" or "madvise",
or by setting "inherit" while the top-level enabled is set to "always"
or "madvise"), and it'll be automatically shutdown when the last
hugepage size is disabled (either by directly setting "never", or by
setting "inherit" while the top-level enabled is set to "never").

Khugepaged controls
-------------------

.. note::
khugepaged currently only searches for opportunities to collapse to
PMD-sized THP and no attempt is made to collapse to other THP
sizes.

khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's

@@ -282,19 +334,26 @@ force
Need of application restart
===========================

The transparent_hugepage/enabled values and tmpfs mount option only affect
future behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to the
regions registered in khugepaged.
The transparent_hugepage/enabled and
transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
option only affect future behavior. So to make them effective you need
to restart any application that could have been using hugepages. This
also applies to the regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
.. note::
Currently the below counters only record events relating to
PMD-sized THP. Events relating to other THP sizes are not included.

The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using anonymous transparent huge pages,
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
for each mapping.
To identify what applications are using PMD-sized anonymous transparent huge
pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping. (Note that AnonHugePages only applies to traditional
PMD-sized THP for historical reasons and should have been called
AnonHugePmdMapped).

The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
|
||||
Optimizing the applications
|
||||
===========================
|
||||
|
||||
To be guaranteed that the kernel will map a 2M page immediately in any
|
||||
To be guaranteed that the kernel will map a THP immediately in any
|
||||
memory region, the mmap region has to be hugepage naturally
|
||||
aligned. posix_memalign() can provide that guarantee.
|
||||
|
||||
|
@ -113,6 +113,9 @@ events, except page fault notifications, may be generated:
|
||||
areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
|
||||
support for shmem virtual memory areas.
|
||||
|
||||
- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an
|
||||
existing page contents from userspace.
|
||||
|
||||
The userland application should set the feature flags it intends to use
|
||||
when invoking the ``UFFDIO_API`` ioctl, to request that those features be
|
||||
enabled if supported.
|
||||
|
@ -153,6 +153,26 @@ attribute, e. g.::
|
||||
|
||||
Setting this parameter to 100 will disable the hysteresis.
|
||||
|
||||
Some users cannot tolerate the swapping that comes with zswap store failures
|
||||
and zswap writebacks. Swapping can be disabled entirely (without disabling
|
||||
zswap itself) on a cgroup-basis as follows:
|
||||
|
||||
echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
|
||||
|
||||
Note that if the store failures are recurring (for e.g if the pages are
|
||||
incompressible), users can observe reclaim inefficiency after disabling
|
||||
writeback (because the same pages might be rejected again and again).
|
||||
|
||||
When there is a sizable amount of cold memory residing in the zswap pool, it
|
||||
can be advantageous to proactively write these cold pages to swap and reclaim
|
||||
the memory for other use cases. By default, the zswap shrinker is disabled.
|
||||
User can enable it as follows:
|
||||
|
||||
echo Y > /sys/module/zswap/parameters/shrinker_enabled
|
||||
|
||||
This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
|
||||
selected.
|
||||
|
||||
A debugfs interface is provided for various statistic about pool size, number
|
||||
of pages stored, same-value filled pages and various counters for the reasons
|
||||
pages are rejected.
|
||||
|
@ -81,6 +81,9 @@ section.
|
||||
Sometimes it is necessary to ensure the next call to store to a maple tree does
|
||||
not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case.
|
||||
|
||||
You can use mtree_dup() to duplicate an entire maple tree. It is a more
|
||||
efficient way than inserting all elements one by one into a new tree.
|
||||
|
||||
Finally, you can remove all entries from a maple tree by calling
|
||||
mtree_destroy(). If the maple tree entries are pointers, you may wish to free
|
||||
the entries first.
|
||||
@ -112,6 +115,7 @@ Takes ma_lock internally:
|
||||
* mtree_insert()
|
||||
* mtree_insert_range()
|
||||
* mtree_erase()
|
||||
* mtree_dup()
|
||||
* mtree_destroy()
|
||||
* mt_set_in_rcu()
|
||||
* mt_clear_in_rcu()
|
||||
|
@ -261,7 +261,7 @@ prototypes::
|
||||
struct folio *src, enum migrate_mode);
|
||||
int (*launder_folio)(struct folio *);
|
||||
bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
|
||||
int (*error_remove_page)(struct address_space *, struct page *);
|
||||
int (*error_remove_folio)(struct address_space *, struct folio *);
|
||||
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
|
||||
int (*swap_deactivate)(struct file *);
|
||||
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
|
||||
@ -287,7 +287,7 @@ direct_IO:
|
||||
migrate_folio: yes (both)
|
||||
launder_folio: yes
|
||||
is_partially_uptodate: yes
|
||||
error_remove_page: yes
|
||||
error_remove_folio: yes
|
||||
swap_activate: no
|
||||
swap_deactivate: no
|
||||
swap_rw: yes, unlocks
|
||||
|
@ -528,9 +528,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
|
||||
does not take into account swapped out page of underlying shmem objects.
|
||||
"Locked" indicates whether the mapping is locked in memory or not.
|
||||
|
||||
"THPeligible" indicates whether the mapping is eligible for allocating THP
|
||||
pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
|
||||
It just shows the current status.
|
||||
"THPeligible" indicates whether the mapping is eligible for allocating
|
||||
naturally aligned THP pages of any currently enabled size. 1 if true, 0
|
||||
otherwise.
|
||||
|
||||
"VmFlags" field deserves a separate description. This member represents the
|
||||
kernel flags associated with the particular virtual memory area in two letter
|
||||
|
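A quick way to see how many of a process's mappings currently report each value of this field (PID illustrative)::

    grep THPeligible /proc/1234/smaps | sort | uniq -c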
@@ -823,7 +823,7 @@ cache in your filesystem. The following members are defined:
bool (*is_partially_uptodate) (struct folio *, size_t from,
size_t count);
void (*is_dirty_writeback)(struct folio *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page);
int (*error_remove_folio)(struct mapping *mapping, struct folio *);
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);

@@ -1034,8 +1034,8 @@ cache in your filesystem. The following members are defined:
VM if a folio should be treated as dirty or writeback for the
purposes of stalling.

``error_remove_page``
normally set to generic_error_remove_page if truncation is ok
``error_remove_folio``
normally set to generic_error_remove_folio if truncation is ok
for this address space. Used for memory failure handling.
Setting this implies you deal with pages going away under you,
unless you have them locked or reference counts increased.

@@ -18,8 +18,6 @@ PTE Page Table Helpers
+---------------------------+--------------------------------------------------+
| pte_same | Tests whether both PTE entries are the same |
+---------------------------+--------------------------------------------------+
| pte_bad | Tests a non-table mapped PTE |
+---------------------------+--------------------------------------------------+
| pte_present | Tests a valid mapped PTE |
+---------------------------+--------------------------------------------------+
| pte_young | Tests a young PTE |

@@ -5,6 +5,18 @@ Design
======

.. _damon_design_execution_model_and_data_structures:

Execution Model and Data Structures
===================================

The monitoring-related information including the monitoring request
specification and DAMON-based operation schemes are stored in a data structure
called DAMON ``context``. DAMON executes each context with a kernel thread
called ``kdamond``. Multiple kdamonds could run in parallel, for different
types of monitoring.

Overall Architecture
====================

@@ -346,6 +358,19 @@ the weight will be respected are up to the underlying prioritization mechanism
implementation.

.. _damon_design_damos_quotas_auto_tuning:

Aim-oriented Feedback-driven Auto-tuning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Automatic feedback-driven quota tuning. Instead of setting the absolute quota
value, users can repeatedly provide numbers representing how much of their goal
for the scheme is achieved as feedback. DAMOS then automatically tunes the
aggressiveness (the quota) of the corresponding scheme. For example, if DAMOS
is under achieving the goal, DAMOS automatically increases the quota. If DAMOS
is over achieving the goal, it decreases the quota.

.. _damon_design_damos_watermarks:

Watermarks

@@ -477,15 +502,3 @@ modules for proactive reclamation and LRU lists manipulation are provided. For
more detail, please read the usage documents for those
(:doc:`/admin-guide/mm/damon/reclaim` and
:doc:`/admin-guide/mm/damon/lru_sort`).

.. _damon_design_execution_model_and_data_structures:

Execution Model and Data Structures
===================================

The monitoring-related information including the monitoring request
specification and DAMON-based operation schemes are stored in a data structure
called DAMON ``context``. DAMON executes each context with a kernel thread
called ``kdamond``. Multiple kdamonds could run in parallel, for different
types of monitoring.

@@ -117,7 +117,7 @@ pages:

- map/unmap of a PMD entry for the whole THP increment/decrement
folio->_entire_mapcount and also increment/decrement
folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount
folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount
goes from -1 to 0 or 0 to -1.

- map/unmap of individual pages with PTE entry increment/decrement

@@ -156,7 +156,7 @@ Partial unmap and deferred_split_folio()

Unmapping part of THP (with munmap() or other way) is not going to free
memory immediately. Instead, we detect that a subpage of THP is not in use
in page_remove_rmap() and queue the THP for splitting if memory pressure
in folio_remove_rmap_*() and queue the THP for splitting if memory pressure
comes. Splitting will free up unused subpages.

Splitting the page right away is not an option due to locking context in

@@ -486,7 +486,7 @@ munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.

For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
For each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).

@@ -511,7 +511,7 @@ userspace; truncation even unmaps and deletes any private anonymous pages
which had been Copied-On-Write from the file pages now being truncated.

Mlocked pages can be munlocked and deleted in this way: like with munmap(),
for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
for each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).

@@ -263,20 +263,20 @@ the name indicates, this function allocates pages of memory, and the second
argument is "order" or a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely the limit can be calculated as::
region allocated by __get_free_pages is determined by the MAX_PAGE_ORDER macro.
More precisely the limit can be calculated as::

PAGE_SIZE << MAX_ORDER
PAGE_SIZE << MAX_PAGE_ORDER

In a i386 architecture PAGE_SIZE is 4096 bytes
In a 2.4/i386 kernel MAX_ORDER is 10
In a 2.6/i386 kernel MAX_ORDER is 11
In a 2.4/i386 kernel MAX_PAGE_ORDER is 10
In a 2.6/i386 kernel MAX_PAGE_ORDER is 11

So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_PAGE_ORDER declarations.

The pagesize can also be determined dynamically with the getpagesize (2)
system call.

@@ -324,7 +324,7 @@ Definitions:
(see /proc/slabinfo)
<pointer size> depends on the architecture -- ``sizeof(void *)``
<page size> depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order> is the value defined with MAX_ORDER
<max-order> is the value defined with MAX_PAGE_ORDER
<frame size> it's an upper bound of frame's capture size (more on this later)
============== ================================================================
@ -5339,6 +5339,7 @@ L: linux-mm@kvack.org
|
||||
S: Maintained
|
||||
F: mm/memcontrol.c
|
||||
F: mm/swap_cgroup.c
|
||||
F: samples/cgroup/*
|
||||
F: tools/testing/selftests/cgroup/memcg_protection.m
|
||||
F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
|
||||
F: tools/testing/selftests/cgroup/test_kmem.c
|
||||
|
@ -1470,6 +1470,14 @@ config DYNAMIC_SIGFRAME
|
||||
config HAVE_ARCH_NODE_DEV_GROUP
|
||||
bool
|
||||
|
||||
config ARCH_HAS_HW_PTE_YOUNG
|
||||
bool
|
||||
help
|
||||
Architectures that select this option are capable of setting the
|
||||
accessed bit in PTE entries when using them as part of linear address
|
||||
translations. Architectures that require runtime check should select
|
||||
this option and override arch_has_hw_pte_young().
|
||||
|
||||
config ARCH_HAS_NONLEAF_PMD_YOUNG
|
||||
bool
|
||||
help
|
||||
|
@ -1362,7 +1362,7 @@ config ARCH_FORCE_MAX_ORDER
|
||||
default "10"
|
||||
help
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
contiguous allocations. The limit is called MAX_PAGE_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
|
@ -36,6 +36,7 @@ config ARM64
|
||||
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
|
||||
select ARCH_HAS_PTE_DEVMAP
|
||||
select ARCH_HAS_PTE_SPECIAL
|
||||
select ARCH_HAS_HW_PTE_YOUNG
|
||||
select ARCH_HAS_SETUP_DMA_OPS
|
||||
select ARCH_HAS_SET_DIRECT_MAP
|
||||
select ARCH_HAS_SET_MEMORY
|
||||
@ -1519,15 +1520,15 @@ config XEN
|
||||
|
||||
# include/linux/mmzone.h requires the following to be true:
|
||||
#
|
||||
# MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
|
||||
# MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
|
||||
#
|
||||
# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
|
||||
# so the maximum value of MAX_PAGE_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
|
||||
#
|
||||
# | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_ORDER | default MAX_ORDER |
|
||||
# ----+-------------------+--------------+-----------------+--------------------+
|
||||
# 4K | 27 | 12 | 15 | 10 |
|
||||
# 16K | 27 | 14 | 13 | 11 |
|
||||
# 64K | 29 | 16 | 13 | 13 |
|
||||
# | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_PAGE_ORDER | default MAX_PAGE_ORDER |
|
||||
# ----+-------------------+--------------+----------------------+-------------------------+
|
||||
# 4K | 27 | 12 | 15 | 10 |
|
||||
# 16K | 27 | 14 | 13 | 11 |
|
||||
# 64K | 29 | 16 | 13 | 13 |
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int
|
||||
default "13" if ARM64_64K_PAGES
|
||||
@ -1535,16 +1536,16 @@ config ARCH_FORCE_MAX_ORDER
|
||||
default "10"
|
||||
help
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
contiguous allocations. The limit is called MAX_PAGE_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
large blocks of physically contiguous memory is required.
|
||||
|
||||
The maximal size of allocation cannot exceed the size of the
|
||||
section, so the value of MAX_ORDER should satisfy
|
||||
section, so the value of MAX_PAGE_ORDER should satisfy
|
||||
|
||||
MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
|
||||
MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
|
||||
|
||||
Don't change if unsure.
|
||||
|
||||
|
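A hedged aside, not part of the patch: the constraint in the help text above can be spelled out as a compile-time check. The numbers come from the 4K-page row of the table (SECTION_SIZE_BITS = 27, PAGE_SHIFT = 12, default order 10) and are illustrative only::

    #include <assert.h>

    #define EXAMPLE_SECTION_SIZE_BITS 27    /* 4K-page row of the table above */
    #define EXAMPLE_PAGE_SHIFT        12
    #define EXAMPLE_MAX_PAGE_ORDER    10    /* the default in that row */

    /* mirrors: MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS */
    static_assert(EXAMPLE_MAX_PAGE_ORDER + EXAMPLE_PAGE_SHIFT <= EXAMPLE_SECTION_SIZE_BITS,
                  "MAX_PAGE_ORDER + PAGE_SHIFT must not exceed SECTION_SIZE_BITS");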
@ -15,29 +15,9 @@
|
||||
|
||||
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
|
||||
|
||||
void kasan_init(void);
|
||||
|
||||
/*
|
||||
* KASAN_SHADOW_START: beginning of the kernel virtual addresses.
|
||||
* KASAN_SHADOW_END: KASAN_SHADOW_START + 1/N of kernel virtual addresses,
|
||||
* where N = (1 << KASAN_SHADOW_SCALE_SHIFT).
|
||||
*
|
||||
* KASAN_SHADOW_OFFSET:
|
||||
* This value is used to map an address to the corresponding shadow
|
||||
* address by the following formula:
|
||||
* shadow_addr = (address >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
|
||||
*
|
||||
* (1 << (64 - KASAN_SHADOW_SCALE_SHIFT)) shadow addresses that lie in range
|
||||
* [KASAN_SHADOW_OFFSET, KASAN_SHADOW_END) cover all 64-bits of virtual
|
||||
* addresses. So KASAN_SHADOW_OFFSET should satisfy the following equation:
|
||||
* KASAN_SHADOW_OFFSET = KASAN_SHADOW_END -
|
||||
* (1ULL << (64 - KASAN_SHADOW_SCALE_SHIFT))
|
||||
*/
|
||||
#define _KASAN_SHADOW_START(va) (KASAN_SHADOW_END - (1UL << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
|
||||
#define KASAN_SHADOW_START _KASAN_SHADOW_START(vabits_actual)
|
||||
|
||||
void kasan_copy_shadow(pgd_t *pgdir);
|
||||
asmlinkage void kasan_early_init(void);
|
||||
void kasan_init(void);
|
||||
void kasan_copy_shadow(pgd_t *pgdir);
|
||||
|
||||
#else
|
||||
static inline void kasan_init(void) { }
|
||||
|
@ -65,15 +65,41 @@
|
||||
#define KERNEL_END _end
|
||||
|
||||
/*
|
||||
* Generic and tag-based KASAN require 1/8th and 1/16th of the kernel virtual
|
||||
* address space for the shadow region respectively. They can bloat the stack
|
||||
* significantly, so double the (minimum) stack size when they are in use.
|
||||
* Generic and Software Tag-Based KASAN modes require 1/8th and 1/16th of the
|
||||
* kernel virtual address space for storing the shadow memory respectively.
|
||||
*
|
||||
* The mapping between a virtual memory address and its corresponding shadow
|
||||
* memory address is defined based on the formula:
|
||||
*
|
||||
* shadow_addr = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
|
||||
*
|
||||
* where KASAN_SHADOW_SCALE_SHIFT is the order of the number of bits that map
|
||||
* to a single shadow byte and KASAN_SHADOW_OFFSET is a constant that offsets
|
||||
* the mapping. Note that KASAN_SHADOW_OFFSET does not point to the start of
|
||||
* the shadow memory region.
|
||||
*
|
||||
* Based on this mapping, we define two constants:
|
||||
*
|
||||
* KASAN_SHADOW_START: the start of the shadow memory region;
|
||||
* KASAN_SHADOW_END: the end of the shadow memory region.
|
||||
*
|
||||
* KASAN_SHADOW_END is defined first as the shadow address that corresponds to
|
||||
* the upper bound of possible virtual kernel memory addresses UL(1) << 64
|
||||
* according to the mapping formula.
|
||||
*
|
||||
* KASAN_SHADOW_START is defined second based on KASAN_SHADOW_END. The shadow
|
||||
* memory start must map to the lowest possible kernel virtual memory address
|
||||
* and thus it depends on the actual bitness of the address space.
|
||||
*
|
||||
* As KASAN inserts redzones between stack variables, this increases the stack
|
||||
* memory usage significantly. Thus, we double the (minimum) stack size.
|
||||
*/
|
||||
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
|
||||
#define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
|
||||
#define KASAN_SHADOW_END ((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) \
|
||||
+ KASAN_SHADOW_OFFSET)
|
||||
#define PAGE_END (KASAN_SHADOW_END - (1UL << (vabits_actual - KASAN_SHADOW_SCALE_SHIFT)))
|
||||
#define KASAN_SHADOW_END ((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) + KASAN_SHADOW_OFFSET)
|
||||
#define _KASAN_SHADOW_START(va) (KASAN_SHADOW_END - (UL(1) << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
|
||||
#define KASAN_SHADOW_START _KASAN_SHADOW_START(vabits_actual)
|
||||
#define PAGE_END KASAN_SHADOW_START
|
||||
#define KASAN_THREAD_SHIFT 1
|
||||
#else
|
||||
#define KASAN_THREAD_SHIFT 0
|
||||
|
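A hedged illustration, not part of the patch: the shadow mapping formula from the comment above, evaluated for an arbitrary address. The scale shift of 3 (generic KASAN) and the shadow offset value are assumptions made for the example, not values taken from a real config::

    #include <stdio.h>
    #include <stdint.h>

    #define EXAMPLE_SCALE_SHIFT   3                       /* generic KASAN: 8 bytes per shadow byte */
    #define EXAMPLE_SHADOW_OFFSET 0xdfff800000000000ULL   /* assumed CONFIG_KASAN_SHADOW_OFFSET */

    /* shadow_addr = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET */
    static uint64_t shadow_addr(uint64_t addr)
    {
            return (addr >> EXAMPLE_SCALE_SHIFT) + EXAMPLE_SHADOW_OFFSET;
    }

    int main(void)
    {
            uint64_t addr = 0xffff000012345678ULL;  /* arbitrary kernel VA, illustrative only */

            printf("addr   = 0x%llx\n", (unsigned long long)addr);
            printf("shadow = 0x%llx\n", (unsigned long long)shadow_addr(addr));
            return 0;
    }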
@ -10,7 +10,7 @@
|
||||
/*
|
||||
* Section size must be at least 512MB for 64K base
|
||||
* page size config. Otherwise it will be less than
|
||||
* MAX_ORDER and the build process will fail.
|
||||
* MAX_PAGE_ORDER and the build process will fail.
|
||||
*/
|
||||
#ifdef CONFIG_ARM64_64K_PAGES
|
||||
#define SECTION_SIZE_BITS 29
|
||||
|
@ -16,7 +16,7 @@ struct hyp_pool {
|
||||
* API at EL2.
|
||||
*/
|
||||
hyp_spinlock_t lock;
|
||||
struct list_head free_area[MAX_ORDER + 1];
|
||||
struct list_head free_area[NR_PAGE_ORDERS];
|
||||
phys_addr_t range_start;
|
||||
phys_addr_t range_end;
|
||||
unsigned short max_order;
|
||||
|
@ -228,7 +228,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
|
||||
int i;
|
||||
|
||||
hyp_spin_lock_init(&pool->lock);
|
||||
pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
|
||||
pool->max_order = min(MAX_PAGE_ORDER,
|
||||
get_order(nr_pages << PAGE_SHIFT));
|
||||
for (i = 0; i <= pool->max_order; i++)
|
||||
INIT_LIST_HEAD(&pool->free_area[i]);
|
||||
pool->range_start = phys;
|
||||
|
@ -51,7 +51,7 @@ void __init arm64_hugetlb_cma_reserve(void)
|
||||
* page allocator. Just warn if there is any change
|
||||
* breaking this assumption.
|
||||
*/
|
||||
WARN_ON(order <= MAX_ORDER);
|
||||
WARN_ON(order <= MAX_PAGE_ORDER);
|
||||
hugetlb_cma_reserve(order);
|
||||
}
|
||||
#endif /* CONFIG_CMA */
|
||||
|
@ -170,6 +170,11 @@ asmlinkage void __init kasan_early_init(void)
|
||||
{
|
||||
BUILD_BUG_ON(KASAN_SHADOW_OFFSET !=
|
||||
KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT)));
|
||||
/*
|
||||
* We cannot check the actual value of KASAN_SHADOW_START during build,
|
||||
* as it depends on vabits_actual. As a best-effort approach, check
|
||||
* potential values calculated based on VA_BITS and VA_BITS_MIN.
|
||||
*/
|
||||
BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS), PGDIR_SIZE));
|
||||
BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS_MIN), PGDIR_SIZE));
|
||||
BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE));
|
||||
|
@ -523,6 +523,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
|
||||
return pmd;
|
||||
}
|
||||
|
||||
#define pmd_dirty pmd_dirty
|
||||
static inline int pmd_dirty(pmd_t pmd)
|
||||
{
|
||||
return !!(pmd_val(pmd) & (_PAGE_DIRTY | _PAGE_MODIFIED));
|
||||
|
@ -226,32 +226,6 @@ static void __init node_mem_init(unsigned int node)
|
||||
|
||||
#ifdef CONFIG_ACPI_NUMA
|
||||
|
||||
/*
|
||||
* Sanity check to catch more bad NUMA configurations (they are amazingly
|
||||
* common). Make sure the nodes cover all memory.
|
||||
*/
|
||||
static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
|
||||
{
|
||||
int i;
|
||||
u64 numaram, biosram;
|
||||
|
||||
numaram = 0;
|
||||
for (i = 0; i < mi->nr_blks; i++) {
|
||||
u64 s = mi->blk[i].start >> PAGE_SHIFT;
|
||||
u64 e = mi->blk[i].end >> PAGE_SHIFT;
|
||||
|
||||
numaram += e - s;
|
||||
numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
|
||||
if ((s64)numaram < 0)
|
||||
numaram = 0;
|
||||
}
|
||||
max_pfn = max_low_pfn;
|
||||
biosram = max_pfn - absent_pages_in_range(0, max_pfn);
|
||||
|
||||
BUG_ON((s64)(biosram - numaram) >= (1 << (20 - PAGE_SHIFT)));
|
||||
return true;
|
||||
}
|
||||
|
||||
static void __init add_node_intersection(u32 node, u64 start, u64 size, u32 type)
|
||||
{
|
||||
static unsigned long num_physpages;
|
||||
@ -396,7 +370,7 @@ int __init init_numa_memory(void)
|
||||
return -EINVAL;
|
||||
|
||||
init_node_memblock();
|
||||
if (numa_meminfo_cover_memory(&numa_meminfo) == false)
|
||||
if (!memblock_validate_numa_coverage(SZ_1M))
|
||||
return -EINVAL;
|
||||
|
||||
for_each_node_mask(node, node_possible_map) {
|
||||
|
@ -402,7 +402,7 @@ config ARCH_FORCE_MAX_ORDER
|
||||
default "10"
|
||||
help
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
contiguous allocations. The limit is called MAX_PAGE_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
|
@ -655,6 +655,7 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
|
||||
return pmd;
|
||||
}
|
||||
|
||||
#define pmd_dirty pmd_dirty
|
||||
static inline int pmd_dirty(pmd_t pmd)
|
||||
{
|
||||
return !!(pmd_val(pmd) & _PAGE_MODIFIED);
|
||||
|
@ -50,7 +50,7 @@ config ARCH_FORCE_MAX_ORDER
|
||||
default "10"
|
||||
help
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
contiguous allocations. The limit is called MAX_PAGE_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
|
@ -916,7 +916,7 @@ config ARCH_FORCE_MAX_ORDER
|
||||
default "10"
|
||||
help
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
contiguous allocations. The limit is called MAX_PAGE_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
|
@ -97,7 +97,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
|
||||
}
|
||||
|
||||
mmap_read_lock(mm);
|
||||
chunk = (1UL << (PAGE_SHIFT + MAX_ORDER)) /
|
||||
chunk = (1UL << (PAGE_SHIFT + MAX_PAGE_ORDER)) /
|
||||
sizeof(struct vm_area_struct *);
|
||||
chunk = min(chunk, entries);
|
||||
for (entry = 0; entry < entries; entry += chunk) {
|
||||
|
@ -615,7 +615,7 @@ void __init gigantic_hugetlb_cma_reserve(void)
|
||||
order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT;
|
||||
|
||||
if (order) {
|
||||
VM_WARN_ON(order <= MAX_ORDER);
|
||||
VM_WARN_ON(order <= MAX_PAGE_ORDER);
|
||||
hugetlb_cma_reserve(order);
|
||||
}
|
||||
}
|
||||
|
@ -1389,7 +1389,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
|
||||
* DMA window can be larger than available memory, which will
|
||||
* cause errors later.
|
||||
*/
|
||||
const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER);
|
||||
const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_PAGE_ORDER);
|
||||
|
||||
/*
|
||||
* We create the default window as big as we can. The constraint is
|
||||
|
@ -673,6 +673,7 @@ static inline int pmd_write(pmd_t pmd)
|
||||
return pte_write(pmd_pte(pmd));
|
||||
}
|
||||
|
||||
#define pmd_dirty pmd_dirty
|
||||
static inline int pmd_dirty(pmd_t pmd)
|
||||
{
|
||||
return pte_dirty(pmd_pte(pmd));
|
||||
|
@ -770,6 +770,7 @@ static inline int pud_write(pud_t pud)
|
||||
return (pud_val(pud) & _REGION3_ENTRY_WRITE) != 0;
|
||||
}
|
||||
|
||||
#define pmd_dirty pmd_dirty
|
||||
static inline int pmd_dirty(pmd_t pmd)
|
||||
{
|
||||
return (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY) != 0;
|
||||
|
@ -26,7 +26,7 @@ config ARCH_FORCE_MAX_ORDER
|
||||
default "10"
|
||||
help
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
contiguous allocations. The limit is called MAX_PAGE_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
|
@ -277,7 +277,7 @@ config ARCH_FORCE_MAX_ORDER
|
||||
default "12"
|
||||
help
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
contiguous allocations. The limit is called MAX_PAGE_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
|
@ -706,6 +706,7 @@ static inline unsigned long pmd_write(pmd_t pmd)
|
||||
#define pud_write(pud) pte_write(__pte(pud_val(pud)))
|
||||
|
||||
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
||||
#define pmd_dirty pmd_dirty
|
||||
static inline unsigned long pmd_dirty(pmd_t pmd)
|
||||
{
|
||||
pte_t pte = __pte(pmd_val(pmd));
|
||||
|
@ -194,7 +194,7 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
|
||||
|
||||
size = IO_PAGE_ALIGN(size);
|
||||
order = get_order(size);
|
||||
if (unlikely(order > MAX_ORDER))
|
||||
if (unlikely(order > MAX_PAGE_ORDER))
|
||||
return NULL;
|
||||
|
||||
npages = size >> IO_PAGE_SHIFT;
|
||||
|
@ -897,7 +897,7 @@ void __init cheetah_ecache_flush_init(void)
|
||||
|
||||
/* Now allocate error trap reporting scoreboard. */
|
||||
sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info));
|
||||
for (order = 0; order <= MAX_ORDER; order++) {
|
||||
for (order = 0; order < NR_PAGE_ORDERS; order++) {
|
||||
if ((PAGE_SIZE << order) >= sz)
|
||||
break;
|
||||
}
|
||||
|
@ -402,8 +402,8 @@ void tsb_grow(struct mm_struct *mm, unsigned long tsb_index, unsigned long rss)
|
||||
unsigned long new_rss_limit;
|
||||
gfp_t gfp_flags;
|
||||
|
||||
if (max_tsb_size > PAGE_SIZE << MAX_ORDER)
|
||||
max_tsb_size = PAGE_SIZE << MAX_ORDER;
|
||||
if (max_tsb_size > PAGE_SIZE << MAX_PAGE_ORDER)
|
||||
max_tsb_size = PAGE_SIZE << MAX_PAGE_ORDER;
|
||||
|
||||
new_cache_index = 0;
|
||||
for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) {
|
||||
|
@ -373,10 +373,10 @@ int __init linux_main(int argc, char **argv)
|
||||
max_physmem = TASK_SIZE - uml_physmem - iomem_size - MIN_VMALLOC;
|
||||
|
||||
/*
|
||||
* Zones have to begin on a 1 << MAX_ORDER page boundary,
|
||||
* Zones have to begin on a 1 << MAX_PAGE_ORDER page boundary,
|
||||
* so this makes sure that's true for highmem
|
||||
*/
|
||||
max_physmem &= ~((1 << (PAGE_SHIFT + MAX_ORDER)) - 1);
|
||||
max_physmem &= ~((1 << (PAGE_SHIFT + MAX_PAGE_ORDER)) - 1);
|
||||
if (physmem_size + iomem_size > max_physmem) {
|
||||
highmem = physmem_size + iomem_size - max_physmem;
|
||||
physmem_size -= highmem;
|
||||
|
@ -88,6 +88,7 @@ config X86
|
||||
select ARCH_HAS_PMEM_API if X86_64
|
||||
select ARCH_HAS_PTE_DEVMAP if X86_64
|
||||
select ARCH_HAS_PTE_SPECIAL
|
||||
select ARCH_HAS_HW_PTE_YOUNG
|
||||
select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
|
||||
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
|
||||
select ARCH_HAS_COPY_MC if X86_64
|
||||
|
@ -141,6 +141,7 @@ static inline int pte_young(pte_t pte)
|
||||
return pte_flags(pte) & _PAGE_ACCESSED;
|
||||
}
|
||||
|
||||
#define pmd_dirty pmd_dirty
|
||||
static inline bool pmd_dirty(pmd_t pmd)
|
||||
{
|
||||
return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
|
||||
@ -1679,12 +1680,6 @@ static inline bool arch_has_pfn_modify_check(void)
|
||||
return boot_cpu_has_bug(X86_BUG_L1TF);
|
||||
}
|
||||
|
||||
#define arch_has_hw_pte_young arch_has_hw_pte_young
|
||||
static inline bool arch_has_hw_pte_young(void)
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
#define arch_check_zapped_pte arch_check_zapped_pte
|
||||
void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
|
||||
|
||||
|
@ -449,37 +449,6 @@ int __node_distance(int from, int to)
|
||||
}
|
||||
EXPORT_SYMBOL(__node_distance);
|
||||
|
||||
/*
|
||||
* Sanity check to catch more bad NUMA configurations (they are amazingly
|
||||
* common). Make sure the nodes cover all memory.
|
||||
*/
|
||||
static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
|
||||
{
|
||||
u64 numaram, e820ram;
|
||||
int i;
|
||||
|
||||
numaram = 0;
|
||||
for (i = 0; i < mi->nr_blks; i++) {
|
||||
u64 s = mi->blk[i].start >> PAGE_SHIFT;
|
||||
u64 e = mi->blk[i].end >> PAGE_SHIFT;
|
||||
numaram += e - s;
|
||||
numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
|
||||
if ((s64)numaram < 0)
|
||||
numaram = 0;
|
||||
}
|
||||
|
||||
e820ram = max_pfn - absent_pages_in_range(0, max_pfn);
|
||||
|
||||
/* We seem to lose 3 pages somewhere. Allow 1M of slack. */
|
||||
if ((s64)(e820ram - numaram) >= (1 << (20 - PAGE_SHIFT))) {
|
||||
printk(KERN_ERR "NUMA: nodes only cover %LuMB of your %LuMB e820 RAM. Not used.\n",
|
||||
(numaram << PAGE_SHIFT) >> 20,
|
||||
(e820ram << PAGE_SHIFT) >> 20);
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
/*
|
||||
* Mark all currently memblock-reserved physical memory (which covers the
|
||||
* kernel's own memory ranges) as hot-unswappable.
|
||||
@ -585,7 +554,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
|
||||
return -EINVAL;
|
||||
}
|
||||
}
|
||||
if (!numa_meminfo_cover_memory(mi))
|
||||
|
||||
if (!memblock_validate_numa_coverage(SZ_1M))
|
||||
return -EINVAL;
|
||||
|
||||
/* Finally register nodes. */
|
||||
|
@ -793,7 +793,7 @@ config ARCH_FORCE_MAX_ORDER
|
||||
default "10"
|
||||
help
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
contiguous allocations. The limit is called MAX_PAGE_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
|
@ -18,6 +18,8 @@
|
||||
#define KASAN_SHADOW_START (XCHAL_PAGE_TABLE_VADDR + XCHAL_PAGE_TABLE_SIZE)
|
||||
/* Size of the shadow map */
|
||||
#define KASAN_SHADOW_SIZE (-KASAN_START_VADDR >> KASAN_SHADOW_SCALE_SHIFT)
|
||||
/* End of the shadow map */
|
||||
#define KASAN_SHADOW_END (KASAN_SHADOW_START + KASAN_SHADOW_SIZE)
|
||||
/* Offset for mem to shadow address transformation */
|
||||
#define KASAN_SHADOW_OFFSET __XTENSA_UL_CONST(CONFIG_KASAN_SHADOW_OFFSET)
|
||||
|
||||
|
block/fops.c
@ -410,9 +410,24 @@ static int blkdev_get_block(struct inode *inode, sector_t iblock,
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
|
||||
/*
|
||||
* We cannot call mpage_writepages() as it does not take the buffer lock.
|
||||
* We must use block_write_full_folio() directly which holds the buffer
|
||||
* lock. The buffer lock provides the synchronisation with writeback
|
||||
* that filesystems rely on when they use the blockdev's mapping.
|
||||
*/
|
||||
static int blkdev_writepages(struct address_space *mapping,
|
||||
struct writeback_control *wbc)
|
||||
{
|
||||
return block_write_full_page(page, blkdev_get_block, wbc);
|
||||
struct blk_plug plug;
|
||||
int err;
|
||||
|
||||
blk_start_plug(&plug);
|
||||
err = write_cache_pages(mapping, wbc, block_write_full_folio,
|
||||
blkdev_get_block);
|
||||
blk_finish_plug(&plug);
|
||||
|
||||
return err;
|
||||
}
|
||||
|
||||
static int blkdev_read_folio(struct file *file, struct folio *folio)
|
||||
@ -449,7 +464,7 @@ const struct address_space_operations def_blk_aops = {
|
||||
.invalidate_folio = block_invalidate_folio,
|
||||
.read_folio = blkdev_read_folio,
|
||||
.readahead = blkdev_readahead,
|
||||
.writepage = blkdev_writepage,
|
||||
.writepages = blkdev_writepages,
|
||||
.write_begin = blkdev_write_begin,
|
||||
.write_end = blkdev_write_end,
|
||||
.migrate_folio = buffer_migrate_folio_norefs,
|
||||
@ -500,7 +515,7 @@ const struct address_space_operations def_blk_aops = {
|
||||
.readahead = blkdev_readahead,
|
||||
.writepages = blkdev_writepages,
|
||||
.is_partially_uptodate = iomap_is_partially_uptodate,
|
||||
.error_remove_page = generic_error_remove_page,
|
||||
.error_remove_folio = generic_error_remove_folio,
|
||||
.migrate_folio = filemap_migrate_folio,
|
||||
};
|
||||
#endif /* CONFIG_BUFFER_HEAD */
|
||||
|
@ -451,7 +451,7 @@ static int create_sgt(struct qaic_device *qdev, struct sg_table **sgt_out, u64 s
|
||||
* later
|
||||
*/
|
||||
buf_extra = (PAGE_SIZE - size % PAGE_SIZE) % PAGE_SIZE;
|
||||
max_order = min(MAX_ORDER - 1, get_order(size));
|
||||
max_order = min(MAX_PAGE_ORDER - 1, get_order(size));
|
||||
} else {
|
||||
/* allocate a single page for book keeping */
|
||||
nr_pages = 1;
|
||||
|
@ -234,7 +234,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
|
||||
if (page->page_ptr) {
|
||||
trace_binder_alloc_lru_start(alloc, index);
|
||||
|
||||
on_lru = list_lru_del(&binder_alloc_lru, &page->lru);
|
||||
on_lru = list_lru_del_obj(&binder_alloc_lru, &page->lru);
|
||||
WARN_ON(!on_lru);
|
||||
|
||||
trace_binder_alloc_lru_end(alloc, index);
|
||||
@ -285,7 +285,7 @@ free_range:
|
||||
|
||||
trace_binder_free_lru_start(alloc, index);
|
||||
|
||||
ret = list_lru_add(&binder_alloc_lru, &page->lru);
|
||||
ret = list_lru_add_obj(&binder_alloc_lru, &page->lru);
|
||||
WARN_ON(!ret);
|
||||
|
||||
trace_binder_free_lru_end(alloc, index);
|
||||
@ -848,7 +848,7 @@ void binder_alloc_deferred_release(struct binder_alloc *alloc)
|
||||
if (!alloc->pages[i].page_ptr)
|
||||
continue;
|
||||
|
||||
on_lru = list_lru_del(&binder_alloc_lru,
|
||||
on_lru = list_lru_del_obj(&binder_alloc_lru,
|
||||
&alloc->pages[i].lru);
|
||||
page_addr = alloc->buffer + i * PAGE_SIZE;
|
||||
binder_alloc_debug(BINDER_DEBUG_BUFFER_ALLOC,
|
||||
@ -1287,4 +1287,3 @@ int binder_alloc_copy_from_buffer(struct binder_alloc *alloc,
|
||||
return binder_alloc_do_buffer_copy(alloc, false, buffer, buffer_offset,
|
||||
dest, bytes);
|
||||
}
|
||||
|
||||
|
@ -226,8 +226,8 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
|
||||
if (*ppos < 0 || !count)
|
||||
return -EINVAL;
|
||||
|
||||
if (count > (PAGE_SIZE << MAX_ORDER))
|
||||
count = PAGE_SIZE << MAX_ORDER;
|
||||
if (count > (PAGE_SIZE << MAX_PAGE_ORDER))
|
||||
count = PAGE_SIZE << MAX_PAGE_ORDER;
|
||||
|
||||
buf = kmalloc(count, GFP_KERNEL);
|
||||
if (!buf)
|
||||
@ -373,8 +373,8 @@ static ssize_t regmap_reg_ranges_read_file(struct file *file,
|
||||
if (*ppos < 0 || !count)
|
||||
return -EINVAL;
|
||||
|
||||
if (count > (PAGE_SIZE << MAX_ORDER))
|
||||
count = PAGE_SIZE << MAX_ORDER;
|
||||
if (count > (PAGE_SIZE << MAX_PAGE_ORDER))
|
||||
count = PAGE_SIZE << MAX_PAGE_ORDER;
|
||||
|
||||
buf = kmalloc(count, GFP_KERNEL);
|
||||
if (!buf)
|
||||
|
@ -3079,7 +3079,7 @@ static void raw_cmd_free(struct floppy_raw_cmd **ptr)
|
||||
}
|
||||
}
|
||||
|
||||
#define MAX_LEN (1UL << MAX_ORDER << PAGE_SHIFT)
|
||||
#define MAX_LEN (1UL << MAX_PAGE_ORDER << PAGE_SHIFT)
|
||||
|
||||
static int raw_cmd_copyin(int cmd, void __user *param,
|
||||
struct floppy_raw_cmd **rcmd)
|
||||
|
@ -59,8 +59,8 @@ config ZRAM_WRITEBACK
|
||||
bool "Write back incompressible or idle page to backing device"
|
||||
depends on ZRAM
|
||||
help
|
||||
With incompressible page, there is no memory saving to keep it
|
||||
in memory. Instead, write it out to backing device.
|
||||
This lets zram entries (incompressible or idle pages) be written
|
||||
back to a backing device, helping save memory.
|
||||
For this feature, admin should set up backing device via
|
||||
/sys/block/zramX/backing_dev.
|
||||
|
||||
@ -69,9 +69,18 @@ config ZRAM_WRITEBACK
|
||||
|
||||
See Documentation/admin-guide/blockdev/zram.rst for more information.
|
||||
|
||||
config ZRAM_TRACK_ENTRY_ACTIME
|
||||
bool "Track access time of zram entries"
|
||||
depends on ZRAM
|
||||
help
|
||||
With this feature zram tracks access time of every stored
|
||||
entry (page), which can be used for a more fine grained IDLE
|
||||
pages writeback.
|
||||
|
||||
config ZRAM_MEMORY_TRACKING
|
||||
bool "Track zRam block status"
|
||||
depends on ZRAM && DEBUG_FS
|
||||
select ZRAM_TRACK_ENTRY_ACTIME
|
||||
help
|
||||
With this feature, admin can track the state of allocated blocks
|
||||
of zRAM. Admin could see the information via
|
||||
@ -86,4 +95,4 @@ config ZRAM_MULTI_COMP
|
||||
This will enable multi-compression streams, so that ZRAM can
|
||||
re-compress pages using a potentially slower but more effective
|
||||
compression algorithm. Note, that IDLE page recompression
|
||||
requires ZRAM_MEMORY_TRACKING.
|
||||
requires ZRAM_TRACK_ENTRY_ACTIME.
|
||||
|
@ -174,6 +174,14 @@ static inline u32 zram_get_priority(struct zram *zram, u32 index)
|
||||
return prio & ZRAM_COMP_PRIORITY_MASK;
|
||||
}
|
||||
|
||||
static void zram_accessed(struct zram *zram, u32 index)
|
||||
{
|
||||
zram_clear_flag(zram, index, ZRAM_IDLE);
|
||||
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
|
||||
zram->table[index].ac_time = ktime_get_boottime();
|
||||
#endif
|
||||
}
|
||||
|
||||
static inline void update_used_max(struct zram *zram,
|
||||
const unsigned long pages)
|
||||
{
|
||||
@ -293,8 +301,9 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
|
||||
zram_slot_lock(zram, index);
|
||||
if (zram_allocated(zram, index) &&
|
||||
!zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
|
||||
#ifdef CONFIG_ZRAM_MEMORY_TRACKING
|
||||
is_idle = !cutoff || ktime_after(cutoff, zram->table[index].ac_time);
|
||||
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
|
||||
is_idle = !cutoff || ktime_after(cutoff,
|
||||
zram->table[index].ac_time);
|
||||
#endif
|
||||
if (is_idle)
|
||||
zram_set_flag(zram, index, ZRAM_IDLE);
|
||||
@ -317,7 +326,7 @@ static ssize_t idle_store(struct device *dev,
|
||||
*/
|
||||
u64 age_sec;
|
||||
|
||||
if (IS_ENABLED(CONFIG_ZRAM_MEMORY_TRACKING) && !kstrtoull(buf, 0, &age_sec))
|
||||
if (IS_ENABLED(CONFIG_ZRAM_TRACK_ENTRY_ACTIME) && !kstrtoull(buf, 0, &age_sec))
|
||||
cutoff_time = ktime_sub(ktime_get_boottime(),
|
||||
ns_to_ktime(age_sec * NSEC_PER_SEC));
|
||||
else
|
||||
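A hedged user-space sketch of how the idle_store() path above is typically exercised: writing an age in seconds to the idle attribute only takes effect when ZRAM_TRACK_ENTRY_ACTIME (formerly ZRAM_MEMORY_TRACKING) is enabled; the device name zram0 and the one-hour cutoff are assumptions for the example::

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/block/zram0/idle", "w");  /* device name is an assumption */

            if (!f) {
                    perror("open idle attribute");
                    return 1;
            }
            fputs("3600", f);       /* mark entries untouched for an hour as ZRAM_IDLE */
            fclose(f);
            return 0;
    }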
@ -841,12 +850,6 @@ static void zram_debugfs_destroy(void)
|
||||
debugfs_remove_recursive(zram_debugfs_root);
|
||||
}
|
||||
|
||||
static void zram_accessed(struct zram *zram, u32 index)
|
||||
{
|
||||
zram_clear_flag(zram, index, ZRAM_IDLE);
|
||||
zram->table[index].ac_time = ktime_get_boottime();
|
||||
}
|
||||
|
||||
static ssize_t read_block_state(struct file *file, char __user *buf,
|
||||
size_t count, loff_t *ppos)
|
||||
{
|
||||
@ -930,10 +933,6 @@ static void zram_debugfs_unregister(struct zram *zram)
|
||||
#else
|
||||
static void zram_debugfs_create(void) {};
|
||||
static void zram_debugfs_destroy(void) {};
|
||||
static void zram_accessed(struct zram *zram, u32 index)
|
||||
{
|
||||
zram_clear_flag(zram, index, ZRAM_IDLE);
|
||||
};
|
||||
static void zram_debugfs_register(struct zram *zram) {};
|
||||
static void zram_debugfs_unregister(struct zram *zram) {};
|
||||
#endif
|
||||
@ -1254,7 +1253,7 @@ static void zram_free_page(struct zram *zram, size_t index)
|
||||
{
|
||||
unsigned long handle;
|
||||
|
||||
#ifdef CONFIG_ZRAM_MEMORY_TRACKING
|
||||
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
|
||||
zram->table[index].ac_time = 0;
|
||||
#endif
|
||||
if (zram_test_flag(zram, index, ZRAM_IDLE))
|
||||
@ -1322,9 +1321,9 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
|
||||
void *mem;
|
||||
|
||||
value = handle ? zram_get_element(zram, index) : 0;
|
||||
mem = kmap_atomic(page);
|
||||
mem = kmap_local_page(page);
|
||||
zram_fill_page(mem, PAGE_SIZE, value);
|
||||
kunmap_atomic(mem);
|
||||
kunmap_local(mem);
|
||||
return 0;
|
||||
}
|
||||
|
||||
@ -1337,14 +1336,14 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
|
||||
|
||||
src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
|
||||
if (size == PAGE_SIZE) {
|
||||
dst = kmap_atomic(page);
|
||||
dst = kmap_local_page(page);
|
||||
memcpy(dst, src, PAGE_SIZE);
|
||||
kunmap_atomic(dst);
|
||||
kunmap_local(dst);
|
||||
ret = 0;
|
||||
} else {
|
||||
dst = kmap_atomic(page);
|
||||
dst = kmap_local_page(page);
|
||||
ret = zcomp_decompress(zstrm, src, size, dst);
|
||||
kunmap_atomic(dst);
|
||||
kunmap_local(dst);
|
||||
zcomp_stream_put(zram->comps[prio]);
|
||||
}
|
||||
zs_unmap_object(zram->mem_pool, handle);
|
||||
@ -1417,21 +1416,21 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
|
||||
unsigned long element = 0;
|
||||
enum zram_pageflags flags = 0;
|
||||
|
||||
mem = kmap_atomic(page);
|
||||
mem = kmap_local_page(page);
|
||||
if (page_same_filled(mem, &element)) {
|
||||
kunmap_atomic(mem);
|
||||
kunmap_local(mem);
|
||||
/* Free memory associated with this sector now. */
|
||||
flags = ZRAM_SAME;
|
||||
atomic64_inc(&zram->stats.same_pages);
|
||||
goto out;
|
||||
}
|
||||
kunmap_atomic(mem);
|
||||
kunmap_local(mem);
|
||||
|
||||
compress_again:
|
||||
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
|
||||
src = kmap_atomic(page);
|
||||
src = kmap_local_page(page);
|
||||
ret = zcomp_compress(zstrm, src, &comp_len);
|
||||
kunmap_atomic(src);
|
||||
kunmap_local(src);
|
||||
|
||||
if (unlikely(ret)) {
|
||||
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
|
||||
@ -1495,10 +1494,10 @@ compress_again:
|
||||
|
||||
src = zstrm->buffer;
|
||||
if (comp_len == PAGE_SIZE)
|
||||
src = kmap_atomic(page);
|
||||
src = kmap_local_page(page);
|
||||
memcpy(dst, src, comp_len);
|
||||
if (comp_len == PAGE_SIZE)
|
||||
kunmap_atomic(src);
|
||||
kunmap_local(src);
|
||||
|
||||
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
|
||||
zs_unmap_object(zram->mem_pool, handle);
|
||||
@ -1615,9 +1614,9 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
|
||||
|
||||
num_recomps++;
|
||||
zstrm = zcomp_stream_get(zram->comps[prio]);
|
||||
src = kmap_atomic(page);
|
||||
src = kmap_local_page(page);
|
||||
ret = zcomp_compress(zstrm, src, &comp_len_new);
|
||||
kunmap_atomic(src);
|
||||
kunmap_local(src);
|
||||
|
||||
if (ret) {
|
||||
zcomp_stream_put(zram->comps[prio]);
|
||||
|
@ -69,7 +69,7 @@ struct zram_table_entry {
|
||||
unsigned long element;
|
||||
};
|
||||
unsigned long flags;
|
||||
#ifdef CONFIG_ZRAM_MEMORY_TRACKING
|
||||
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
|
||||
ktime_t ac_time;
|
||||
#endif
|
||||
};
|
||||
|
@ -906,7 +906,7 @@ static int sev_ioctl_do_get_id2(struct sev_issue_cmd *argp)
|
||||
/*
|
||||
* The length of the ID shouldn't be assumed by software since
|
||||
* it may change in the future. The allocation size is limited
|
||||
* to 1 << (PAGE_SHIFT + MAX_ORDER) by the page allocator.
|
||||
* to 1 << (PAGE_SHIFT + MAX_PAGE_ORDER) by the page allocator.
|
||||
* If the allocation fails, simply return ENOMEM rather than
|
||||
* warning in the kernel log.
|
||||
*/
|
||||
|
@ -70,11 +70,11 @@ struct hisi_acc_sgl_pool *hisi_acc_create_sgl_pool(struct device *dev,
|
||||
HISI_ACC_SGL_ALIGN_SIZE);
|
||||
|
||||
/*
|
||||
* the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_ORDER,
|
||||
* the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_PAGE_ORDER,
|
||||
* block size may exceed 2^31 on ia64, so the max of block size is 2^31
|
||||
*/
|
||||
block_size = 1 << (PAGE_SHIFT + MAX_ORDER < 32 ?
|
||||
PAGE_SHIFT + MAX_ORDER : 31);
|
||||
block_size = 1 << (PAGE_SHIFT + MAX_PAGE_ORDER < 32 ?
|
||||
PAGE_SHIFT + MAX_PAGE_ORDER : 31);
|
||||
sgl_num_per_block = block_size / sgl_size;
|
||||
block_num = count / sgl_num_per_block;
|
||||
remain_sgl = count % sgl_num_per_block;
|
||||
|
@ -367,6 +367,7 @@ static ssize_t create_store(struct device *dev, struct device_attribute *attr,
|
||||
.dax_region = dax_region,
|
||||
.size = 0,
|
||||
.id = -1,
|
||||
.memmap_on_memory = false,
|
||||
};
|
||||
struct dev_dax *dev_dax = devm_create_dev_dax(&data);
|
||||
|
||||
@ -1400,6 +1401,8 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
|
||||
dev_dax->align = dax_region->align;
|
||||
ida_init(&dev_dax->ida);
|
||||
|
||||
dev_dax->memmap_on_memory = data->memmap_on_memory;
|
||||
|
||||
inode = dax_inode(dax_dev);
|
||||
dev->devt = inode->i_rdev;
|
||||
dev->bus = &dax_bus_type;
|
||||
|
@ -23,6 +23,7 @@ struct dev_dax_data {
|
||||
struct dev_pagemap *pgmap;
|
||||
resource_size_t size;
|
||||
int id;
|
||||
bool memmap_on_memory;
|
||||
};
|
||||
|
||||
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
|
||||
|
@ -26,6 +26,7 @@ static int cxl_dax_region_probe(struct device *dev)
|
||||
.dax_region = dax_region,
|
||||
.id = -1,
|
||||
.size = range_len(&cxlr_dax->hpa_range),
|
||||
.memmap_on_memory = true,
|
||||
};
|
||||
|
||||
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
|
||||
|
@ -70,6 +70,7 @@ struct dev_dax {
|
||||
struct ida ida;
|
||||
struct device dev;
|
||||
struct dev_pagemap *pgmap;
|
||||
bool memmap_on_memory;
|
||||
int nr_range;
|
||||
struct dev_dax_range {
|
||||
unsigned long pgoff;
|
||||
|
@ -36,6 +36,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
|
||||
.dax_region = dax_region,
|
||||
.id = -1,
|
||||
.size = region_idle ? 0 : range_len(&mri->range),
|
||||
.memmap_on_memory = false,
|
||||
};
|
||||
|
||||
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
|
||||
|
@ -12,6 +12,7 @@
|
||||
#include <linux/mm.h>
|
||||
#include <linux/mman.h>
|
||||
#include <linux/memory-tiers.h>
|
||||
#include <linux/memory_hotplug.h>
|
||||
#include "dax-private.h"
|
||||
#include "bus.h"
|
||||
|
||||
@ -93,6 +94,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
|
||||
struct dax_kmem_data *data;
|
||||
struct memory_dev_type *mtype;
|
||||
int i, rc, mapped = 0;
|
||||
mhp_t mhp_flags;
|
||||
int numa_node;
|
||||
int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
|
||||
|
||||
@ -179,12 +181,16 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
|
||||
*/
|
||||
res->flags = IORESOURCE_SYSTEM_RAM;
|
||||
|
||||
mhp_flags = MHP_NID_IS_MGID;
|
||||
if (dev_dax->memmap_on_memory)
|
||||
mhp_flags |= MHP_MEMMAP_ON_MEMORY;
|
||||
|
||||
/*
|
||||
* Ensure that future kexec'd kernels will not treat
|
||||
* this as RAM automatically.
|
||||
*/
|
||||
rc = add_memory_driver_managed(data->mgid, range.start,
|
||||
range_len(&range), kmem_name, MHP_NID_IS_MGID);
|
||||
range_len(&range), kmem_name, mhp_flags);
|
||||
|
||||
if (rc) {
|
||||
dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
|
||||
|
@ -63,6 +63,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
|
||||
.id = id,
|
||||
.pgmap = &pgmap,
|
||||
.size = range_len(&range),
|
||||
.memmap_on_memory = false,
|
||||
};
|
||||
|
||||
return devm_create_dev_dax(&data);
|
||||
|
@ -36,7 +36,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj)
|
||||
struct sg_table *st;
|
||||
struct scatterlist *sg;
|
||||
unsigned int npages; /* restricted by sg_alloc_table */
|
||||
int max_order = MAX_ORDER;
|
||||
int max_order = MAX_PAGE_ORDER;
|
||||
unsigned int max_segment;
|
||||
gfp_t gfp;
|
||||
|
||||
|
@ -115,7 +115,7 @@ static int get_huge_pages(struct drm_i915_gem_object *obj)
|
||||
do {
|
||||
struct page *page;
|
||||
|
||||
GEM_BUG_ON(order > MAX_ORDER);
|
||||
GEM_BUG_ON(order > MAX_PAGE_ORDER);
|
||||
page = alloc_pages(GFP | __GFP_ZERO, order);
|
||||
if (!page)
|
||||
goto err;
|
||||
|
@ -175,7 +175,7 @@ static void ttm_device_init_pools(struct kunit *test)
|
||||
|
||||
if (params->pools_init_expected) {
|
||||
for (int i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
|
||||
for (int j = 0; j <= MAX_ORDER; ++j) {
|
||||
for (int j = 0; j < NR_PAGE_ORDERS; ++j) {
|
||||
pt = pool->caching[i].orders[j];
|
||||
KUNIT_EXPECT_PTR_EQ(test, pt.pool, pool);
|
||||
KUNIT_EXPECT_EQ(test, pt.caching, i);
|
||||
|
@ -109,7 +109,7 @@ static const struct ttm_pool_test_case ttm_pool_basic_cases[] = {
|
||||
},
|
||||
{
|
||||
.description = "Above the allocation limit",
|
||||
.order = MAX_ORDER + 1,
|
||||
.order = MAX_PAGE_ORDER + 1,
|
||||
},
|
||||
{
|
||||
.description = "One page, with coherent DMA mappings enabled",
|
||||
@ -118,7 +118,7 @@ static const struct ttm_pool_test_case ttm_pool_basic_cases[] = {
|
||||
},
|
||||
{
|
||||
.description = "Above the allocation limit, with coherent DMA mappings enabled",
|
||||
.order = MAX_ORDER + 1,
|
||||
.order = MAX_PAGE_ORDER + 1,
|
||||
.use_dma_alloc = true,
|
||||
},
|
||||
};
|
||||
@ -165,7 +165,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
|
||||
fst_page = tt->pages[0];
|
||||
last_page = tt->pages[tt->num_pages - 1];
|
||||
|
||||
if (params->order <= MAX_ORDER) {
|
||||
if (params->order <= MAX_PAGE_ORDER) {
|
||||
if (params->use_dma_alloc) {
|
||||
KUNIT_ASSERT_NOT_NULL(test, (void *)fst_page->private);
|
||||
KUNIT_ASSERT_NOT_NULL(test, (void *)last_page->private);
|
||||
@ -182,7 +182,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
|
||||
* order 0 blocks
|
||||
*/
|
||||
KUNIT_ASSERT_EQ(test, fst_page->private,
|
||||
min_t(unsigned int, MAX_ORDER,
|
||||
min_t(unsigned int, MAX_PAGE_ORDER,
|
||||
params->order));
|
||||
KUNIT_ASSERT_EQ(test, last_page->private, 0);
|
||||
}
|
||||
|
@ -65,11 +65,11 @@ module_param(page_pool_size, ulong, 0644);
|
||||
|
||||
static atomic_long_t allocated_pages;
|
||||
|
||||
static struct ttm_pool_type global_write_combined[MAX_ORDER + 1];
|
||||
static struct ttm_pool_type global_uncached[MAX_ORDER + 1];
|
||||
static struct ttm_pool_type global_write_combined[NR_PAGE_ORDERS];
|
||||
static struct ttm_pool_type global_uncached[NR_PAGE_ORDERS];
|
||||
|
||||
static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1];
|
||||
static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
|
||||
static struct ttm_pool_type global_dma32_write_combined[NR_PAGE_ORDERS];
|
||||
static struct ttm_pool_type global_dma32_uncached[NR_PAGE_ORDERS];
|
||||
|
||||
static spinlock_t shrinker_lock;
|
||||
static struct list_head shrinker_list;
|
||||
@ -447,7 +447,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
|
||||
else
|
||||
gfp_flags |= GFP_HIGHUSER;
|
||||
|
||||
for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
|
||||
for (order = min_t(unsigned int, MAX_PAGE_ORDER, __fls(num_pages));
|
||||
num_pages;
|
||||
order = min_t(unsigned int, order, __fls(num_pages))) {
|
||||
struct ttm_pool_type *pt;
|
||||
@ -568,7 +568,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
|
||||
|
||||
if (use_dma_alloc || nid != NUMA_NO_NODE) {
|
||||
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
|
||||
for (j = 0; j <= MAX_ORDER; ++j)
|
||||
for (j = 0; j < NR_PAGE_ORDERS; ++j)
|
||||
ttm_pool_type_init(&pool->caching[i].orders[j],
|
||||
pool, i, j);
|
||||
}
|
||||
@ -601,7 +601,7 @@ void ttm_pool_fini(struct ttm_pool *pool)
|
||||
|
||||
if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) {
|
||||
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
|
||||
for (j = 0; j <= MAX_ORDER; ++j)
|
||||
for (j = 0; j < NR_PAGE_ORDERS; ++j)
|
||||
ttm_pool_type_fini(&pool->caching[i].orders[j]);
|
||||
}
|
||||
|
||||
@ -656,7 +656,7 @@ static void ttm_pool_debugfs_header(struct seq_file *m)
|
||||
unsigned int i;
|
||||
|
||||
seq_puts(m, "\t ");
|
||||
for (i = 0; i <= MAX_ORDER; ++i)
|
||||
for (i = 0; i < NR_PAGE_ORDERS; ++i)
|
||||
seq_printf(m, " ---%2u---", i);
|
||||
seq_puts(m, "\n");
|
||||
}
|
||||
@ -667,7 +667,7 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
|
||||
{
|
||||
unsigned int i;
|
||||
|
||||
for (i = 0; i <= MAX_ORDER; ++i)
|
||||
for (i = 0; i < NR_PAGE_ORDERS; ++i)
|
||||
seq_printf(m, " %8u", ttm_pool_type_count(&pt[i]));
|
||||
seq_puts(m, "\n");
|
||||
}
|
||||
@ -776,7 +776,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
|
||||
spin_lock_init(&shrinker_lock);
|
||||
INIT_LIST_HEAD(&shrinker_list);
|
||||
|
||||
for (i = 0; i <= MAX_ORDER; ++i) {
|
||||
for (i = 0; i < NR_PAGE_ORDERS; ++i) {
|
||||
ttm_pool_type_init(&global_write_combined[i], NULL,
|
||||
ttm_write_combined, i);
|
||||
ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i);
|
||||
@ -816,7 +816,7 @@ void ttm_pool_mgr_fini(void)
|
||||
{
|
||||
unsigned int i;
|
||||
|
||||
for (i = 0; i <= MAX_ORDER; ++i) {
|
||||
for (i = 0; i < NR_PAGE_ORDERS; ++i) {
|
||||
ttm_pool_type_fini(&global_write_combined[i]);
|
||||
ttm_pool_type_fini(&global_uncached[i]);
|
||||
|
||||
|
@ -188,7 +188,7 @@
|
||||
#ifdef CONFIG_CMA_ALIGNMENT
|
||||
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + CONFIG_CMA_ALIGNMENT)
|
||||
#else
|
||||
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER)
|
||||
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_PAGE_ORDER)
|
||||
#endif
|
||||
|
||||
/*
|
||||
|
@ -884,7 +884,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
|
||||
struct page **pages;
|
||||
unsigned int i = 0, nid = dev_to_node(dev);
|
||||
|
||||
order_mask &= GENMASK(MAX_ORDER, 0);
|
||||
order_mask &= GENMASK(MAX_PAGE_ORDER, 0);
|
||||
if (!order_mask)
|
||||
return NULL;
|
||||
|
||||
|
@ -2465,8 +2465,8 @@ static bool its_parse_indirect_baser(struct its_node *its,
|
||||
* feature is not supported by hardware.
|
||||
*/
|
||||
new_order = max_t(u32, get_order(esz << ids), new_order);
|
||||
if (new_order > MAX_ORDER) {
|
||||
new_order = MAX_ORDER;
|
||||
if (new_order > MAX_PAGE_ORDER) {
|
||||
new_order = MAX_PAGE_ORDER;
|
||||
ids = ilog2(PAGE_ORDER_TO_SIZE(new_order) / (int)esz);
|
||||
pr_warn("ITS@%pa: %s Table too large, reduce ids %llu->%u\n",
|
||||
&its->phys_base, its_base_type_string[type],
|
||||
|
@ -1170,7 +1170,7 @@ static void __cache_size_refresh(void)
|
||||
* If the allocation may fail we use __get_free_pages. Memory fragmentation
|
||||
* won't have a fatal effect here, but it just causes flushes of some other
|
||||
* buffers and more I/O will be performed. Don't use __get_free_pages if it
|
||||
* always fails (i.e. order > MAX_ORDER).
|
||||
* always fails (i.e. order > MAX_PAGE_ORDER).
|
||||
*
|
||||
* If the allocation shouldn't fail we use __vmalloc. This is only for the
|
||||
* initial reserve allocation, so there's no risk of wasting all vmalloc
|
||||
|
@ -1673,7 +1673,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
|
||||
unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
|
||||
gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
|
||||
unsigned int remaining_size;
|
||||
unsigned int order = MAX_ORDER;
|
||||
unsigned int order = MAX_PAGE_ORDER;
|
||||
|
||||
retry:
|
||||
if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
|
||||
|
@ -434,7 +434,7 @@ static struct bio *clone_bio(struct dm_target *ti, struct flakey_c *fc, struct b
|
||||
|
||||
remaining_size = size;
|
||||
|
||||
order = MAX_ORDER;
|
||||
order = MAX_PAGE_ORDER;
|
||||
while (remaining_size) {
|
||||
struct page *pages;
|
||||
unsigned size_to_add, to_copy;
|
||||
|
@ -443,7 +443,7 @@ static int genwqe_mmap(struct file *filp, struct vm_area_struct *vma)
|
||||
if (vsize == 0)
|
||||
return -EINVAL;
|
||||
|
||||
if (get_order(vsize) > MAX_ORDER)
|
||||
if (get_order(vsize) > MAX_PAGE_ORDER)
|
||||
return -ENOMEM;
|
||||
|
||||
dma_map = kzalloc(sizeof(struct dma_mapping), GFP_KERNEL);
|
||||
|
@ -210,7 +210,7 @@ u32 genwqe_crc32(u8 *buff, size_t len, u32 init)
|
||||
void *__genwqe_alloc_consistent(struct genwqe_dev *cd, size_t size,
|
||||
dma_addr_t *dma_handle)
|
||||
{
|
||||
if (get_order(size) > MAX_ORDER)
|
||||
if (get_order(size) > MAX_PAGE_ORDER)
|
||||
return NULL;
|
||||
|
||||
return dma_alloc_coherent(&cd->pci_dev->dev, size, dma_handle,
|
||||
@ -308,7 +308,7 @@ int genwqe_alloc_sync_sgl(struct genwqe_dev *cd, struct genwqe_sgl *sgl,
|
||||
sgl->write = write;
|
||||
sgl->sgl_size = genwqe_sgl_size(sgl->nr_pages);
|
||||
|
||||
if (get_order(sgl->sgl_size) > MAX_ORDER) {
|
||||
if (get_order(sgl->sgl_size) > MAX_PAGE_ORDER) {
|
||||
dev_err(&pci_dev->dev,
|
||||
"[%s] err: too much memory requested!\n", __func__);
|
||||
return ret;
|
||||
|
@ -1041,7 +1041,7 @@ static void hns3_init_tx_spare_buffer(struct hns3_enet_ring *ring)
|
||||
return;
|
||||
|
||||
order = get_order(alloc_size);
|
||||
if (order > MAX_ORDER) {
|
||||
if (order > MAX_PAGE_ORDER) {
|
||||
if (net_ratelimit())
|
||||
dev_warn(ring_to_dev(ring), "failed to allocate tx spare buffer, exceed to max order\n");
|
||||
return;
|
||||
|
@ -48,7 +48,7 @@
|
||||
* of 4096 jumbo frames (MTU=9000) we will need about 9K*4K = 36MB plus
|
||||
* some padding.
|
||||
*
|
||||
* But the size of a single DMA region is limited by MAX_ORDER in the
|
||||
* But the size of a single DMA region is limited by MAX_PAGE_ORDER in the
|
||||
* kernel (about 16MB currently). To support say 4K Jumbo frames, we
|
||||
* use a set of LTBs (struct ltb_set) per pool.
|
||||
*
|
||||
@ -75,7 +75,7 @@
|
||||
* pool for the 4MB. Thus the 16 Rx and Tx queues require 32 * 5 = 160
|
||||
* plus 16 for the TSO pools for a total of 176 LTB mappings per VNIC.
|
||||
*/
|
||||
#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_ORDER) * PAGE_SIZE))
|
||||
#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_PAGE_ORDER) * PAGE_SIZE))
|
||||
#define IBMVNIC_ONE_LTB_SIZE min((u32)(8 << 20), IBMVNIC_ONE_LTB_MAX)
|
||||
#define IBMVNIC_LTB_SET_SIZE (38 << 20)
|
||||
|
||||
|
@ -927,8 +927,8 @@ static phys_addr_t hvfb_get_phymem(struct hv_device *hdev,
|
||||
if (request_size == 0)
|
||||
return -1;
|
||||
|
||||
if (order <= MAX_ORDER) {
|
||||
/* Call alloc_pages if the size is less than 2^MAX_ORDER */
|
||||
if (order <= MAX_PAGE_ORDER) {
|
||||
/* Call alloc_pages if the size is less than 2^MAX_PAGE_ORDER */
|
||||
page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
|
||||
if (!page)
|
||||
return -1;
|
||||
@ -958,7 +958,7 @@ static void hvfb_release_phymem(struct hv_device *hdev,
|
||||
{
|
||||
unsigned int order = get_order(size);
|
||||
|
||||
if (order <= MAX_ORDER)
|
||||
if (order <= MAX_PAGE_ORDER)
|
||||
__free_pages(pfn_to_page(paddr >> PAGE_SHIFT), order);
|
||||
else
|
||||
dma_free_coherent(&hdev->device,
|
||||
|
@ -197,7 +197,7 @@ static int vmlfb_alloc_vram(struct vml_info *vinfo,
|
||||
va = &vinfo->vram[i];
|
||||
order = 0;
|
||||
|
||||
while (requested > (PAGE_SIZE << order) && order <= MAX_ORDER)
|
||||
while (requested > (PAGE_SIZE << order) && order <= MAX_PAGE_ORDER)
|
||||
order++;
|
||||
|
||||
err = vmlfb_alloc_vram_area(va, order, 0);
|
||||
|
@ -33,7 +33,7 @@
|
||||
#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
|
||||
__GFP_NOMEMALLOC)
|
||||
/* The order of free page blocks to report to host */
|
||||
#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_ORDER
|
||||
#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_PAGE_ORDER
|
||||
/* The size of a free page block in bytes */
|
||||
#define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
|
||||
(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
|
||||
|
@ -1154,13 +1154,13 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
|
||||
*/
|
||||
static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
|
||||
{
|
||||
unsigned long order = MAX_ORDER;
|
||||
unsigned long order = MAX_PAGE_ORDER;
|
||||
unsigned long i;
|
||||
|
||||
/*
|
||||
* We might get called for ranges that don't cover properly aligned
|
||||
* MAX_ORDER pages; however, we can only online properly aligned
|
||||
* pages with an order of MAX_ORDER at maximum.
|
||||
* MAX_PAGE_ORDER pages; however, we can only online properly aligned
|
||||
* pages with an order of MAX_PAGE_ORDER at maximum.
|
||||
*/
|
||||
while (!IS_ALIGNED(pfn | nr_pages, 1 << order))
|
||||
order--;
|
||||
@ -1280,7 +1280,7 @@ static void virtio_mem_online_page(struct virtio_mem *vm,
|
||||
bool do_online;
|
||||
|
||||
/*
|
||||
* We can get called with any order up to MAX_ORDER. If our subblock
|
||||
* We can get called with any order up to MAX_PAGE_ORDER. If our subblock
|
||||
* size is smaller than that and we have a mixture of plugged and
|
||||
* unplugged subblocks within such a page, we have to process in
|
||||
* smaller granularity. In that case we'll adjust the order exactly once
|
||||
|
fs/Kconfig
@ -258,7 +258,7 @@ config TMPFS_QUOTA
|
||||
config ARCH_SUPPORTS_HUGETLBFS
|
||||
def_bool n
|
||||
|
||||
config HUGETLBFS
|
||||
menuconfig HUGETLBFS
|
||||
bool "HugeTLB file system support"
|
||||
depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
|
||||
depends on (SYSFS || SYSCTL)
|
||||
@ -270,6 +270,17 @@ config HUGETLBFS
|
||||
|
||||
If unsure, say N.
|
||||
|
||||
if HUGETLBFS
|
||||
config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
|
||||
bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
|
||||
default n
|
||||
depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
|
||||
help
|
||||
The HugeTLB Vmemmap Optimization (HVO) defaults to off. Say Y here to
|
||||
enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
|
||||
(boot command line) or hugetlb_optimize_vmemmap (sysctl).
|
||||
endif # HUGETLBFS
|
||||
|
||||
config HUGETLB_PAGE
|
||||
def_bool HUGETLBFS
|
||||
select XARRAY_MULTI
|
||||
@ -279,15 +290,6 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
|
||||
depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
|
||||
depends on SPARSEMEM_VMEMMAP
|
||||
|
||||
config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
|
||||
bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
|
||||
default n
|
||||
depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
|
||||
help
|
||||
The HugeTLB Vmemmap Optimization (HVO) defaults to off. Say Y here to
|
||||
enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
|
||||
(boot command line) or hugetlb_optimize_vmemmap (sysctl).
|
||||
|
||||
config ARCH_HAS_GIGANTIC_PAGE
|
||||
bool
|
||||
|
||||
|
@ -5,6 +5,7 @@
|
||||
* Copyright (C) 1997-1999 Russell King
|
||||
*/
|
||||
#include <linux/buffer_head.h>
|
||||
#include <linux/mpage.h>
|
||||
#include <linux/writeback.h>
|
||||
#include "adfs.h"
|
||||
|
||||
@ -33,9 +34,10 @@ abort_toobig:
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int adfs_writepage(struct page *page, struct writeback_control *wbc)
|
||||
static int adfs_writepages(struct address_space *mapping,
|
||||
struct writeback_control *wbc)
|
||||
{
|
||||
return block_write_full_page(page, adfs_get_block, wbc);
|
||||
return mpage_writepages(mapping, wbc, adfs_get_block);
|
||||
}
|
||||
|
||||
static int adfs_read_folio(struct file *file, struct folio *folio)
|
||||
@ -76,10 +78,11 @@ static const struct address_space_operations adfs_aops = {
|
||||
.dirty_folio = block_dirty_folio,
|
||||
.invalidate_folio = block_invalidate_folio,
|
||||
.read_folio = adfs_read_folio,
|
||||
.writepage = adfs_writepage,
|
||||
.writepages = adfs_writepages,
|
||||
.write_begin = adfs_write_begin,
|
||||
.write_end = generic_write_end,
|
||||
.bmap = _adfs_bmap
|
||||
.migrate_folio = buffer_migrate_folio,
|
||||
.bmap = _adfs_bmap,
|
||||
};
|
||||
|
||||
/*
|
||||
|
@ -242,7 +242,7 @@ static void afs_kill_pages(struct address_space *mapping,
|
||||
folio_clear_uptodate(folio);
|
||||
folio_end_writeback(folio);
|
||||
folio_lock(folio);
|
||||
generic_error_remove_page(mapping, &folio->page);
|
||||
generic_error_remove_folio(mapping, folio);
|
||||
folio_unlock(folio);
|
||||
folio_put(folio);
|
||||
|
||||
@ -559,8 +559,7 @@ static void afs_extend_writeback(struct address_space *mapping,
|
||||
|
||||
if (!folio_clear_dirty_for_io(folio))
|
||||
BUG();
|
||||
if (folio_start_writeback(folio))
|
||||
BUG();
|
||||
folio_start_writeback(folio);
|
||||
afs_folio_start_fscache(caching, folio);
|
||||
|
||||
*_count -= folio_nr_pages(folio);
|
||||
@ -595,8 +594,7 @@ static ssize_t afs_write_back_from_locked_folio(struct address_space *mapping,
|
||||
|
||||
_enter(",%lx,%llx-%llx", folio_index(folio), start, end);
|
||||
|
||||
if (folio_start_writeback(folio))
|
||||
BUG();
|
||||
folio_start_writeback(folio);
|
||||
afs_folio_start_fscache(caching, folio);
|
||||
|
||||
count -= folio_nr_pages(folio);
|
||||
|
@ -1131,7 +1131,7 @@ static const struct address_space_operations bch_address_space_operations = {
|
||||
#ifdef CONFIG_MIGRATION
|
||||
.migrate_folio = filemap_migrate_folio,
|
||||
#endif
|
||||
.error_remove_page = generic_error_remove_page,
|
||||
.error_remove_folio = generic_error_remove_folio,
|
||||
};
|
||||
|
||||
struct bcachefs_fid {
|
||||
|
@ -11,6 +11,7 @@
|
||||
*/
|
||||
|
||||
#include <linux/fs.h>
|
||||
#include <linux/mpage.h>
|
||||
#include <linux/buffer_head.h>
|
||||
#include "bfs.h"
|
||||
|
||||
@ -150,9 +151,10 @@ out:
|
||||
return err;
|
||||
}
|
||||
|
||||
static int bfs_writepage(struct page *page, struct writeback_control *wbc)
|
||||
static int bfs_writepages(struct address_space *mapping,
|
||||
struct writeback_control *wbc)
|
||||
{
|
||||
return block_write_full_page(page, bfs_get_block, wbc);
|
||||
return mpage_writepages(mapping, wbc, bfs_get_block);
|
||||
}
|
||||
|
||||
static int bfs_read_folio(struct file *file, struct folio *folio)
|
||||
@ -190,9 +192,10 @@ const struct address_space_operations bfs_aops = {
|
||||
.dirty_folio = block_dirty_folio,
|
||||
.invalidate_folio = block_invalidate_folio,
|
||||
.read_folio = bfs_read_folio,
|
||||
.writepage = bfs_writepage,
|
||||
.writepages = bfs_writepages,
|
||||
.write_begin = bfs_write_begin,
|
||||
.write_end = generic_write_end,
|
||||
.migrate_folio = buffer_migrate_folio,
|
||||
.bmap = bfs_bmap,
|
||||
};
|
||||
|
||||
|
@ -10930,7 +10930,7 @@ static const struct address_space_operations btrfs_aops = {
|
||||
.release_folio = btrfs_release_folio,
|
||||
.migrate_folio = btrfs_migrate_folio,
|
||||
.dirty_folio = filemap_dirty_folio,
|
||||
.error_remove_page = generic_error_remove_page,
|
||||
.error_remove_folio = generic_error_remove_folio,
|
||||
.swap_activate = btrfs_swap_activate,
|
||||
.swap_deactivate = btrfs_swap_deactivate,
|
||||
};
|
||||
|
fs/buffer.c
@ -199,7 +199,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
|
||||
int all_mapped = 1;
|
||||
static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);
|
||||
|
||||
index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
|
||||
index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
|
||||
folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
|
||||
if (IS_ERR(folio))
|
||||
goto out;
|
||||
@ -372,10 +372,10 @@ static void end_buffer_async_read_io(struct buffer_head *bh, int uptodate)
|
||||
}
|
||||
|
||||
/*
|
||||
* Completion handler for block_write_full_page() - pages which are unlocked
|
||||
* during I/O, and which have PageWriteback cleared upon I/O completion.
|
||||
* Completion handler for block_write_full_folio() - folios which are unlocked
|
||||
* during I/O, and which have the writeback flag cleared upon I/O completion.
|
||||
*/
|
||||
void end_buffer_async_write(struct buffer_head *bh, int uptodate)
|
||||
static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
|
||||
{
|
||||
unsigned long flags;
|
||||
struct buffer_head *first;
|
||||
@ -415,7 +415,6 @@ still_busy:
|
||||
spin_unlock_irqrestore(&first->b_uptodate_lock, flags);
|
||||
return;
|
||||
}
|
||||
EXPORT_SYMBOL(end_buffer_async_write);
|
||||
|
||||
/*
|
||||
* If a page's buffers are under async readin (end_buffer_async_read
|
||||
@@ -995,11 +994,12 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
  * Initialise the state of a blockdev folio's buffers.
  */
 static sector_t folio_init_buffers(struct folio *folio,
-                struct block_device *bdev, sector_t block, int size)
+                struct block_device *bdev, unsigned size)
 {
         struct buffer_head *head = folio_buffers(folio);
         struct buffer_head *bh = head;
         bool uptodate = folio_test_uptodate(folio);
+        sector_t block = div_u64(folio_pos(folio), size);
         sector_t end_block = blkdev_max_block(bdev, size);
 
         do {
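(Aside, not from the patch: the arithmetic change above recurs throughout fs/buffer.c below. Instead of shifting a page index by PAGE_SHIFT minus the block-size bits, the new code takes the folio's byte offset with folio_pos() and divides by the block size, which also behaves correctly for folios larger than one page. An illustrative helper under that assumption, with an invented name:)

#include <linux/math64.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: first block number covered by @folio for a given
 * block size, mirroring the div_u64(folio_pos(folio), size) expressions
 * introduced in this diff.
 */
static inline sector_t examplefs_first_block(struct folio *folio, unsigned size)
{
        return div_u64(folio_pos(folio), size);
}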
@@ -1024,40 +1024,49 @@ static sector_t folio_init_buffers(struct folio *folio,
 }
 
 /*
- * Create the page-cache page that contains the requested block.
+ * Create the page-cache folio that contains the requested block.
  *
  * This is used purely for blockdev mappings.
+ *
+ * Returns false if we have a failure which cannot be cured by retrying
+ * without sleeping. Returns true if we succeeded, or the caller should retry.
  */
-static int
-grow_dev_page(struct block_device *bdev, sector_t block,
-                pgoff_t index, int size, int sizebits, gfp_t gfp)
+static bool grow_dev_folio(struct block_device *bdev, sector_t block,
+                pgoff_t index, unsigned size, gfp_t gfp)
 {
         struct inode *inode = bdev->bd_inode;
         struct folio *folio;
         struct buffer_head *bh;
-        sector_t end_block;
-        int ret = 0;
+        sector_t end_block = 0;
 
         folio = __filemap_get_folio(inode->i_mapping, index,
                         FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
         if (IS_ERR(folio))
-                return PTR_ERR(folio);
+                return false;
 
         bh = folio_buffers(folio);
         if (bh) {
                 if (bh->b_size == size) {
-                        end_block = folio_init_buffers(folio, bdev,
-                                        (sector_t)index << sizebits, size);
-                        goto done;
+                        end_block = folio_init_buffers(folio, bdev, size);
+                        goto unlock;
                 }
-                if (!try_to_free_buffers(folio))
-                        goto failed;
+
+                /*
+                 * Retrying may succeed; for example the folio may finish
+                 * writeback, or buffers may be cleaned. This should not
+                 * happen very often; maybe we have old buffers attached to
+                 * this blockdev's page cache and we're trying to change
+                 * the block size?
+                 */
+                if (!try_to_free_buffers(folio)) {
+                        end_block = ~0ULL;
+                        goto unlock;
+                }
         }
 
-        ret = -ENOMEM;
         bh = folio_alloc_buffers(folio, size, gfp | __GFP_ACCOUNT);
         if (!bh)
-                goto failed;
+                goto unlock;
 
         /*
          * Link the folio to the buffers and initialise them. Take the
@@ -1066,44 +1075,37 @@ grow_dev_page(struct block_device *bdev, sector_t block,
          */
         spin_lock(&inode->i_mapping->i_private_lock);
         link_dev_buffers(folio, bh);
-        end_block = folio_init_buffers(folio, bdev,
-                        (sector_t)index << sizebits, size);
+        end_block = folio_init_buffers(folio, bdev, size);
         spin_unlock(&inode->i_mapping->i_private_lock);
-done:
-        ret = (block < end_block) ? 1 : -ENXIO;
-failed:
+unlock:
         folio_unlock(folio);
         folio_put(folio);
-        return ret;
+        return block < end_block;
 }
 
 /*
- * Create buffers for the specified block device block's page. If
- * that page was dirty, the buffers are set dirty also.
+ * Create buffers for the specified block device block's folio. If
+ * that folio was dirty, the buffers are set dirty also. Returns false
+ * if we've hit a permanent error.
  */
-static int
-grow_buffers(struct block_device *bdev, sector_t block, int size, gfp_t gfp)
+static bool grow_buffers(struct block_device *bdev, sector_t block,
+                unsigned size, gfp_t gfp)
 {
-        pgoff_t index;
-        int sizebits;
-
-        sizebits = PAGE_SHIFT - __ffs(size);
-        index = block >> sizebits;
+        loff_t pos;
 
         /*
-         * Check for a block which wants to lie outside our maximum possible
-         * pagecache index. (this comparison is done using sector_t types).
+         * Check for a block which lies outside our maximum possible
+         * pagecache index.
          */
-        if (unlikely(index != block >> sizebits)) {
-                printk(KERN_ERR "%s: requested out-of-range block %llu for "
-                        "device %pg\n",
+        if (check_mul_overflow(block, (sector_t)size, &pos) || pos > MAX_LFS_FILESIZE) {
+                printk(KERN_ERR "%s: requested out-of-range block %llu for device %pg\n",
                         __func__, (unsigned long long)block,
                         bdev);
-                return -EIO;
+                return false;
         }
 
-        /* Create a page with the proper size buffers.. */
-        return grow_dev_page(bdev, block, index, size, sizebits, gfp);
+        /* Create a folio with the proper size buffers */
+        return grow_dev_folio(bdev, block, pos / PAGE_SIZE, size, gfp);
 }
 
 static struct buffer_head *
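(Aside, not from the patch: the rewritten range check above multiplies the block number by the block size with overflow detection rather than reasoning about shift widths, then derives the page-cache index from the byte position. A sketch of the same idea with an invented name; check_mul_overflow() stores the product in its third argument and returns true on overflow.)

#include <linux/overflow.h>
#include <linux/fs.h>

/* Illustrative only: mirrors the check added to grow_buffers() above. */
static bool examplefs_block_in_range(sector_t block, unsigned size)
{
        loff_t pos;

        /* reject a block whose byte position cannot be represented */
        if (check_mul_overflow(block, (sector_t)size, &pos))
                return false;
        /* pos / PAGE_SIZE is then a valid page-cache index */
        return pos <= MAX_LFS_FILESIZE;
}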
@@ -1124,14 +1126,12 @@ __getblk_slow(struct block_device *bdev, sector_t block,
 
         for (;;) {
                 struct buffer_head *bh;
-                int ret;
 
                 bh = __find_get_block(bdev, block, size);
                 if (bh)
                         return bh;
 
-                ret = grow_buffers(bdev, block, size, gfp);
-                if (ret < 0)
+                if (!grow_buffers(bdev, block, size, gfp))
                         return NULL;
         }
 }
@@ -1699,13 +1699,13 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
         struct inode *bd_inode = bdev->bd_inode;
         struct address_space *bd_mapping = bd_inode->i_mapping;
         struct folio_batch fbatch;
-        pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
+        pgoff_t index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
         pgoff_t end;
         int i, count;
         struct buffer_head *bh;
         struct buffer_head *head;
 
-        end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits);
+        end = ((loff_t)(block + len - 1) << bd_inode->i_blkbits) / PAGE_SIZE;
         folio_batch_init(&fbatch);
         while (filemap_get_folios(bd_mapping, &index, end, &fbatch)) {
                 count = folio_batch_count(&fbatch);
@@ -1748,19 +1748,6 @@ unlock_page:
 }
 EXPORT_SYMBOL(clean_bdev_aliases);
 
-/*
- * Size is a power-of-two in the range 512..PAGE_SIZE,
- * and the case we care about most is PAGE_SIZE.
- *
- * So this *could* possibly be written with those
- * constraints in mind (relevant mostly if some
- * architecture has a slow bit-scan instruction)
- */
-static inline int block_size_bits(unsigned int blocksize)
-{
-        return ilog2(blocksize);
-}
-
 static struct buffer_head *folio_create_buffers(struct folio *folio,
                                                 struct inode *inode,
                                                 unsigned int b_state)
@@ -1790,30 +1777,29 @@ static struct buffer_head *folio_create_buffers(struct folio *folio,
  */
 
 /*
- * While block_write_full_page is writing back the dirty buffers under
+ * While block_write_full_folio is writing back the dirty buffers under
  * the page lock, whoever dirtied the buffers may decide to clean them
  * again at any time. We handle that by only looking at the buffer
  * state inside lock_buffer().
  *
- * If block_write_full_page() is called for regular writeback
+ * If block_write_full_folio() is called for regular writeback
  * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
  * locked buffer. This only can happen if someone has written the buffer
  * directly, with submit_bh(). At the address_space level PageWriteback
  * prevents this contention from occurring.
  *
- * If block_write_full_page() is called with wbc->sync_mode ==
+ * If block_write_full_folio() is called with wbc->sync_mode ==
  * WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
  * causes the writes to be flagged as synchronous writes.
  */
 int __block_write_full_folio(struct inode *inode, struct folio *folio,
-                get_block_t *get_block, struct writeback_control *wbc,
-                bh_end_io_t *handler)
+                get_block_t *get_block, struct writeback_control *wbc)
 {
         int err;
         sector_t block;
         sector_t last_block;
         struct buffer_head *bh, *head;
-        unsigned int blocksize, bbits;
+        size_t blocksize;
         int nr_underway = 0;
         blk_opf_t write_flags = wbc_to_write_flags(wbc);
 
@@ -1832,10 +1818,9 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
 
         bh = head;
         blocksize = bh->b_size;
-        bbits = block_size_bits(blocksize);
 
-        block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
-        last_block = (i_size_read(inode) - 1) >> bbits;
+        block = div_u64(folio_pos(folio), blocksize);
+        last_block = div_u64(i_size_read(inode) - 1, blocksize);
 
         /*
          * Get all the dirty buffers mapped to disk addresses and
@@ -1849,7 +1834,7 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
                          * truncate in progress.
                          */
                         /*
-                         * The buffer was zeroed by block_write_full_page()
+                         * The buffer was zeroed by block_write_full_folio()
                          */
                         clear_buffer_dirty(bh);
                         set_buffer_uptodate(bh);
@@ -1887,7 +1872,8 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
                         continue;
                 }
                 if (test_clear_buffer_dirty(bh)) {
-                        mark_buffer_async_write_endio(bh, handler);
+                        mark_buffer_async_write_endio(bh,
+                                        end_buffer_async_write);
                 } else {
                         unlock_buffer(bh);
                 }
@@ -1940,7 +1926,8 @@ recover:
                 if (buffer_mapped(bh) && buffer_dirty(bh) &&
                     !buffer_delay(bh)) {
                         lock_buffer(bh);
-                        mark_buffer_async_write_endio(bh, handler);
+                        mark_buffer_async_write_endio(bh,
+                                        end_buffer_async_write);
                 } else {
                         /*
                          * The buffer may have been set dirty during
@@ -2014,7 +2001,7 @@ static int
 iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
                 const struct iomap *iomap)
 {
-        loff_t offset = block << inode->i_blkbits;
+        loff_t offset = (loff_t)block << inode->i_blkbits;
 
         bh->b_bdev = iomap->bdev;
 
@@ -2081,27 +2068,24 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
                 get_block_t *get_block, const struct iomap *iomap)
 {
-        unsigned from = pos & (PAGE_SIZE - 1);
-        unsigned to = from + len;
+        size_t from = offset_in_folio(folio, pos);
+        size_t to = from + len;
         struct inode *inode = folio->mapping->host;
-        unsigned block_start, block_end;
+        size_t block_start, block_end;
         sector_t block;
         int err = 0;
-        unsigned blocksize, bbits;
+        size_t blocksize;
         struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
 
         BUG_ON(!folio_test_locked(folio));
-        BUG_ON(from > PAGE_SIZE);
-        BUG_ON(to > PAGE_SIZE);
+        BUG_ON(to > folio_size(folio));
         BUG_ON(from > to);
 
         head = folio_create_buffers(folio, inode, 0);
         blocksize = head->b_size;
-        bbits = block_size_bits(blocksize);
+        block = div_u64(folio_pos(folio), blocksize);
 
-        block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
-
-        for(bh = head, block_start = 0; bh != head || !block_start;
+        for (bh = head, block_start = 0; bh != head || !block_start;
             block++, block_start=block_end, bh = bh->b_this_page) {
                 block_end = block_start + blocksize;
                 if (block_end <= from || block_start >= to) {
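(Aside, not from the patch: replacing "pos & (PAGE_SIZE - 1)" with offset_in_folio() is what lets __block_write_begin_int() operate on folios larger than a single page, since offsets are now measured within the whole folio. A small sketch under that assumption, with an invented name:)

#include <linux/bug.h>
#include <linux/pagemap.h>

/* Illustrative only: byte range of a write within its (possibly large) folio. */
static void examplefs_write_range(struct folio *folio, loff_t pos, size_t len,
                                  size_t *from, size_t *to)
{
        *from = offset_in_folio(folio, pos);
        *to = *from + len;
        BUG_ON(*to > folio_size(folio));        /* same sanity check as above */
}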
@@ -2364,7 +2348,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
         struct inode *inode = folio->mapping->host;
         sector_t iblock, lblock;
         struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
-        unsigned int blocksize, bbits;
+        size_t blocksize;
         int nr, i;
         int fully_mapped = 1;
         bool page_error = false;
@@ -2378,10 +2362,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 
         head = folio_create_buffers(folio, inode, 0);
         blocksize = head->b_size;
-        bbits = block_size_bits(blocksize);
 
-        iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
-        lblock = (limit+blocksize-1) >> bbits;
+        iblock = div_u64(folio_pos(folio), blocksize);
+        lblock = div_u64(limit + blocksize - 1, blocksize);
         bh = head;
         nr = 0;
         i = 0;
@@ -2666,8 +2649,8 @@ int block_truncate_page(struct address_space *mapping,
                 return 0;
 
         length = blocksize - length;
-        iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);
+        iblock = ((loff_t)index * PAGE_SIZE) >> inode->i_blkbits;
 
         folio = filemap_grab_folio(mapping, index);
         if (IS_ERR(folio))
                 return PTR_ERR(folio);
@@ -2720,17 +2703,15 @@ EXPORT_SYMBOL(block_truncate_page);
 /*
  * The generic ->writepage function for buffer-backed address_spaces
  */
-int block_write_full_page(struct page *page, get_block_t *get_block,
-                struct writeback_control *wbc)
+int block_write_full_folio(struct folio *folio, struct writeback_control *wbc,
+                void *get_block)
 {
-        struct folio *folio = page_folio(page);
         struct inode * const inode = folio->mapping->host;
         loff_t i_size = i_size_read(inode);
 
         /* Is the folio fully inside i_size? */
         if (folio_pos(folio) + folio_size(folio) <= i_size)
-                return __block_write_full_folio(inode, folio, get_block, wbc,
-                                end_buffer_async_write);
+                return __block_write_full_folio(inode, folio, get_block, wbc);
 
         /* Is the folio fully outside i_size? (truncate in progress) */
         if (folio_pos(folio) >= i_size) {
@@ -2747,10 +2728,8 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
          */
         folio_zero_segment(folio, offset_in_folio(folio, i_size),
                         folio_size(folio));
-        return __block_write_full_folio(inode, folio, get_block, wbc,
-                        end_buffer_async_write);
+        return __block_write_full_folio(inode, folio, get_block, wbc);
 }
-EXPORT_SYMBOL(block_write_full_page);
 
 sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
                             get_block_t *get_block)
 
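(Aside, not from the patch: with the new signature, block_write_full_folio() has the (folio, wbc, data) shape of a writepage_t callback, so a buffer-backed filesystem that still wants per-folio writeback could hand it to write_cache_pages() from its ->writepages, passing the get_block routine as the opaque data pointer. A hedged sketch under that assumption, with invented "examplefs" names:)

#include <linux/fs.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>

/* Assumed to exist for this sketch. */
extern int examplefs_get_block(struct inode *inode, sector_t block,
                               struct buffer_head *bh_result, int create);

static int examplefs_writepages(struct address_space *mapping,
                                struct writeback_control *wbc)
{
        /* the void * data argument is interpreted as the get_block routine */
        return write_cache_pages(mapping, wbc, block_write_full_folio,
                                 examplefs_get_block);
}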
@@ -907,8 +907,8 @@ static void writepages_finish(struct ceph_osd_request *req)
                 doutc(cl, "unlocking %p\n", page);
 
                 if (remove_page)
-                        generic_error_remove_page(inode->i_mapping,
-                                        page);
+                        generic_error_remove_folio(inode->i_mapping,
+                                        page_folio(page));
 
                 unlock_page(page);
         }
@@ -428,7 +428,8 @@ static void d_lru_add(struct dentry *dentry)
         this_cpu_inc(nr_dentry_unused);
         if (d_is_negative(dentry))
                 this_cpu_inc(nr_dentry_negative);
-        WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+        WARN_ON_ONCE(!list_lru_add_obj(
+                        &dentry->d_sb->s_dentry_lru, &dentry->d_lru));
 }
 
 static void d_lru_del(struct dentry *dentry)
@@ -438,7 +439,8 @@ static void d_lru_del(struct dentry *dentry)
         this_cpu_dec(nr_dentry_unused);
         if (d_is_negative(dentry))
                 this_cpu_dec(nr_dentry_negative);
-        WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+        WARN_ON_ONCE(!list_lru_del_obj(
+                        &dentry->d_sb->s_dentry_lru, &dentry->d_lru));
 }
 
 static void d_shrink_del(struct dentry *dentry)
@@ -1240,7 +1242,7 @@ static enum lru_status dentry_lru_isolate(struct list_head *item,
          *
          * This is guaranteed by the fact that all LRU management
          * functions are intermediated by the LRU API calls like
-         * list_lru_add and list_lru_del. List movement in this file
+         * list_lru_add_obj and list_lru_del_obj. List movement in this file
          * only ever occur through this functions or through callbacks
          * like this one, that are called from the LRU API.
          *
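(Aside, not from the diff: the dcache hunks above track this cycle's list_lru API rework, in which the old two-argument add/del helpers become list_lru_add_obj()/list_lru_del_obj(). A hedged sketch of a caller using the _obj variants, with invented "examplefs" names; the _obj helpers keep deriving the NUMA node and memcg from the object's address, which is what the old calls did.)

#include <linux/list_lru.h>
#include <linux/bug.h>

struct examplefs_object {
        struct list_head lru;   /* linked onto a per-sb list_lru */
};

static void examplefs_lru_add(struct list_lru *lru, struct examplefs_object *obj)
{
        /* returns false if the item was already on a list */
        WARN_ON_ONCE(!list_lru_add_obj(lru, &obj->lru));
}

static void examplefs_lru_del(struct list_lru *lru, struct examplefs_object *obj)
{
        WARN_ON_ONCE(!list_lru_del_obj(lru, &obj->lru));
}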
@@ -969,7 +969,7 @@ const struct address_space_operations ext2_aops = {
         .writepages = ext2_writepages,
         .migrate_folio = buffer_migrate_folio,
         .is_partially_uptodate = block_is_partially_uptodate,
-        .error_remove_page = generic_error_remove_page,
+        .error_remove_folio = generic_error_remove_folio,
 };
 
 static const struct address_space_operations ext2_dax_aops = {
Some files were not shown because too many files have changed in this diff.