// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * zswap.c - zswap driver file
 *
 * zswap is a cache that takes pages that are in the process
 * of being swapped out and attempts to compress and store them in a
 * RAM-based memory pool.  This can result in a significant I/O reduction on
 * the swap device and, in the case where decompressing from RAM is faster
 * than reading from the swap device, can also improve workload performance.
 *
 * Copyright (C) 2012  Seth Jennings <sjenning@linux.vnet.ibm.com>
*/

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/module.h>
#include <linux/cpu.h>
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/atomic.h>
#include <linux/swap.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>
#include <linux/mempolicy.h>
#include <linux/mempool.h>
#include <linux/zpool.h>
#include <crypto/acompress.h>
#include <linux/zswap.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
#include <linux/swapops.h>
#include <linux/writeback.h>
#include <linux/pagemap.h>
#include <linux/workqueue.h>
#include <linux/list_lru.h>

#include "swap.h"
#include "internal.h"

/*********************************
* statistics
**********************************/
/* The number of compressed pages currently stored in zswap */
atomic_t zswap_stored_pages = ATOMIC_INIT(0);
/* The number of same-value filled pages currently stored in zswap */
static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);

/*
 * The statistics below are not protected from concurrent access for
 * performance reasons so they may not be 100% accurate.  However,
 * they do provide useful information on roughly how many times a
 * certain event is occurring.
 */

/* Pool limit was hit (see zswap_max_pool_percent) */
static u64 zswap_pool_limit_hit;
/* Pages written back when pool limit was reached */
static u64 zswap_written_back_pages;
/* Store failed due to a reclaim failure after pool limit was reached */
static u64 zswap_reject_reclaim_fail;
/* Store failed due to compression algorithm failure */
static u64 zswap_reject_compress_fail;
/* Compressed page was too big for the allocator to (optimally) store */
static u64 zswap_reject_compress_poor;
/* Store failed because underlying allocator could not get memory */
static u64 zswap_reject_alloc_fail;
/* Store failed because the entry metadata could not be allocated (rare) */
static u64 zswap_reject_kmemcache_fail;

/* Shrinker work queue */
static struct workqueue_struct *shrink_wq;
/* Pool limit was hit, we need to calm down */
static bool zswap_pool_reached_full;

/*********************************
* tunables
**********************************/

#define ZSWAP_PARAM_UNSET ""

static int zswap_setup(void);

/* Enable/disable zswap */
static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);
static int zswap_enabled_param_set(const char *,
				   const struct kernel_param *);
static const struct kernel_param_ops zswap_enabled_param_ops = {
	.set = zswap_enabled_param_set,
	.get = param_get_bool,
};
module_param_cb(enabled, &zswap_enabled_param_ops, &zswap_enabled, 0644);
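
/*
 * Note: parameters registered here with mode 0644 are also exposed at
 * runtime under /sys/module/zswap/parameters/; for example, zswap can be
 * toggled with
 *	echo Y > /sys/module/zswap/parameters/enabled
 * subject to the checks performed by the corresponding param_set callback.
 */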

/* Crypto compressor to use */
static char *zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT;
static int zswap_compressor_param_set(const char *,
				      const struct kernel_param *);
static const struct kernel_param_ops zswap_compressor_param_ops = {
	.set = zswap_compressor_param_set,
	.get = param_get_charp,
	.free = param_free_charp,
};
module_param_cb(compressor, &zswap_compressor_param_ops,
		&zswap_compressor, 0644);

/* Compressed storage zpool to use */
static char *zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT;
static int zswap_zpool_param_set(const char *, const struct kernel_param *);
static const struct kernel_param_ops zswap_zpool_param_ops = {
	.set = zswap_zpool_param_set,
	.get = param_get_charp,
	.free = param_free_charp,
};
module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);

/* The maximum percentage of memory that the compressed pool can occupy */
static unsigned int zswap_max_pool_percent = 20;
module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);

/* The threshold for accepting new pages after the max_pool_percent was hit */
static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
		   uint, 0644);
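
/*
 * Illustrative arithmetic for the two knobs above (example values, not a
 * recommendation): with max_pool_percent = 20 and accept_threshold_percent
 * = 90, new stores are rejected once the pool reaches 20% of total RAM and
 * are accepted again only after the pool has shrunk back below 90% of that
 * limit, i.e. 18% of total RAM.
 */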

/* Number of zpools in zswap_pool (empirically determined for scalability) */
#define ZSWAP_NR_ZPOOLS 32
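
/*
 * Scalability note: spreading a pool's storage over several zpool instances
 * also spreads the allocator-internal locking; each stored entry is assigned
 * to one of the zpools (the exact mapping is a detail of the store path
 * further down, not fixed by this definition).
 */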

/* Enable/disable memory pressure-based shrinker. */
static bool zswap_shrinker_enabled = IS_ENABLED(
		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
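
/*
 * When enabled, the shrinker (registered further down as zswap_shrinker)
 * writes cold entries on the zswap LRU back to the backing swap device
 * under memory pressure, rather than shrinking only when the pool limit
 * is hit.
 */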

bool is_zswap_enabled(void)
{
	return zswap_enabled;
}

/*********************************
* data structures
**********************************/
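
/*
 * Per-CPU (de)compression context, allocated for each zswap_pool (see the
 * __percpu acomp_ctx member below): "buffer" is scratch space for the
 * compressed data, "mutex" serializes use of the request, and
 * "is_sleepable" records whether the backing acomp driver may sleep.
 */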

struct crypto_acomp_ctx {
	struct crypto_acomp *acomp;
	struct acomp_req *req;
	struct crypto_wait wait;
	u8 *buffer;
	struct mutex mutex;
	bool is_sleepable;
};

/*
 * The lock ordering is zswap_tree.lock -> zswap_pool.lru_lock.
 * The only case where lru_lock is not acquired while holding tree.lock is
 * when a zswap_entry is taken off the lru for writeback, in that case it
 * needs to be verified that it's still valid in the tree.
 */
struct zswap_pool {
	struct zpool *zpools[ZSWAP_NR_ZPOOLS];
	struct crypto_acomp_ctx __percpu *acomp_ctx;
	struct percpu_ref ref;
	struct list_head list;
	struct work_struct release_work;
	struct hlist_node node;
	char tfm_name[CRYPTO_MAX_ALG_NAME];
};

/* Global LRU lists shared by all zswap pools. */
static struct list_lru zswap_list_lru;

/* The lock protects zswap_next_shrink updates. */
static DEFINE_SPINLOCK(zswap_shrink_lock);
static struct mem_cgroup *zswap_next_shrink;
static struct work_struct zswap_shrink_work;
static struct shrinker *zswap_shrinker;

/*
 * struct zswap_entry
 *
 * This structure contains the metadata for tracking a single compressed
 * page within zswap.
 *
 * swpentry - associated swap entry, the offset indexes into the tree
 * length - the length in bytes of the compressed page data.  Needed during
 *          decompression. For a same value filled page length is 0, and both
 *          pool and lru are invalid and must be ignored.
 * pool - the zswap_pool the entry's data is in
 * handle - zpool allocation handle that stores the compressed page data
 * value - value of the same-value filled pages which have same content
 * objcg - the obj_cgroup that the compressed memory is charged to
 * lru - handle to the pool's lru used to evict pages.
 */
struct zswap_entry {
	swp_entry_t swpentry;
	unsigned int length;
	struct zswap_pool *pool;
	union {
		unsigned long handle;
		unsigned long value;
	};
	struct obj_cgroup *objcg;
	struct list_head lru;
};

static struct xarray *zswap_trees[MAX_SWAPFILES];
static unsigned int nr_zswap_trees[MAX_SWAPFILES];

/* RCU-protected iteration */
static LIST_HEAD(zswap_pools);
/* protects zswap_pools list modification */
static DEFINE_SPINLOCK(zswap_pools_lock);
/* pool counter to provide unique names to zpool */
static atomic_t zswap_pools_count = ATOMIC_INIT(0);

enum zswap_init_type {
	ZSWAP_UNINIT,
	ZSWAP_INIT_SUCCEED,
	ZSWAP_INIT_FAILED
};

static enum zswap_init_type zswap_init_state;

/* used to ensure the integrity of initialization */
static DEFINE_MUTEX(zswap_init_lock);
zswap: disable changing params if init fails
Add zswap_init_failed bool that prevents changing any of the module
params, if init_zswap() fails, and set zswap_enabled to false. Change
'enabled' param to a callback, and check zswap_init_failed before
allowing any change to 'enabled', 'zpool', or 'compressor' params.
Any driver that is built-in to the kernel will not be unloaded if its
init function returns error, and its module params remain accessible for
users to change via sysfs. Since zswap uses param callbacks, which
assume that zswap has been initialized, changing the zswap params after
a failed initialization will result in WARNING due to the param
callbacks expecting a pool to already exist. This prevents that by
immediately exiting any of the param callbacks if initialization failed.
This was reported here:
https://marc.info/?l=linux-mm&m=147004228125528&w=4
And fixes this WARNING:
[ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60
The warning is just noise, and not serious. However, when init fails,
zswap frees all its percpu dstmem pages and its kmem cache. The kmem
cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
the percpu dstmem pages are definitely a problem, as they're used as
temporary buffer for compressed pages before copying into place in the
zpool.
If the user does get zswap enabled after an init failure, then zswap
will likely Oops on the first page it tries to compress (or worse, start
corrupting memory).
Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime")
Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Marcin Miroslaw <marcin@mejor.pl>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 21:13:09 +00:00
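As a rough sketch of the guard described in the message above (assuming the zswap_init_failed flag it names; the code that follows in this file later replaced that flag with the zswap_init_state enum under zswap_init_lock), the 'enabled' callback reduces to something like:

static bool zswap_init_failed;	/* set by init_zswap() on failure (sketch) */

static int zswap_enabled_param_set(const char *val,
				   const struct kernel_param *kp)
{
	/* refuse to touch 'enabled' once initialization has failed */
	if (zswap_init_failed) {
		pr_err("can't enable, initialization failed\n");
		return -ENODEV;
	}
	return param_set_bool(val, kp);
}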
|
|
|
|
2017-02-27 22:26:47 +00:00
|
|
|
/* init completed, but couldn't create the initial pool */
|
|
|
|
static bool zswap_has_pool;
|
|
|
|
|
2015-09-09 22:35:19 +00:00
|
|
|
/*********************************
|
|
|
|
* helpers and fwd declarations
|
|
|
|
**********************************/
|
|
|
|
|
2024-03-26 18:35:43 +00:00
|
|
|
static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
|
2024-01-19 11:22:23 +00:00
|
|
|
{
|
|
|
|
return &zswap_trees[swp_type(swp)][swp_offset(swp)
|
|
|
|
>> SWAP_ADDRESS_SPACE_SHIFT];
|
|
|
|
}
|
|
|
|
|
2015-09-09 22:35:19 +00:00
|
|
|
#define zswap_pool_debug(msg, p) \
|
|
|
|
pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \
|
2023-06-20 19:46:44 +00:00
|
|
|
zpool_get_type((p)->zpools[0]))
|
2015-09-09 22:35:19 +00:00
|
|
|
|
2024-01-30 01:36:46 +00:00
|
|
|
/*********************************
|
|
|
|
* pool functions
|
|
|
|
**********************************/
|
2024-02-16 08:55:05 +00:00
|
|
|
static void __zswap_pool_empty(struct percpu_ref *ref);
|
2024-01-30 01:36:46 +00:00
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
char name[38]; /* 'zswap' + 32 char (max) num + \0 */
|
|
|
|
gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!zswap_has_pool) {
|
|
|
|
/* if either is unset, pool initialization failed, and we
|
|
|
|
* need both params to be set correctly before trying to
|
|
|
|
* create a pool.
|
|
|
|
*/
|
|
|
|
if (!strcmp(type, ZSWAP_PARAM_UNSET))
|
|
|
|
return NULL;
|
|
|
|
if (!strcmp(compressor, ZSWAP_PARAM_UNSET))
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
pool = kzalloc(sizeof(*pool), GFP_KERNEL);
|
|
|
|
if (!pool)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
for (i = 0; i < ZSWAP_NR_ZPOOLS; i++) {
|
|
|
|
/* unique name for each pool specifically required by zsmalloc */
|
|
|
|
snprintf(name, 38, "zswap%x",
|
|
|
|
atomic_inc_return(&zswap_pools_count));
|
|
|
|
|
|
|
|
pool->zpools[i] = zpool_create_pool(type, name, gfp);
|
|
|
|
if (!pool->zpools[i]) {
|
|
|
|
pr_err("%s zpool not available\n", type);
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
pr_debug("using %s zpool\n", zpool_get_type(pool->zpools[0]));
|
|
|
|
|
|
|
|
strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
|
|
|
|
|
|
|
|
pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
|
|
|
|
if (!pool->acomp_ctx) {
|
|
|
|
pr_err("percpu alloc failed\n");
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
|
|
|
|
&pool->node);
|
|
|
|
if (ret)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
/* being the current pool takes 1 ref; this func expects the
|
|
|
|
* caller to always add the new pool as the current pool
|
|
|
|
*/
|
2024-02-16 08:55:05 +00:00
|
|
|
ret = percpu_ref_init(&pool->ref, __zswap_pool_empty,
|
|
|
|
PERCPU_REF_ALLOW_REINIT, GFP_KERNEL);
|
|
|
|
if (ret)
|
|
|
|
goto ref_fail;
|
2024-01-30 01:36:46 +00:00
|
|
|
INIT_LIST_HEAD(&pool->list);
|
|
|
|
|
|
|
|
zswap_pool_debug("created", pool);
|
|
|
|
|
|
|
|
return pool;
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
ref_fail:
|
|
|
|
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
|
2024-01-30 01:36:46 +00:00
|
|
|
error:
|
|
|
|
if (pool->acomp_ctx)
|
|
|
|
free_percpu(pool->acomp_ctx);
|
|
|
|
while (i--)
|
|
|
|
zpool_destroy_pool(pool->zpools[i]);
|
|
|
|
kfree(pool);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *__zswap_pool_create_fallback(void)
|
|
|
|
{
|
|
|
|
bool has_comp, has_zpool;
|
|
|
|
|
|
|
|
has_comp = crypto_has_acomp(zswap_compressor, 0, 0);
|
|
|
|
if (!has_comp && strcmp(zswap_compressor,
|
|
|
|
CONFIG_ZSWAP_COMPRESSOR_DEFAULT)) {
|
|
|
|
pr_err("compressor %s not available, using default %s\n",
|
|
|
|
zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT);
|
|
|
|
param_free_charp(&zswap_compressor);
|
|
|
|
zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT;
|
|
|
|
has_comp = crypto_has_acomp(zswap_compressor, 0, 0);
|
|
|
|
}
|
|
|
|
if (!has_comp) {
|
|
|
|
pr_err("default compressor %s not available\n",
|
|
|
|
zswap_compressor);
|
|
|
|
param_free_charp(&zswap_compressor);
|
|
|
|
zswap_compressor = ZSWAP_PARAM_UNSET;
|
|
|
|
}
|
|
|
|
|
|
|
|
has_zpool = zpool_has_pool(zswap_zpool_type);
|
|
|
|
if (!has_zpool && strcmp(zswap_zpool_type,
|
|
|
|
CONFIG_ZSWAP_ZPOOL_DEFAULT)) {
|
|
|
|
pr_err("zpool %s not available, using default %s\n",
|
|
|
|
zswap_zpool_type, CONFIG_ZSWAP_ZPOOL_DEFAULT);
|
|
|
|
param_free_charp(&zswap_zpool_type);
|
|
|
|
zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT;
|
|
|
|
has_zpool = zpool_has_pool(zswap_zpool_type);
|
|
|
|
}
|
|
|
|
if (!has_zpool) {
|
|
|
|
pr_err("default zpool %s not available\n",
|
|
|
|
zswap_zpool_type);
|
|
|
|
param_free_charp(&zswap_zpool_type);
|
|
|
|
zswap_zpool_type = ZSWAP_PARAM_UNSET;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!has_comp || !has_zpool)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return zswap_pool_create(zswap_zpool_type, zswap_compressor);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_pool_destroy(struct zswap_pool *pool)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
zswap_pool_debug("destroying", pool);
|
|
|
|
|
|
|
|
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
|
|
|
|
free_percpu(pool->acomp_ctx);
|
|
|
|
|
|
|
|
for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
|
|
|
|
zpool_destroy_pool(pool->zpools[i]);
|
|
|
|
kfree(pool);
|
|
|
|
}
|
|
|
|
|
2024-01-30 01:36:47 +00:00
|
|
|
static void __zswap_pool_release(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool = container_of(work, typeof(*pool),
|
|
|
|
release_work);
|
|
|
|
|
|
|
|
synchronize_rcu();
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
/* nobody should have been able to get a ref... */
|
|
|
|
WARN_ON(!percpu_ref_is_zero(&pool->ref));
|
|
|
|
percpu_ref_exit(&pool->ref);
|
2024-01-30 01:36:47 +00:00
|
|
|
|
|
|
|
/* pool is now off zswap_pools list and has no references. */
|
|
|
|
zswap_pool_destroy(pool);
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_current(void);
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
static void __zswap_pool_empty(struct percpu_ref *ref)
|
2024-01-30 01:36:47 +00:00
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
pool = container_of(ref, typeof(*pool), ref);
|
2024-01-30 01:36:47 +00:00
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
spin_lock_bh(&zswap_pools_lock);
|
2024-01-30 01:36:47 +00:00
|
|
|
|
|
|
|
WARN_ON(pool == zswap_pool_current());
|
|
|
|
|
|
|
|
list_del_rcu(&pool->list);
|
|
|
|
|
|
|
|
INIT_WORK(&pool->release_work, __zswap_pool_release);
|
|
|
|
schedule_work(&pool->release_work);
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
spin_unlock_bh(&zswap_pools_lock);
|
2024-01-30 01:36:47 +00:00
|
|
|
}
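Putting the pieces above together, a rough summary of the pool lifetime (not new code, just the flow of zswap_pool_create(), __zswap_pool_empty(), __zswap_pool_release() and the percpu_ref_kill() issued from the param-switch path further down):

/*
 *   zswap_pool_create()
 *     percpu_ref_init(&pool->ref, __zswap_pool_empty)  - initial ref held
 *   zswap_pool_get() / zswap_pool_put()                - per-entry refs
 *   percpu_ref_kill(&pool->ref)                        - pool switched away
 *   ref count drops to zero
 *     __zswap_pool_empty()     - unlink from zswap_pools, queue release work
 *       __zswap_pool_release() - after synchronize_rcu(), no new lookups
 *         zswap_pool_destroy() - free acomp ctx, zpools and the pool itself
 */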
|
|
|
|
|
|
|
|
static int __must_check zswap_pool_get(struct zswap_pool *pool)
|
|
|
|
{
|
|
|
|
if (!pool)
|
|
|
|
return 0;
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
return percpu_ref_tryget(&pool->ref);
|
2024-01-30 01:36:47 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_pool_put(struct zswap_pool *pool)
|
|
|
|
{
|
2024-02-16 08:55:05 +00:00
|
|
|
percpu_ref_put(&pool->ref);
|
2024-01-30 01:36:47 +00:00
|
|
|
}
|
|
|
|
|
2024-01-30 01:36:48 +00:00
|
|
|
static struct zswap_pool *__zswap_pool_current(void)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
|
|
|
pool = list_first_or_null_rcu(&zswap_pools, typeof(*pool), list);
|
|
|
|
WARN_ONCE(!pool && zswap_has_pool,
|
|
|
|
"%s: no page storage pool!\n", __func__);
|
|
|
|
|
|
|
|
return pool;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_current(void)
|
|
|
|
{
|
|
|
|
assert_spin_locked(&zswap_pools_lock);
|
|
|
|
|
|
|
|
return __zswap_pool_current();
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct zswap_pool *zswap_pool_current_get(void)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
pool = __zswap_pool_current();
|
|
|
|
if (!zswap_pool_get(pool))
|
|
|
|
pool = NULL;
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return pool;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* type and compressor must be null-terminated */
|
|
|
|
static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
|
|
|
|
|
|
|
assert_spin_locked(&zswap_pools_lock);
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(pool, &zswap_pools, list) {
|
|
|
|
if (strcmp(pool->tfm_name, compressor))
|
|
|
|
continue;
|
|
|
|
/* all zpools share the same type */
|
|
|
|
if (strcmp(zpool_get_type(pool->zpools[0]), type))
|
|
|
|
continue;
|
|
|
|
/* if we can't get it, it's about to be destroyed */
|
|
|
|
if (!zswap_pool_get(pool))
|
|
|
|
continue;
|
|
|
|
return pool;
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
mm: zswap: optimize zswap pool size tracking
Profiling the munmap() of a zswapped memory region shows 60% of the total
cycles currently going into updating the zswap_pool_total_size.
There are three consumers of this counter:
- store, to enforce the globally configured pool limit
- meminfo & debugfs, to report the size to the user
- shrink, to determine the batch size for each cycle
Instead of aggregating every time an entry enters or exits the zswap
pool, aggregate the value from the zpools on-demand:
- Stores aggregate the counter anyway upon success. Aggregating to
check the limit instead is the same amount of work.
- Meminfo & debugfs might benefit somewhat from a pre-aggregated
counter, but aren't exactly hotpaths.
- Shrinking can aggregate once for every cycle instead of doing it for
every freed entry. As the shrinker might work on tens or hundreds of
objects per scan cycle, this is a large reduction in aggregations.
The paths that benefit dramatically are swapin, swapoff, and unmaps.
There could be millions of pages being processed until somebody asks for
the pool size again. This eliminates the pool size updates from those
paths entirely.
Top profile entries for a 24G range munmap(), before:
38.54% zswap-unmap [kernel.kallsyms] [k] zs_zpool_total_size
12.51% zswap-unmap [kernel.kallsyms] [k] zpool_get_total_size
9.10% zswap-unmap [kernel.kallsyms] [k] zswap_update_total_size
2.95% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
2.88% zswap-unmap [kernel.kallsyms] [k] __slab_free
2.86% zswap-unmap [kernel.kallsyms] [k] xas_store
and after:
7.70% zswap-unmap [kernel.kallsyms] [k] __slab_free
7.16% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
6.74% zswap-unmap [kernel.kallsyms] [k] xas_store
It was also briefly considered to move to a single atomic in zswap
that is updated by the backends, since zswap only cares about the sum
of all pools anyway. However, zram directly needs per-pool information
out of zsmalloc. To keep the backend from having to update two atomics
every time, I opted for the lazy aggregation instead for now.
Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-12 15:34:11 +00:00
|
|
|
static unsigned long zswap_max_pages(void)
|
|
|
|
{
|
|
|
|
return totalram_pages() * zswap_max_pool_percent / 100;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long zswap_accept_thr_pages(void)
|
|
|
|
{
|
|
|
|
return zswap_max_pages() * zswap_accept_thr_percent / 100;
|
|
|
|
}
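To make these thresholds concrete, a worked example with illustrative numbers (20% and 90% are the usual defaults for zswap_max_pool_percent and zswap_accept_thr_percent, but both are tunables):

/*
 * Example (illustrative numbers only): with totalram_pages() == 4194304
 * (16 GiB of 4 KiB pages), zswap_max_pool_percent == 20 and
 * zswap_accept_thr_percent == 90:
 *
 *   zswap_max_pages()        = 4194304 * 20 / 100 = 838860 pages (~3.2 GiB)
 *   zswap_accept_thr_pages() =  838860 * 90 / 100 = 754974 pages
 *
 * Once the pool reaches 838860 pages, zswap_check_limits() reports the pool
 * as full until usage falls back below 754974 pages.
 */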
|
|
|
|
|
|
|
|
unsigned long zswap_total_pages(void)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool;
|
2024-03-12 15:34:12 +00:00
|
|
|
unsigned long total = 0;
|
2024-03-12 15:34:11 +00:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(pool, &zswap_pools, list) {
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
|
2024-03-12 15:34:12 +00:00
|
|
|
total += zpool_get_total_pages(pool->zpools[i]);
|
2024-03-12 15:34:11 +00:00
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2024-03-12 15:34:12 +00:00
|
|
|
return total;
|
2024-03-12 15:34:11 +00:00
|
|
|
}
|
|
|
|
|
2024-04-13 02:24:05 +00:00
|
|
|
static bool zswap_check_limits(void)
|
|
|
|
{
|
|
|
|
unsigned long cur_pages = zswap_total_pages();
|
|
|
|
unsigned long max_pages = zswap_max_pages();
|
|
|
|
|
|
|
|
if (cur_pages >= max_pages) {
|
|
|
|
zswap_pool_limit_hit++;
|
|
|
|
zswap_pool_reached_full = true;
|
|
|
|
} else if (zswap_pool_reached_full &&
|
|
|
|
cur_pages <= zswap_accept_thr_pages()) {
|
|
|
|
zswap_pool_reached_full = false;
|
|
|
|
}
|
|
|
|
return zswap_pool_reached_full;
|
|
|
|
}
|
|
|
|
|
2024-01-30 01:36:49 +00:00
|
|
|
/*********************************
|
|
|
|
* param callbacks
|
|
|
|
**********************************/
|
|
|
|
|
|
|
|
static bool zswap_pool_changed(const char *s, const struct kernel_param *kp)
|
|
|
|
{
|
|
|
|
/* no change required */
|
|
|
|
if (!strcmp(s, *(char **)kp->arg) && zswap_has_pool)
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* val must be a null-terminated string */
|
|
|
|
static int __zswap_param_set(const char *val, const struct kernel_param *kp,
|
|
|
|
char *type, char *compressor)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool, *put_pool = NULL;
|
|
|
|
char *s = strstrip((char *)val);
|
|
|
|
int ret = 0;
|
|
|
|
bool new_pool = false;
|
|
|
|
|
|
|
|
mutex_lock(&zswap_init_lock);
|
|
|
|
switch (zswap_init_state) {
|
|
|
|
case ZSWAP_UNINIT:
|
|
|
|
/* if this is load-time (pre-init) param setting,
|
|
|
|
* don't create a pool; that's done during init.
|
|
|
|
*/
|
|
|
|
ret = param_set_charp(s, kp);
|
|
|
|
break;
|
|
|
|
case ZSWAP_INIT_SUCCEED:
|
|
|
|
new_pool = zswap_pool_changed(s, kp);
|
|
|
|
break;
|
|
|
|
case ZSWAP_INIT_FAILED:
|
|
|
|
pr_err("can't set param, initialization failed\n");
|
|
|
|
ret = -ENODEV;
|
|
|
|
}
|
|
|
|
mutex_unlock(&zswap_init_lock);
|
|
|
|
|
|
|
|
/* no need to create a new pool, return directly */
|
|
|
|
if (!new_pool)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (!type) {
|
|
|
|
if (!zpool_has_pool(s)) {
|
|
|
|
pr_err("zpool %s not available\n", s);
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
|
|
|
type = s;
|
|
|
|
} else if (!compressor) {
|
|
|
|
if (!crypto_has_acomp(s, 0, 0)) {
|
|
|
|
pr_err("compressor %s not available\n", s);
|
|
|
|
return -ENOENT;
|
|
|
|
}
|
|
|
|
compressor = s;
|
|
|
|
} else {
|
|
|
|
WARN_ON(1);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
spin_lock_bh(&zswap_pools_lock);
|
2024-01-30 01:36:49 +00:00
|
|
|
|
|
|
|
pool = zswap_pool_find_get(type, compressor);
|
|
|
|
if (pool) {
|
|
|
|
zswap_pool_debug("using existing", pool);
|
|
|
|
WARN_ON(pool == zswap_pool_current());
|
|
|
|
list_del_rcu(&pool->list);
|
|
|
|
}
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
spin_unlock_bh(&zswap_pools_lock);
|
2024-01-30 01:36:49 +00:00
|
|
|
|
|
|
|
if (!pool)
|
|
|
|
pool = zswap_pool_create(type, compressor);
|
2024-02-16 08:55:05 +00:00
|
|
|
else {
|
|
|
|
/*
|
|
|
|
* Restore the initial ref dropped by percpu_ref_kill()
|
|
|
|
* when the pool was decommissioned and switch it again
|
|
|
|
* to percpu mode.
|
|
|
|
*/
|
|
|
|
percpu_ref_resurrect(&pool->ref);
|
|
|
|
|
|
|
|
/* Drop the ref from zswap_pool_find_get(). */
|
|
|
|
zswap_pool_put(pool);
|
|
|
|
}
|
2024-01-30 01:36:49 +00:00
|
|
|
|
|
|
|
if (pool)
|
|
|
|
ret = param_set_charp(s, kp);
|
|
|
|
else
|
|
|
|
ret = -EINVAL;
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
spin_lock_bh(&zswap_pools_lock);
|
2024-01-30 01:36:49 +00:00
|
|
|
|
|
|
|
if (!ret) {
|
|
|
|
put_pool = zswap_pool_current();
|
|
|
|
list_add_rcu(&pool->list, &zswap_pools);
|
|
|
|
zswap_has_pool = true;
|
|
|
|
} else if (pool) {
|
|
|
|
/* add the possibly pre-existing pool to the end of the pools
|
|
|
|
* list; if it's new (and empty) then it'll be removed and
|
|
|
|
* destroyed by the put after we drop the lock
|
|
|
|
*/
|
|
|
|
list_add_tail_rcu(&pool->list, &zswap_pools);
|
|
|
|
put_pool = pool;
|
|
|
|
}
|
|
|
|
|
2024-02-16 08:55:05 +00:00
|
|
|
spin_unlock_bh(&zswap_pools_lock);
|
2024-01-30 01:36:49 +00:00
|
|
|
|
|
|
|
if (!zswap_has_pool && !pool) {
|
|
|
|
/* if initial pool creation failed, and this pool creation also
|
|
|
|
* failed, maybe both compressor and zpool params were bad.
|
|
|
|
* Allow changing this param, so pool creation will succeed
|
|
|
|
* when the other param is changed. We already verified this
|
|
|
|
* param is ok in the zpool_has_pool() or crypto_has_acomp()
|
|
|
|
* checks above.
|
|
|
|
*/
|
|
|
|
ret = param_set_charp(s, kp);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* drop the ref from either the old current pool,
|
|
|
|
* or the new pool we failed to add
|
|
|
|
*/
|
|
|
|
if (put_pool)
|
2024-02-16 08:55:05 +00:00
|
|
|
percpu_ref_kill(&put_pool->ref);
|
2024-01-30 01:36:49 +00:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int zswap_compressor_param_set(const char *val,
|
|
|
|
const struct kernel_param *kp)
|
|
|
|
{
|
|
|
|
return __zswap_param_set(val, kp, zswap_zpool_type, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int zswap_zpool_param_set(const char *val,
|
|
|
|
const struct kernel_param *kp)
|
|
|
|
{
|
|
|
|
return __zswap_param_set(val, kp, NULL, zswap_compressor);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int zswap_enabled_param_set(const char *val,
|
|
|
|
const struct kernel_param *kp)
|
|
|
|
{
|
|
|
|
int ret = -ENODEV;
|
|
|
|
|
|
|
|
/* if this is load-time (pre-init) param setting, only set param. */
|
|
|
|
if (system_state != SYSTEM_RUNNING)
|
|
|
|
return param_set_bool(val, kp);
|
|
|
|
|
|
|
|
mutex_lock(&zswap_init_lock);
|
|
|
|
switch (zswap_init_state) {
|
|
|
|
case ZSWAP_UNINIT:
|
|
|
|
if (zswap_setup())
|
|
|
|
break;
|
|
|
|
fallthrough;
|
|
|
|
case ZSWAP_INIT_SUCCEED:
|
|
|
|
if (!zswap_has_pool)
|
|
|
|
pr_err("can't enable, no pool configured\n");
|
|
|
|
else
|
|
|
|
ret = param_set_bool(val, kp);
|
|
|
|
break;
|
|
|
|
case ZSWAP_INIT_FAILED:
|
|
|
|
pr_err("can't enable, initialization failed\n");
|
|
|
|
}
|
|
|
|
mutex_unlock(&zswap_init_lock);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2024-01-30 01:36:50 +00:00
|
|
|
/*********************************
|
|
|
|
* lru functions
|
|
|
|
**********************************/
|
|
|
|
|
2023-11-30 19:40:20 +00:00
|
|
|
/* should be called under RCU */
|
|
|
|
#ifdef CONFIG_MEMCG
|
|
|
|
static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
return entry->objcg ? obj_cgroup_memcg(entry->objcg) : NULL;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static inline int entry_to_nid(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
return page_to_nid(virt_to_page(entry));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
|
|
|
|
{
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e. evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improve the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 19:40:23 +00:00
|
|
|
atomic_long_t *nr_zswap_protected;
|
|
|
|
unsigned long lru_size, old, new;
|
2023-11-30 19:40:20 +00:00
|
|
|
int nid = entry_to_nid(entry);
|
|
|
|
struct mem_cgroup *memcg;
|
2023-11-30 19:40:23 +00:00
|
|
|
struct lruvec *lruvec;
|
2023-11-30 19:40:20 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Note that it is safe to use rcu_read_lock() here, even in the face of
|
|
|
|
* concurrent memcg offlining. Thanks to the memcg->kmemcg_id indirection
|
|
|
|
* used in list_lru lookup, only two scenarios are possible:
|
|
|
|
*
|
|
|
|
* 1. list_lru_add() is called before memcg->kmemcg_id is updated. The
|
|
|
|
* new entry will be reparented to memcg's parent's list_lru.
|
|
|
|
* 2. list_lru_add() is called after memcg->kmemcg_id is updated. The
|
|
|
|
* new entry will be added directly to memcg's parent's list_lru.
|
|
|
|
*
|
2024-01-28 13:28:51 +00:00
|
|
|
* Similar reasoning holds for list_lru_del().
|
2023-11-30 19:40:20 +00:00
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
memcg = mem_cgroup_from_entry(entry);
|
|
|
|
/* will always succeed */
|
|
|
|
list_lru_add(list_lru, &entry->lru, nid, memcg);
|
2023-11-30 19:40:23 +00:00
|
|
|
|
|
|
|
/* Update the protection area */
|
|
|
|
lru_size = list_lru_count_one(list_lru, nid, memcg);
|
|
|
|
lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
|
|
|
|
nr_zswap_protected = &lruvec->zswap_lruvec_state.nr_zswap_protected;
|
|
|
|
old = atomic_long_inc_return(nr_zswap_protected);
|
|
|
|
/*
|
|
|
|
* Decay to avoid overflow and adapt to changing workloads.
|
|
|
|
* This is based on LRU reclaim cost decaying heuristics.
|
|
|
|
*/
|
|
|
|
do {
|
|
|
|
new = old > lru_size / 4 ? old / 2 : old;
|
|
|
|
} while (!atomic_long_try_cmpxchg(nr_zswap_protected, &old, new));
|
2023-11-30 19:40:20 +00:00
|
|
|
rcu_read_unlock();
|
|
|
|
}
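A worked example of the decay step above (illustrative numbers): if the per-memcg LRU currently holds lru_size = 1000 entries and nr_zswap_protected has grown to 300 after this increment, then 300 > 1000 / 4, so the protected count is halved to 150; at 200 or below it would be left unchanged. This keeps the "protected" estimate from outgrowing roughly a quarter of the LRU while still adapting as the workload changes.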
|
|
|
|
|
|
|
|
static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
int nid = entry_to_nid(entry);
|
|
|
|
struct mem_cgroup *memcg;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
memcg = mem_cgroup_from_entry(entry);
|
|
|
|
/* will always succeed */
|
|
|
|
list_lru_del(list_lru, &entry->lru, nid, memcg);
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
2024-01-30 01:36:51 +00:00
|
|
|
void zswap_lruvec_state_init(struct lruvec *lruvec)
|
|
|
|
{
|
|
|
|
atomic_long_set(&lruvec->zswap_lruvec_state.nr_zswap_protected, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void zswap_folio_swapin(struct folio *folio)
|
|
|
|
{
|
|
|
|
struct lruvec *lruvec;
|
|
|
|
|
|
|
|
if (folio) {
|
|
|
|
lruvec = folio_lruvec(folio);
|
|
|
|
atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
|
|
|
|
{
|
mm/zswap: global lru and shrinker shared by all zswap_pools
Patch series "mm/zswap: optimize for dynamic zswap_pools", v3.
Dynamic pool creation has been supported for a long time, but may not be
used much in practice. With the per-memcg lru merged, the current
structure of zswap_pool's lru and shrinker becomes less optimal.
In the current structure, each zswap_pool has its own lru, shrinker and
shrink_work, but only the latest zswap_pool is the one currently used.
1. When there is memory pressure, all shrinkers of zswap_pools will try to
shrink their lru lists; there is no order between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work will
try to shrink its own lru, which is inefficient.
A more natural way is to have a global zswap lru shared between all
zswap_pools, and likewise a shared shrinker. The code becomes much simpler too.
Another optimization is changing the zswap_pool kref to a percpu_ref, which
has a reference taken by every zswap entry. So the scalability is better.
Testing kernel build (32 threads) in tmpfs with memory.max=2GB. (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128 CPUs
x86-64 machine, below is the average of 5 runs)
mm-unstable zswap-global-lru
real 63.20 63.12
user 1061.75 1062.95
sys 268.74 264.44
This patch (of 3):
Dynamic zswap_pool creation may create/reuse multiple zswap_pools
in a list; only the first one is currently used.
Each zswap_pool has its own lru and shrinker, which is unnecessary and
has its problems:
1. When there is memory pressure, all shrinkers of zswap_pools will
try to shrink their own lru; there is no order between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
will try to shrink its lru list. The rationale here was to
try and empty the old pool first so that we can completely
drop it. However, since we only support exclusive loads now,
the LRU ordering should be entirely decided by the order of
stores, so the oldest entries on the LRU will naturally be
from the oldest pool.
Anyway, having a global lru and shrinker shared by all zswap_pools is
better and more efficient.
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-0-200495333595@bytedance.com
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-1-200495333595@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-16 08:55:04 +00:00
|
|
|
/* lock out zswap shrinker walking memcg tree */
|
2024-03-05 07:53:45 +00:00
|
|
|
spin_lock(&zswap_shrink_lock);
|
|
|
|
if (zswap_next_shrink == memcg)
|
|
|
|
zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
|
|
|
|
spin_unlock(&zswap_shrink_lock);
|
2024-01-30 01:36:51 +00:00
|
|
|
}
|
|
|
|
|
2024-01-30 01:36:52 +00:00
|
|
|
/*********************************
|
|
|
|
* zswap entry functions
|
|
|
|
**********************************/
|
|
|
|
static struct kmem_cache *zswap_entry_cache;
|
|
|
|
|
|
|
|
static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
|
|
|
|
{
|
|
|
|
struct zswap_entry *entry;
|
|
|
|
entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
|
|
|
|
if (!entry)
|
|
|
|
return NULL;
|
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void zswap_entry_cache_free(struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
kmem_cache_free(zswap_entry_cache, entry);
|
|
|
|
}
|
|
|
|
|
2023-06-20 19:46:44 +00:00
|
|
|
static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
|
|
|
|
{
|
2024-03-11 23:52:10 +00:00
|
|
|
return entry->pool->zpools[hash_ptr(entry, ilog2(ZSWAP_NR_ZPOOLS))];
|
2023-06-20 19:46:44 +00:00
|
|
|
}
|
|
|
|
|
2013-11-12 23:08:27 +00:00
|
|
|
/*
|
2014-08-06 23:08:40 +00:00
|
|
|
* Carries out the common pattern of freeing an entry's zpool allocation,
|
2013-11-12 23:08:27 +00:00
|
|
|
* freeing the entry itself, and decrementing the number of stored pages.
|
|
|
|
*/
|
2024-01-30 01:36:37 +00:00
|
|
|
static void zswap_entry_free(struct zswap_entry *entry)
|
2013-11-12 23:08:27 +00:00
|
|
|
{
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. the contents of the page are all the same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify a
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and allocation of memory in
the zpool. In zswap_frontswap_load(), check if the value of zswap_entry.length
corresponding to the page to be loaded is zero. If zswap_entry.length
is zero, fill the page with the same-filled value. This saves the
decompression time during load.
On an ARM Quad Core 32-bit device with 1.5GB RAM, by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages (including zero-filled pages) in zswap are
same-value filled pages and 14% of pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On an Ubuntu PC with 2GB RAM, while executing a kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap, 78000 (~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000 (~17%) were zero-filled pages. So
an average of 5% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 00:15:59 +00:00
|
|
|
if (!entry->length)
|
|
|
|
atomic_dec(&zswap_same_filled_pages);
|
|
|
|
else {
|
2024-03-05 07:53:45 +00:00
|
|
|
zswap_lru_del(&zswap_list_lru, entry);
|
2023-06-20 19:46:44 +00:00
|
|
|
zpool_free(zswap_find_zpool(entry), entry->handle);
|
2018-02-01 00:15:59 +00:00
|
|
|
zswap_pool_put(entry->pool);
|
|
|
|
}
|
2024-01-30 01:34:38 +00:00
|
|
|
if (entry->objcg) {
|
|
|
|
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
|
|
|
|
obj_cgroup_put(entry->objcg);
|
|
|
|
}
|
2013-11-12 23:08:27 +00:00
|
|
|
zswap_entry_cache_free(entry);
|
|
|
|
atomic_dec(&zswap_stored_pages);
|
|
|
|
}
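The !entry->length case freed above corresponds to the same-filled handling described in the earlier commit message: such entries never get a zpool allocation, only a remembered word value. A minimal sketch of the store-time check, assuming the helper name and exact form (the real check sits on the store path, not here):

static bool zswap_is_folio_same_filled(void *ptr, unsigned long *value)
{
	unsigned long *data = ptr;
	unsigned int pos, last = PAGE_SIZE / sizeof(*data) - 1;

	/* cheap early reject: first and last words must already match */
	if (data[0] != data[last])
		return false;

	for (pos = 1; pos < last; pos++) {
		if (data[pos] != data[0])
			return false;
	}
	*value = data[0];	/* remembered in the entry; entry->length stays 0 */
	return true;
}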
|
|
|
|
|
2024-01-30 01:36:53 +00:00
|
|
|
/*********************************
|
|
|
|
* compressed storage functions
|
|
|
|
**********************************/
|
2024-01-30 01:36:54 +00:00
|
|
|
static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
|
|
|
|
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
|
|
|
|
struct crypto_acomp *acomp;
|
|
|
|
struct acomp_req *req;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
mutex_init(&acomp_ctx->mutex);
|
|
|
|
|
|
|
|
acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
|
|
|
|
if (!acomp_ctx->buffer)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
|
|
|
|
if (IS_ERR(acomp)) {
|
|
|
|
pr_err("could not alloc crypto acomp %s : %ld\n",
|
|
|
|
pool->tfm_name, PTR_ERR(acomp));
|
|
|
|
ret = PTR_ERR(acomp);
|
|
|
|
goto acomp_fail;
|
|
|
|
}
|
|
|
|
acomp_ctx->acomp = acomp;
|
2024-02-22 08:11:35 +00:00
|
|
|
acomp_ctx->is_sleepable = acomp_is_async(acomp);
|
2024-01-30 01:36:54 +00:00
|
|
|
|
|
|
|
req = acomp_request_alloc(acomp_ctx->acomp);
|
|
|
|
if (!req) {
|
|
|
|
pr_err("could not alloc crypto acomp_request %s\n",
|
|
|
|
pool->tfm_name);
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto req_fail;
|
|
|
|
}
|
|
|
|
acomp_ctx->req = req;
|
|
|
|
|
|
|
|
crypto_init_wait(&acomp_ctx->wait);
|
|
|
|
/*
|
|
|
|
* if the backend of acomp is an async zip, crypto_req_done() will wake up
|
|
|
|
* crypto_wait_req(); if the backend of acomp is scomp, the callback
|
|
|
|
* won't be called, and crypto_wait_req() will return without blocking.
|
|
|
|
*/
|
|
|
|
acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
|
|
|
|
crypto_req_done, &acomp_ctx->wait);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
req_fail:
|
|
|
|
crypto_free_acomp(acomp_ctx->acomp);
|
|
|
|
acomp_fail:
|
|
|
|
kfree(acomp_ctx->buffer);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
|
|
|
|
{
|
|
|
|
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
|
|
|
|
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
|
|
|
|
|
|
|
|
if (!IS_ERR_OR_NULL(acomp_ctx)) {
|
|
|
|
if (!IS_ERR_OR_NULL(acomp_ctx->req))
|
|
|
|
acomp_request_free(acomp_ctx->req);
|
|
|
|
if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
|
|
|
|
crypto_free_acomp(acomp_ctx->acomp);
|
|
|
|
kfree(acomp_ctx->buffer);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2024-01-30 01:36:53 +00:00
|
|
|
static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
|
|
|
|
{
|
|
|
|
struct crypto_acomp_ctx *acomp_ctx;
|
|
|
|
struct scatterlist input, output;
|
2024-02-19 21:19:35 +00:00
|
|
|
int comp_ret = 0, alloc_ret = 0;
|
2024-01-30 01:36:53 +00:00
|
|
|
unsigned int dlen = PAGE_SIZE;
|
|
|
|
unsigned long handle;
|
|
|
|
struct zpool *zpool;
|
|
|
|
char *buf;
|
|
|
|
gfp_t gfp;
|
|
|
|
u8 *dst;
|
|
|
|
|
|
|
|
acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
|
|
|
|
|
|
|
|
mutex_lock(&acomp_ctx->mutex);
|
|
|
|
|
|
|
|
dst = acomp_ctx->buffer;
|
|
|
|
sg_init_table(&input, 1);
|
|
|
|
sg_set_page(&input, &folio->page, PAGE_SIZE, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need PAGE_SIZE * 2 here since there may be an over-compression case,
|
|
|
|
* and hardware accelerators may not check the dst buffer size, so
|
|
|
|
* give the dst buffer enough length to avoid buffer overflow.
|
|
|
|
*/
|
|
|
|
sg_init_one(&output, dst, PAGE_SIZE * 2);
|
|
|
|
acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It may look a little silly that we send an asynchronous request,
|
|
|
|
* then wait for its completion synchronously. This makes the process look
|
|
|
|
* synchronous in fact.
|
|
|
|
* Theoretically, acomp lets users send multiple acomp requests in one
|
|
|
|
* acomp instance and have those requests completed simultaneously. But in this
|
|
|
|
* case, zswap actually stores and loads page by page, and there is no
|
|
|
|
* existing method to send the second page before the first page is done
|
|
|
|
* in one thread doing zswap.
|
|
|
|
* But in different threads running on different CPUs, we have different
|
|
|
|
* acomp instances, so multiple threads can do (de)compression in parallel.
|
|
|
|
*/
|
2024-02-19 21:19:35 +00:00
|
|
|
comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
|
2024-01-30 01:36:53 +00:00
|
|
|
dlen = acomp_ctx->req->dlen;
|
2024-02-19 21:19:35 +00:00
|
|
|
if (comp_ret)
|
2024-01-30 01:36:53 +00:00
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
zpool = zswap_find_zpool(entry);
|
|
|
|
gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
|
|
|
|
if (zpool_malloc_support_movable(zpool))
|
|
|
|
gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
|
2024-02-19 21:19:35 +00:00
|
|
|
alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
|
|
|
|
if (alloc_ret)
|
2024-01-30 01:36:53 +00:00
|
|
|
goto unlock;
|
|
|
|
|
|
|
|
buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
|
|
|
|
memcpy(buf, dst, dlen);
|
|
|
|
zpool_unmap_handle(zpool, handle);
|
|
|
|
|
|
|
|
entry->handle = handle;
|
|
|
|
entry->length = dlen;
|
|
|
|
|
|
|
|
unlock:
|
2024-02-19 21:19:35 +00:00
|
|
|
if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
|
|
|
|
zswap_reject_compress_poor++;
|
|
|
|
else if (comp_ret)
|
|
|
|
zswap_reject_compress_fail++;
|
|
|
|
else if (alloc_ret)
|
|
|
|
zswap_reject_alloc_fail++;
|
|
|
|
|
2024-01-30 01:36:53 +00:00
|
|
|
mutex_unlock(&acomp_ctx->mutex);
|
2024-02-19 21:19:35 +00:00
|
|
|
return comp_ret == 0 && alloc_ret == 0;
|
2024-01-30 01:36:53 +00:00
|
|
|
}
static void zswap_decompress(struct zswap_entry *entry, struct page *page)
{
	struct zpool *zpool = zswap_find_zpool(entry);
	struct scatterlist input, output;
	struct crypto_acomp_ctx *acomp_ctx;
	u8 *src;

	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
	mutex_lock(&acomp_ctx->mutex);

	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
	/*
	 * If zpool_map_handle() is atomic, we cannot reliably use its mapped
	 * buffer for crypto_acomp_decompress(), which might sleep. In that
	 * case we must copy the buffer to a temporary one.
	 * Meanwhile, zpool_map_handle() might return a non-linearly mapped
	 * buffer, such as a kmap address of high memory or even a vmap
	 * address. However, sg_init_one() is only equipped to handle linearly
	 * mapped low memory. In that case we must also copy the buffer to a
	 * temporary, lowmem one.
	 */
	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
	    !virt_addr_valid(src)) {
		memcpy(acomp_ctx->buffer, src, entry->length);
		src = acomp_ctx->buffer;
		zpool_unmap_handle(zpool, entry->handle);
	}

	sg_init_one(&input, src, entry->length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
	mutex_unlock(&acomp_ctx->mutex);

	if (src != acomp_ctx->buffer)
		zpool_unmap_handle(zpool, entry->handle);
}

/*********************************
* writeback code
**********************************/
/*
 * Attempts to free an entry by adding a folio to the swap cache,
 * decompressing the entry data into the folio, and issuing a
 * bio write to write the folio back to the swap device.
 *
 * This can be thought of as a "resumed writeback" of the folio
 * to the swap device. We are basically resuming the same swap
 * writeback path that was intercepted by zswap_store() in the
 * first place. After the folio has been decompressed into the
 * swap cache, the compressed version stored by zswap can be
 * freed.
 */
static int zswap_writeback_entry(struct zswap_entry *entry,
				 swp_entry_t swpentry)
{
	struct xarray *tree;
	pgoff_t offset = swp_offset(swpentry);
	struct folio *folio;
	struct mempolicy *mpol;
	bool folio_was_allocated;
	struct writeback_control wbc = {
		.sync_mode = WB_SYNC_NONE,
	};

	/* try to allocate swap cache folio */
	mpol = get_task_policy(current);
	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
	if (!folio)
		return -ENOMEM;

	/*
	 * Found an existing folio, so we raced with swapin or a concurrent
	 * shrinker. We generally write back cold folios from zswap, and
	 * swapin means the folio just became hot, so skip this folio.
	 * In the unlikely concurrent shrinker case, it will be unlinked
	 * and freed when invalidated by the concurrent shrinker anyway.
	 */
	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;
	}

	/*
	 * folio is locked, and the swapcache is now secured against
	 * concurrent swapping to and from the slot, and concurrent
	 * swapoff, so we can safely dereference the zswap tree here.
	 * Verify that the swap entry hasn't been invalidated and recycled
	 * behind our backs, to avoid overwriting a new swap folio with
	 * old compressed data. Only when this is successful can the entry
	 * be dereferenced.
	 */
	tree = swap_zswap_tree(swpentry);
	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
		delete_from_swap_cache(folio);
		folio_unlock(folio);
		folio_put(folio);
		return -ENOMEM;
	}

	zswap_decompress(entry, &folio->page);

	count_vm_event(ZSWPWB);
	if (entry->objcg)
		count_objcg_event(entry->objcg, ZSWPWB);

	zswap_entry_free(entry);

	/* folio is up to date */
	folio_mark_uptodate(folio);

	/* move it to the tail of the inactive list after end_writeback */
	folio_set_reclaim(folio);

	/* start writeback */
	__swap_writepage(folio, &wbc);
	folio_put(folio);

	return 0;
}

/*********************************
* shrinker functions
**********************************/
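/*
 * list_lru walk callback for the zswap shrinker: rotate the entry, drop the
 * LRU lock, and try to write the entry back to the swap device.
 */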
static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
				       spinlock_t *lock, void *arg)
{
	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
	bool *encountered_page_in_swapcache = (bool *)arg;
	swp_entry_t swpentry;
	enum lru_status ret = LRU_REMOVED_RETRY;
	int writeback_result;

	/*
	 * As soon as we drop the LRU lock, the entry can be freed by
	 * a concurrent invalidation. This means the following:
	 *
	 * 1. We extract the swp_entry_t to the stack, allowing
	 *    zswap_writeback_entry() to pin the swap entry and
	 *    then validate the zswap entry against that swap entry's
	 *    tree using pointer value comparison. Only when that
	 *    is successful can the entry be dereferenced.
	 *
	 * 2. Usually, objects are taken off the LRU for reclaim. In
	 *    this case this isn't possible, because if reclaim fails
	 *    for whatever reason, we have no means of knowing if the
	 *    entry is alive to put it back on the LRU.
	 *
	 *    So rotate it before dropping the lock. If the entry is
	 *    written back or invalidated, the free path will unlink
	 *    it. For failures, rotation is the right thing as well.
	 *
	 *    Temporary failures, where the same entry should be tried
	 *    again immediately, almost never happen for this shrinker.
	 *    We don't do any trylocking; -ENOMEM comes closest,
	 *    but that's extremely rare and doesn't happen spuriously
	 *    either. Don't bother distinguishing this case.
	 */
	list_move_tail(item, &l->list);

	/*
	 * Once the lru lock is dropped, the entry might get freed. The
	 * swpentry is copied to the stack, and entry isn't deref'd again
	 * until the entry is verified to still be alive in the tree.
	 */
	swpentry = entry->swpentry;

	/*
	 * It's safe to drop the lock here because we return either
	 * LRU_REMOVED_RETRY or LRU_RETRY.
	 */
	spin_unlock(lock);

	writeback_result = zswap_writeback_entry(entry, swpentry);

	if (writeback_result) {
		zswap_reject_reclaim_fail++;
		ret = LRU_RETRY;

		/*
		 * Encountering a page already in the swap cache is a sign that
		 * we are shrinking into the warmer region. We should terminate
		 * shrinking (if we're in the dynamic shrinker context).
		 */
		if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
			ret = LRU_STOP;
			*encountered_page_in_swapcache = true;
		}
	} else {
		zswap_written_back_pages++;
	}

	spin_lock(lock);
	return ret;
}
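
/*
 * The zswap shrinker's scan callback: walk the zswap LRU for this memcg and
 * node and write entries back, unless the shrinker is disabled, writeback is
 * disabled for the memcg, or we would shrink into the protected region.
 */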
static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
		struct shrink_control *sc)
{
	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
	unsigned long shrink_ret, nr_protected, lru_size;
	bool encountered_page_in_swapcache = false;

	if (!zswap_shrinker_enabled ||
	    !mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
		sc->nr_scanned = 0;
		return SHRINK_STOP;
	}

	nr_protected =
		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);

	/*
	 * Abort if we are shrinking into the protected region.
	 *
	 * This short-circuiting is necessary because if we have too many
	 * concurrent reclaimers getting the freeable zswap object counts at
	 * the same time (before any of them has made reasonable progress),
	 * the total number of reclaimed objects might be more than the number
	 * of unprotected objects (i.e. the reclaimers will reclaim into the
	 * protected area of the zswap LRU).
	 */
	if (nr_protected >= lru_size - sc->nr_to_scan) {
		sc->nr_scanned = 0;
		return SHRINK_STOP;
	}

	shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
		&encountered_page_in_swapcache);

	if (encountered_page_in_swapcache)
		return SHRINK_STOP;

	return shrink_ret ? shrink_ret : SHRINK_STOP;
}
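
/*
 * The zswap shrinker's count callback: estimate how many zswap objects are
 * freeable for this memcg and node, based on the backing size and the number
 * of stored pages.
 */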
|
|
|
|
|
|
|
|
static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
|
|
|
|
struct shrink_control *sc)
|
|
|
|
{
|
|
|
|
struct mem_cgroup *memcg = sc->memcg;
|
|
|
|
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
|
|
|
|
unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
|
|
|
|
|
zswap: memcontrol: implement zswap writeback disabling
During our experiment with zswap, we sometimes observe swap IOs due to
occasional zswap store failures and writebacks-to-swap. These swapping
IOs prevent many users who cannot tolerate swapping from adopting zswap to
save memory and improve performance where possible.
This patch adds the option to disable this behavior entirely: do not
writeback to backing swapping device when a zswap store attempt fail, and
do not write pages in the zswap pool back to the backing swap device (both
when the pool is full, and when the new zswap shrinker is called).
This new behavior can be opted-in/out on a per-cgroup basis via a new
cgroup file. By default, writebacks to swap device is enabled, which is
the previous behavior. Initially, writeback is enabled for the root
cgroup, and a newly created cgroup will inherit the current setting of its
parent.
Note that this is subtly different from setting memory.swap.max to 0, as
it still allows for pages to be stored in the zswap pool (which itself
consumes swap space in its current form).
This patch should be applied on top of the zswap shrinker series:
https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
as it also disables the zswap shrinker, a major source of zswap
writebacks.
For the most part, this feature is motivated by internal parties who
have already established their opinions regarding swapping - the
workloads that are highly sensitive to IO, and especially those who are
using servers with really slow disk performance (for instance, massive
but slow HDDs). For these folks, it's impossible to convince them to
even entertain zswap if swapping also comes as a packaged deal.
Writeback disabling is quite a useful feature in these situations - on
a mixed workloads deployment, they can disable writeback for the more
IO-sensitive workloads, and enable writeback for other background
workloads.
For instance, on a server with HDD, I allocate memories and populate
them with random values (so that zswap store will always fail), and
specify memory.high low enough to trigger reclaim. The time it takes
to allocate the memories and just read through it a couple of times
(doing silly things like computing the values' average etc.):
zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out
zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out
(the last two lines are from vmstat -s).
[nphamcs@gmail.com: add a comment about recurring zswap store failures leading to reclaim inefficiency]
Link: https://lkml.kernel.org/r/20231221005725.3446672-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231207192406.3809579-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Heidelberg <david@ixit.cz>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-07 19:24:06 +00:00
|
|
|
if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
|
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is
hit. This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed ahead
of time, as this depends on the workload (specifically, on factors such as
memory access patterns and compressibility of the memory pages).
This patch implements a memcg- and NUMA-aware shrinker for zswap, that is
initiated when there is memory pressure. The shrinker does not have any
parameter that must be tuned by the user, and can be opted in or out on a
per-memcg basis.
Furthermore, to make it more robust for many workloads and prevent
overshrinking (i.e evicting warm pages that might be refaulted into
memory), we build in the following heuristics:
* Estimate the number of warm pages residing in zswap, and attempt to
protect this region of the zswap LRU.
* Scale the number of freeable objects by an estimate of the memory
saving factor. The better zswap compresses the data, the fewer pages
we will evict to swap (as we will otherwise incur IO for relatively
small memory saving).
* During reclaim, if the shrinker encounters a page that is also being
brought into memory, the shrinker will cautiously terminate its
shrinking action, as this is a sign that it is touching the warmer
region of the zswap LRU.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
[nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing]
Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-30 19:40:23 +00:00
|
|
|
return 0;
|
|
|
|
|
2024-03-21 18:25:32 +00:00
|
|
|
/*
|
|
|
|
* The shrinker resumes swap writeback, which will enter block
|
|
|
|
* and may enter fs. XXX: Harmonize with vmscan.c __GFP_FS
|
|
|
|
* rules (may_enter_fs()), which apply on a per-folio basis.
|
|
|
|
*/
|
|
|
|
if (!gfp_has_io_fs(sc->gfp_mask))
|
|
|
|
return 0;
|
|
|
|
|
2024-04-18 12:26:28 +00:00
|
|
|
/*
|
|
|
|
* For memcg, use the cgroup-wide ZSWAP stats since we don't
|
|
|
|
* have them per-node and thus per-lruvec. Careful if memcg is
|
|
|
|
* runtime-disabled: we can get sc->memcg == NULL, which is ok
|
|
|
|
* for the lruvec, but not for memcg_page_state().
|
|
|
|
*
|
|
|
|
* Without memcg, use the zswap pool-wide metrics.
|
|
|
|
*/
|
|
|
|
if (!mem_cgroup_disabled()) {
|
|
|
|
mem_cgroup_flush_stats(memcg);
|
|
|
|
nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
|
|
|
|
nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
|
|
|
|
} else {
|
mm: zswap: optimize zswap pool size tracking
Profiling the munmap() of a zswapped memory region shows 60% of the total
cycles currently going into updating the zswap_pool_total_size.
There are three consumers of this counter:
- store, to enforce the globally configured pool limit
- meminfo & debugfs, to report the size to the user
- shrink, to determine the batch size for each cycle
Instead of aggregating every time an entry enters or exits the zswap
pool, aggregate the value from the zpools on-demand:
- Stores aggregate the counter anyway upon success. Aggregating to
check the limit instead is the same amount of work.
- Meminfo & debugfs might benefit somewhat from a pre-aggregated
counter, but aren't exactly hotpaths.
- Shrinking can aggregate once for every cycle instead of doing it for
every freed entry. As the shrinker might work on tens or hundreds of
objects per scan cycle, this is a large reduction in aggregations.
The paths that benefit dramatically are swapin, swapoff, and unmaps.
There could be millions of pages being processed until somebody asks for
the pool size again. This eliminates the pool size updates from those
paths entirely.
Top profile entries for a 24G range munmap(), before:
38.54% zswap-unmap [kernel.kallsyms] [k] zs_zpool_total_size
12.51% zswap-unmap [kernel.kallsyms] [k] zpool_get_total_size
9.10% zswap-unmap [kernel.kallsyms] [k] zswap_update_total_size
2.95% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
2.88% zswap-unmap [kernel.kallsyms] [k] __slab_free
2.86% zswap-unmap [kernel.kallsyms] [k] xas_store
and after:
7.70% zswap-unmap [kernel.kallsyms] [k] __slab_free
7.16% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
6.74% zswap-unmap [kernel.kallsyms] [k] xas_store
It was also briefly considered to move to a single atomic in zswap
that is updated by the backends, since zswap only cares about the sum
of all pools anyway. However, zram directly needs per-pool information
out of zsmalloc. To keep the backend from having to update two atomics
every time, I opted for the lazy aggregation instead for now.
Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-12 15:34:11 +00:00
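The on-demand aggregation described above amounts to summing the zpool sizes at the point of use instead of maintaining a counter on every store and free. A minimal standalone sketch of that idea (illustrative only; the real zswap_total_pages() walks the zswap_pools list under RCU and queries each zpool, neither of which is shown in this excerpt):

/*
 * Standalone illustration (not the zswap.c implementation) of lazy,
 * on-demand aggregation: walk the pools and sum their sizes only when
 * the total is actually needed.
 */
#include <stdio.h>
#include <stddef.h>

struct pool_sketch {
	unsigned long nr_pages;		/* stand-in for a zpool's page count */
	struct pool_sketch *next;
};

static unsigned long total_pages_sketch(const struct pool_sketch *pools)
{
	unsigned long total = 0;

	for (; pools; pools = pools->next)
		total += pools->nr_pages;
	return total;
}

int main(void)
{
	struct pool_sketch old = { .nr_pages = 1024, .next = NULL };
	struct pool_sketch cur = { .nr_pages = 4096, .next = &old };

	printf("%lu\n", total_pages_sketch(&cur));	/* prints 5120 */
	return 0;
}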
		nr_backing = zswap_total_pages();
		nr_stored = atomic_read(&zswap_stored_pages);
	}

	if (!nr_stored)
		return 0;

	nr_protected =
		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
	nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc);
	/*
	 * Subtract the lru size by an estimate of the number of pages
	 * that should be protected.
	 */
	nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;

	/*
	 * Scale the number of freeable pages by the memory saving factor.
	 * This ensures that the better zswap compresses memory, the fewer
	 * pages we will evict to swap (as it will otherwise incur IO for
	 * relatively small memory saving).
	 *
	 * The memory saving factor calculated here takes same-filled pages into
	 * account, but those are not freeable since they almost occupy no
	 * space. Hence, we may scale nr_freeable down a little bit more than we
	 * should if we have a lot of same-filled pages.
	 */
	return mult_frac(nr_freeable, nr_backing, nr_stored);
}

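To make the scaling above concrete, here is a small standalone sketch (plain C, not kernel code) of the integer math mult_frac() performs, with a worked example:

/*
 * Standalone illustration of the scaling above:
 * mult_frac(nr_freeable, nr_backing, nr_stored) computes
 * nr_freeable * nr_backing / nr_stored without overflowing the
 * intermediate product, i.e. it discounts the LRU size by the
 * observed compression ratio.
 */
#include <stdio.h>

static unsigned long scale_freeable(unsigned long nr_freeable,
				    unsigned long nr_backing,
				    unsigned long nr_stored)
{
	unsigned long quot = nr_freeable / nr_stored;
	unsigned long rem = nr_freeable % nr_stored;

	return quot * nr_backing + (rem * nr_backing) / nr_stored;
}

int main(void)
{
	/*
	 * 3000 stored pages backed by 1000 pages of pool memory (3:1 ratio):
	 * only a third of the 3000 freeable LRU entries are worth evicting.
	 */
	printf("%lu\n", scale_freeable(3000, 1000, 3000));	/* prints 1000 */
	return 0;
}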
mm/zswap: global lru and shrinker shared by all zswap_pools
Patch series "mm/zswap: optimize for dynamic zswap_pools", v3.
Dynamic pool creation has been supported for a long time, though it is
probably not used much in practice. But with the per-memcg lru merged, the
current structure of per-zswap_pool lrus and shrinkers has become less
optimal.
In the current structure, each zswap_pool has its own lru, shrinker and
shrink_work, but only the latest zswap_pool is the one currently in use.
1. When memory is under pressure, the shrinkers of all zswap_pools will
try to shrink their own lru lists, with no ordering between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
will try to shrink its own lru, which is inefficient.
A more natural way is to have a single global zswap lru shared by all
zswap_pools, and likewise a single shrinker. The code becomes much simpler
too.
Another optimization is changing the zswap_pool kref to a percpu_ref,
which is referenced by every zswap entry, improving scalability.
Testing a kernel build (32 threads) in tmpfs with memory.max=2GB (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128-CPU
x86-64 machine; below is the average of 5 runs):
mm-unstable zswap-global-lru
real 63.20 63.12
user 1061.75 1062.95
sys 268.74 264.44
This patch (of 3):
Dynamic zswap_pool creation can leave multiple zswap_pools on a list, of
which only the first is currently used.
Each zswap_pool has its own lru and shrinker, which is unnecessary and
problematic:
1. When memory is under pressure, the shrinkers of all zswap_pools will
try to shrink their own lrus, with no ordering between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
will try to shrink its lru list. The rationale here was to
try and empty the old pool first so that we can completely
drop it. However, since we only support exclusive loads now,
the LRU ordering should be entirely decided by the order of
stores, so the oldest entries on the LRU will naturally be
from the oldest pool.
In any case, having a global lru and shrinker shared by all zswap_pools is
simpler and more efficient.
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-0-200495333595@bytedance.com
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-1-200495333595@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-16 08:55:04 +00:00
static struct shrinker *zswap_alloc_shrinker(void)
{
	struct shrinker *shrinker;

	shrinker =
		shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap");
	if (!shrinker)
		return NULL;

	shrinker->scan_objects = zswap_shrinker_scan;
	shrinker->count_objects = zswap_shrinker_count;
	shrinker->batch = 0;
	shrinker->seeks = DEFAULT_SEEKS;
	return shrinker;
}

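The function above only allocates and populates the shrinker; registration happens elsewhere in zswap setup, outside this excerpt. A hedged sketch of that wiring, assuming a zswap_shrinker global and the shrinker_register() API:

/*
 * Illustrative sketch only: how the shrinker allocated above is typically
 * wired up during zswap initialization. The actual call site is outside
 * this excerpt; the zswap_shrinker variable here is an assumption.
 */
static struct shrinker *zswap_shrinker;

static int __init zswap_shrinker_setup_sketch(void)
{
	zswap_shrinker = zswap_alloc_shrinker();
	if (!zswap_shrinker)
		return -ENOMEM;

	/* Publish the shrinker to reclaim once its callbacks are set up. */
	shrinker_register(zswap_shrinker);
	return 0;
}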
static int shrink_memcg(struct mem_cgroup *memcg)
{
	int nid, shrunk = 0;

zswap: memcontrol: implement zswap writeback disabling
During our experiment with zswap, we sometimes observe swap IOs due to
occasional zswap store failures and writebacks-to-swap. These swapping
IOs prevent many users who cannot tolerate swapping from adopting zswap to
save memory and improve performance where possible.
This patch adds the option to disable this behavior entirely: do not
write back to the backing swap device when a zswap store attempt fails, and
do not write pages in the zswap pool back to the backing swap device (both
when the pool is full, and when the new zswap shrinker is called).
This new behavior can be opted-in/out on a per-cgroup basis via a new
cgroup file. By default, writeback to the swap device is enabled, which is
the previous behavior. Initially, writeback is enabled for the root
cgroup, and a newly created cgroup will inherit the current setting of its
parent.
Note that this is subtly different from setting memory.swap.max to 0, as
it still allows for pages to be stored in the zswap pool (which itself
consumes swap space in its current form).
This patch should be applied on top of the zswap shrinker series:
https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
as it also disables the zswap shrinker, a major source of zswap
writebacks.
For the most part, this feature is motivated by internal parties who
have already established their opinions regarding swapping - the
workloads that are highly sensitive to IO, and especially those who are
using servers with really slow disk performance (for instance, massive
but slow HDDs). For these folks, it's impossible to convince them to
even entertain zswap if swapping also comes as a packaged deal.
Writeback disabling is quite a useful feature in these situations - on
a mixed-workload deployment, they can disable writeback for the more
IO-sensitive workloads, and enable writeback for other background
workloads.
For instance, on a server with an HDD, I allocate memory and populate
it with random values (so that zswap store will always fail), and
specify memory.high low enough to trigger reclaim. The time it takes
to allocate the memory and read through it a couple of times
(doing silly things like computing the values' average etc.):
zswap.writeback disabled:
real 0m30.537s
user 0m23.687s
sys 0m6.637s
0 pages swapped in
0 pages swapped out
zswap.writeback enabled:
real 0m45.061s
user 0m24.310s
sys 0m8.892s
712686 pages swapped in
461093 pages swapped out
(the last two lines are from vmstat -s).
[nphamcs@gmail.com: add a comment about recurring zswap store failures leading to reclaim inefficiency]
Link: https://lkml.kernel.org/r/20231221005725.3446672-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231207192406.3809579-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Heidelberg <david@ixit.cz>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-07 19:24:06 +00:00
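The check that follows gates reclaim on a per-memcg writeback flag with the inheritance semantics described above. A simplified, standalone sketch of those semantics (illustrative only; the real mem_cgroup_zswap_writeback_enabled() lives in the memcg code and differs in detail):

#include <stdbool.h>

/* Illustrative stand-in for the relevant memcg state; not the kernel struct. */
struct memcg_sketch {
	bool zswap_writeback;	/* inherited from the parent cgroup at creation */
};

/*
 * Sketch of the semantics: with no cgroup the feature stays enabled,
 * otherwise the per-cgroup flag decides.
 */
static bool zswap_writeback_enabled_sketch(const struct memcg_sketch *memcg)
{
	return !memcg || memcg->zswap_writeback;
}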
	if (!mem_cgroup_zswap_writeback_enabled(memcg))
		return -EINVAL;

	/*
	 * Skip zombies because their LRUs are reparented and we would be
	 * reclaiming from the parent instead of the dead memcg.
	 */
	if (memcg && !mem_cgroup_online(memcg))
		return -ENOENT;

	for_each_node_state(nid, N_NORMAL_MEMORY) {
		unsigned long nr_to_walk = 1;

		shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
					    &shrink_memcg_cb, NULL, &nr_to_walk);
	}
	return shrunk ? 0 : -EAGAIN;
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists of moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to be not
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 09:38:09 +00:00
}

static void shrink_worker(struct work_struct *w)
{
	struct mem_cgroup *memcg;
mm: zswap: shrink until can accept
This update addresses an issue with the zswap reclaim mechanism, which
hinders the efficient offloading of cold pages to disk, thereby
compromising the preservation of the LRU order and consequently
diminishing, if not inverting, its performance benefits.
The functioning of the zswap shrink worker was found to be inadequate, as
shown by a basic benchmark test. For the test, a kernel build was utilized
as a reference, with its memory confined to 1G via a cgroup and a 5G swap
file provided. The results are presented below; these are averages of
three runs without the use of zswap:
real 46m26s
user 35m4s
sys 7m37s
With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:
real 56m4s
user 35m13s
sys 8m43s
written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit:1478
Besides the evident regression, one thing to notice from this data is the
extremely low number of written_back_pages and pool_limit_hit.
The pool_limit_hit counter, which is increased in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular scenario:
once zswap hits its limit, zswap_pool_reached_full is set to true; with
this flag on, zswap_frontswap_store rejects pages if zswap is still above
the acceptance threshold. Once we include the rejections due to
zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478
to a significant 21578266.
Zswap is stuck in an undesirable state where it rejects pages because it's
above the acceptance threshold, yet fails to attempt memory reclamation.
This happens because the shrink work is only queued when
zswap_frontswap_store detects that it's full and the work itself only
reclaims one page per run.
This state results in hot pages getting written directly to disk, while
cold ones remain in memory, waiting only to be invalidated. The LRU order is
completely broken and zswap ends up being just an overhead without
providing any benefits.
This commit applies 2 changes: a) the shrink worker is set to reclaim
pages until the acceptance threshold is met and b) the task is also
enqueued when zswap is not full but still above the threshold.
Testing this suggested update showed much better numbers:
real 36m37s
user 35m8s
sys 9m32s
written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653
Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-26 18:32:27 +00:00
	int ret, failures = 0;
	unsigned long thr;

	/* Reclaim down to the accept threshold */
	thr = zswap_accept_thr_pages();

	/* global reclaim will select cgroup in a round-robin fashion. */
	do {
		spin_lock(&zswap_shrink_lock);
		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
		memcg = zswap_next_shrink;

		/*
		 * We need to retry if we have gone through a full round trip, or if we
		 * got an offline memcg (or else we risk undoing the effect of the
		 * zswap memcg offlining cleanup callback). This is not catastrophic
		 * per se, but it will keep the now offlined memcg hostage for a while.
		 *
		 * Note that if we got an online memcg, we will keep the extra
		 * reference in case the original reference obtained by mem_cgroup_iter
		 * is dropped by the zswap memcg offlining callback, ensuring that the
		 * memcg is not killed when we are reclaiming.
		 */
		if (!memcg) {
			spin_unlock(&zswap_shrink_lock);

			if (++failures == MAX_RECLAIM_RETRIES)
				break;

			goto resched;
		}

		if (!mem_cgroup_tryget_online(memcg)) {
			/* drop the reference from mem_cgroup_iter() */
			mem_cgroup_iter_break(NULL, memcg);
			zswap_next_shrink = NULL;
			spin_unlock(&zswap_shrink_lock);

|
|
|
if (++failures == MAX_RECLAIM_RETRIES)
|
|
|
|
break;
|
2023-11-30 19:40:20 +00:00
|
|
|
|
|
|
|
goto resched;
|
2023-05-26 18:32:27 +00:00
|
|
|
}
|
2024-03-05 07:53:45 +00:00
|
|
|
spin_unlock(&zswap_shrink_lock);
|
2023-11-30 19:40:20 +00:00
|
|
|
|
|
|
|
ret = shrink_memcg(memcg);
|
|
|
|
/* drop the extra reference */
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
|
|
|
|
if (ret == -EINVAL)
|
|
|
|
break;
|
|
|
|
if (ret && ++failures == MAX_RECLAIM_RETRIES)
|
|
|
|
break;
|
|
|
|
resched:
|
2023-05-26 18:32:27 +00:00
|
|
|
cond_resched();
|
mm: zswap: optimize zswap pool size tracking
Profiling the munmap() of a zswapped memory region shows 60% of the total
cycles currently going into updating the zswap_pool_total_size.
There are three consumers of this counter:
- store, to enforce the globally configured pool limit
- meminfo & debugfs, to report the size to the user
- shrink, to determine the batch size for each cycle
Instead of aggregating every time an entry enters or exits the zswap
pool, aggregate the value from the zpools on demand:
- Stores aggregate the counter anyway upon success. Aggregating to
check the limit instead is the same amount of work.
- Meminfo & debugfs might benefit somewhat from a pre-aggregated
counter, but aren't exactly hotpaths.
- Shrinking can aggregate once for every cycle instead of doing it for
every freed entry. As the shrinker might work on tens or hundreds of
objects per scan cycle, this is a large reduction in aggregations.
The paths that benefit dramatically are swapin, swapoff, and unmaps.
There could be millions of pages being processed until somebody asks for
the pool size again. This eliminates the pool size updates from those
paths entirely.
Top profile entries for a 24G range munmap(), before:
38.54% zswap-unmap [kernel.kallsyms] [k] zs_zpool_total_size
12.51% zswap-unmap [kernel.kallsyms] [k] zpool_get_total_size
9.10% zswap-unmap [kernel.kallsyms] [k] zswap_update_total_size
2.95% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
2.88% zswap-unmap [kernel.kallsyms] [k] __slab_free
2.86% zswap-unmap [kernel.kallsyms] [k] xas_store
and after:
7.70% zswap-unmap [kernel.kallsyms] [k] __slab_free
7.16% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
6.74% zswap-unmap [kernel.kallsyms] [k] xas_store
It was also briefly considered to move to a single atomic in zswap
that is updated by the backends, since zswap only cares about the sum
of all pools anyway. However, zram directly needs per-pool information
out of zsmalloc. To keep the backend from having to update two atomics
every time, I opted for the lazy aggregation instead for now.
Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-12 15:34:11 +00:00
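The lazy aggregation described above can be sketched in a few lines; struct zpool_model and pools_model[] below are made-up illustration types, not the kernel's zpool API. No shared counter is touched on the store/free paths; the total is summed from the pools only when a caller asks for it.

#include <stddef.h>

/* Illustrative stand-in for a compressed-memory pool backend. */
struct zpool_model {
	unsigned long nr_pages;	/* pages currently held by this pool */
};

#define NR_POOLS_MODEL 4
static struct zpool_model pools_model[NR_POOLS_MODEL];

/*
 * On-demand aggregation: stores and frees never update a global
 * counter; limit checks, meminfo and the shrinker sum the pools
 * whenever they need the figure.
 */
static unsigned long zswap_total_pages_model(void)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < NR_POOLS_MODEL; i++)
		total += pools_model[i].nr_pages;
	return total;
}

The trade-off is a short walk over the pools per query in exchange for removing the counter update from the swapin, swapoff and unmap hot paths.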
|
|
|
} while (zswap_total_pages() > thr);
|
2020-01-31 06:15:04 +00:00
|
|
|
}
|
|
|
|
|
2024-04-13 02:24:06 +00:00
|
|
|
/*********************************
|
|
|
|
* same-filled functions
|
|
|
|
**********************************/
|
|
|
|
static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
a same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and allocation of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On an ARM Quad Core 32-bit device with 1.5GB RAM, by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages (including zero-filled pages) in zswap are
same-value filled pages and 14% of pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On Ubuntu PC with 2GB RAM, while executing kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap, 78000 (~22%) of pages were found to be same-value filled pages
(including zero-filled pages) and 64000 (~17%) were zero-filled pages. So
an average of 5% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 91ms 74ms 19%
(of same value pages)
*Zswap Load Time 50ms 7.5ms 85%
(of same value pages)
-----------------------------------------------------------------
*The execution times may vary with test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 00:15:59 +00:00
|
|
|
{
|
|
|
|
unsigned long *page;
|
2023-02-05 19:00:36 +00:00
|
|
|
unsigned long val;
|
|
|
|
unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
|
2024-04-13 02:24:06 +00:00
|
|
|
bool ret = false;
|
2018-02-01 00:15:59 +00:00
|
|
|
|
2024-04-13 02:24:06 +00:00
|
|
|
page = kmap_local_folio(folio, 0);
|
2023-02-05 19:00:36 +00:00
|
|
|
val = page[0];
|
|
|
|
|
|
|
|
if (val != page[last_pos])
|
2024-04-13 02:24:06 +00:00
|
|
|
goto out;
|
2023-02-05 19:00:36 +00:00
|
|
|
|
|
|
|
for (pos = 1; pos < last_pos; pos++) {
|
|
|
|
if (val != page[pos])
|
2024-04-13 02:24:06 +00:00
|
|
|
goto out;
|
2018-02-01 00:15:59 +00:00
|
|
|
}
|
2023-02-05 19:00:36 +00:00
|
|
|
|
|
|
|
*value = val;
|
2024-04-13 02:24:06 +00:00
|
|
|
ret = true;
|
|
|
|
out:
|
|
|
|
kunmap_local(page);
|
|
|
|
return ret;
|
2018-02-01 00:15:59 +00:00
|
|
|
}
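As a compact user-space model of the same-filled handling described in the commit message above, the sketch below uses illustrative helpers (page_is_same_filled_model(), fill_page_model()) rather than the kernel's functions: treat the page as an array of longs, report the repeated value if every word matches the first, and rebuild such a page on load with a plain fill instead of a decompression.

#include <stdbool.h>

#define PAGE_SIZE_MODEL 4096UL
#define WORDS_PER_PAGE_MODEL (PAGE_SIZE_MODEL / sizeof(unsigned long))

/* Return true and the repeated word if every long in the page is equal. */
static bool page_is_same_filled_model(const unsigned long *page,
				      unsigned long *value)
{
	unsigned long i;

	for (i = 1; i < WORDS_PER_PAGE_MODEL; i++) {
		if (page[i] != page[0])
			return false;
	}
	*value = page[0];
	return true;
}

/* On load, a same-filled entry is rebuilt by filling, not decompressing. */
static void fill_page_model(unsigned long *page, unsigned long value)
{
	unsigned long i;

	for (i = 0; i < WORDS_PER_PAGE_MODEL; i++)
		page[i] = value;
}

zswap_is_folio_same_filled() above performs the same scan, but compares the last word first, presumably so pages that differ only near the end bail out without walking the whole page.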
|
|
|
|
|
|
|
|
static void zswap_fill_page(void *ptr, unsigned long value)
|
|
|
|
{
|
|
|
|
unsigned long *page;
|
|
|
|
|
|
|
|
page = (unsigned long *)ptr;
|
|
|
|
memset_l(page, value, PAGE_SIZE / sizeof(unsigned long));
|
|
|
|
}
|
|
|
|
|
2024-04-13 02:24:06 +00:00
|
|
|
/*********************************
|
|
|
|
* main API
|
|
|
|
**********************************/
|
2023-07-15 04:23:40 +00:00
|
|
|
bool zswap_store(struct folio *folio)
|
2013-07-10 23:05:03 +00:00
|
|
|
{
|
2023-08-21 16:08:48 +00:00
|
|
|
swp_entry_t swp = folio->swap;
|
2023-07-17 16:02:27 +00:00
|
|
|
pgoff_t offset = swp_offset(swp);
|
2024-03-26 18:35:43 +00:00
|
|
|
struct xarray *tree = swap_zswap_tree(swp);
|
|
|
|
struct zswap_entry *entry, *old;
|
2022-05-19 21:08:53 +00:00
|
|
|
struct obj_cgroup *objcg = NULL;
|
2023-11-30 19:40:20 +00:00
|
|
|
struct mem_cgroup *memcg = NULL;
|
2024-04-13 02:24:06 +00:00
|
|
|
unsigned long value;
|
2023-07-17 16:02:27 +00:00
|
|
|
|
2023-07-15 04:23:40 +00:00
|
|
|
VM_WARN_ON_ONCE(!folio_test_locked(folio));
|
|
|
|
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
|
2013-07-10 23:05:03 +00:00
|
|
|
|
2023-07-15 04:23:40 +00:00
|
|
|
/* Large folios aren't supported */
|
|
|
|
if (folio_test_large(folio))
|
2023-07-17 16:02:27 +00:00
|
|
|
return false;
|
mm, swap, frontswap: fix THP swap if frontswap enabled
It was reported by Sergey Senozhatsky that if THP (Transparent Huge
Page) and frontswap (via zswap) are both enabled, when memory goes low
so that swap is triggered, segfault and memory corruption will occur in
random user space applications as follow,
kernel: urxvt[338]: segfault at 20 ip 00007fc08889ae0d sp 00007ffc73a7fc40 error 6 in libc-2.26.so[7fc08881a000+1ae000]
#0 0x00007fc08889ae0d _int_malloc (libc.so.6)
#1 0x00007fc08889c2f3 malloc (libc.so.6)
#2 0x0000560e6004bff7 _Z14rxvt_wcstoutf8PKwi (urxvt)
#3 0x0000560e6005e75c n/a (urxvt)
#4 0x0000560e6007d9f1 _ZN16rxvt_perl_interp6invokeEP9rxvt_term9hook_typez (urxvt)
#5 0x0000560e6003d988 _ZN9rxvt_term9cmd_parseEv (urxvt)
#6 0x0000560e60042804 _ZN9rxvt_term6pty_cbERN2ev2ioEi (urxvt)
#7 0x0000560e6005c10f _Z17ev_invoke_pendingv (urxvt)
#8 0x0000560e6005cb55 ev_run (urxvt)
#9 0x0000560e6003b9b9 main (urxvt)
#10 0x00007fc08883af4a __libc_start_main (libc.so.6)
#11 0x0000560e6003f9da _start (urxvt)
After bisection, it was found the first bad commit is bd4c82c22c36 ("mm,
THP, swap: delay splitting THP after swapped out").
The root cause is as follows:
When the pages are written to swap device during swapping out in
swap_writepage(), zswap (frontswap) tries to compress the pages to
improve performance. But zswap (frontswap) will treat a THP as a normal
page, so only the head page is saved. After swapping in, the tail pages
will not be restored to their original contents, causing memory
corruption in the applications.
This is fixed by refusing to save page in the frontswap store functions
if the page is a THP. So that the THP will be swapped out to swap
device.
Another choice is to split the THP if frontswap is enabled. But it was found
that frontswap enabling isn't flexible. For example, if
CONFIG_ZSWAP=y (cannot be module), frontswap will be enabled even if
zswap itself isn't enabled.
Frontswap has multiple backends, to make it easy for one backend to
enable THP support, the THP checking is put in backend frontswap store
functions instead of the general interfaces.
Link: http://lkml.kernel.org/r/20180209084947.22749-1-ying.huang@intel.com
Fixes: bd4c82c22c367e068 ("mm, THP, swap: delay splitting THP after swapped out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org> [put THP checking in backend]
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shaohua Li <shli@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: <stable@vger.kernel.org> [4.14]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-21 22:45:39 +00:00
|
|
|
|
2024-02-08 02:32:54 +00:00
|
|
|
if (!zswap_enabled)
|
2024-02-09 04:41:12 +00:00
|
|
|
goto check_old;
|
2024-02-08 02:32:54 +00:00
|
|
|
|
2024-03-12 15:34:11 +00:00
|
|
|
/* Check cgroup limits */
|
2023-07-15 04:23:41 +00:00
|
|
|
objcg = get_obj_cgroup_from_folio(folio);
|
2023-11-30 19:40:20 +00:00
|
|
|
if (objcg && !obj_cgroup_may_zswap(objcg)) {
|
|
|
|
memcg = get_mem_cgroup_from_objcg(objcg);
|
|
|
|
if (shrink_memcg(memcg)) {
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
goto reject;
|
|
|
|
}
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
}
|
2022-05-19 21:08:53 +00:00
|
|
|
|
2024-04-13 02:24:05 +00:00
|
|
|
if (zswap_check_limits())
|
2024-04-13 02:24:04 +00:00
|
|
|
goto reject;
|
2013-07-10 23:05:03 +00:00
|
|
|
|
|
|
|
/* allocate entry */
|
2024-01-30 01:36:44 +00:00
|
|
|
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
|
2013-07-10 23:05:03 +00:00
|
|
|
if (!entry) {
|
|
|
|
zswap_reject_kmemcache_fail++;
|
|
|
|
goto reject;
|
|
|
|
}
|
|
|
|
|
2024-04-13 02:24:06 +00:00
|
|
|
if (zswap_is_folio_same_filled(folio, &value)) {
|
|
|
|
entry->length = 0;
|
|
|
|
entry->value = value;
|
|
|
|
atomic_inc(&zswap_same_filled_pages);
|
|
|
|
goto store_entry;
|
2018-02-01 00:15:59 +00:00
|
|
|
}
|
|
|
|
|
2015-09-09 22:35:19 +00:00
|
|
|
/* if entry is successfully added, it keeps the reference */
|
|
|
|
entry->pool = zswap_pool_current_get();
|
2023-07-17 16:02:27 +00:00
|
|
|
if (!entry->pool)
|
2015-09-09 22:35:19 +00:00
|
|
|
goto freepage;
|
|
|
|
|
2023-11-30 19:40:20 +00:00
|
|
|
if (objcg) {
|
|
|
|
memcg = get_mem_cgroup_from_objcg(objcg);
|
2024-03-05 07:53:45 +00:00
|
|
|
if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
|
2023-11-30 19:40:20 +00:00
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
goto put_pool;
|
|
|
|
}
|
|
|
|
mem_cgroup_put(memcg);
|
|
|
|
}
|
|
|
|
|
2024-01-30 01:36:43 +00:00
|
|
|
if (!zswap_compress(folio, entry))
|
|
|
|
goto put_pool;
|
2020-12-15 03:14:18 +00:00
|
|
|
|
2024-04-13 02:24:06 +00:00
|
|
|
store_entry:
|
2024-01-30 01:36:44 +00:00
|
|
|
entry->swpentry = swp;
|
2022-05-19 21:08:53 +00:00
|
|
|
entry->objcg = objcg;
|
2024-03-26 18:35:43 +00:00
|
|
|
|
|
|
|
old = xa_store(tree, offset, entry, GFP_KERNEL);
|
|
|
|
if (xa_is_err(old)) {
|
|
|
|
int err = xa_err(old);
|
|
|
|
|
|
|
|
WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
|
|
|
|
zswap_reject_alloc_fail++;
|
|
|
|
goto store_failed;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We may have had an existing entry that became stale when
|
|
|
|
* the folio was redirtied and now the new version is being
|
|
|
|
* swapped out. Get rid of the old.
|
|
|
|
*/
|
|
|
|
if (old)
|
|
|
|
zswap_entry_free(old);
|
|
|
|
|
2022-05-19 21:08:53 +00:00
|
|
|
if (objcg) {
|
|
|
|
obj_cgroup_charge_zswap(objcg, entry->length);
|
|
|
|
count_objcg_event(objcg, ZSWPOUT);
|
|
|
|
}
|
|
|
|
|
2023-09-22 17:22:11 +00:00
|
|
|
/*
|
2024-03-26 18:35:43 +00:00
|
|
|
* We finish initializing the entry while it's already in xarray.
|
|
|
|
* This is safe because:
|
|
|
|
*
|
|
|
|
* 1. Concurrent stores and invalidations are excluded by folio lock.
|
|
|
|
*
|
|
|
|
* 2. Writeback is excluded by the entry not being on the LRU yet.
|
|
|
|
* The publishing order matters to prevent writeback from seeing
|
|
|
|
* an incoherent entry.
|
2023-09-22 17:22:11 +00:00
|
|
|
*/
|
2023-06-12 09:38:13 +00:00
|
|
|
if (entry->length) {
|
2023-11-30 19:40:20 +00:00
|
|
|
INIT_LIST_HEAD(&entry->lru);
|
2024-03-05 07:53:45 +00:00
|
|
|
zswap_lru_add(&zswap_list_lru, entry);
|
mm: zswap: add pool shrinking mechanism
Patch series "mm: zswap: move writeback LRU from zpool to zswap", v3.
This series aims to improve the zswap reclaim mechanism by reorganizing
the LRU management. In the current implementation, the LRU is maintained
within each zpool driver, resulting in duplicated code across the three
drivers. The proposed change consists in moving the LRU management from
the individual implementations up to the zswap layer.
The primary objective of this refactoring effort is to simplify the
codebase. By unifying the reclaim loop and consolidating LRU handling
within zswap, we can eliminate redundant code and improve
maintainability. Additionally, this change enables the reclamation of
stored pages in their actual LRU order. Presently, the zpool drivers
link backing pages in an LRU, causing compressed pages with different
LRU positions to be written back simultaneously.
The series consists of several patches. The first patch implements the
LRU and the reclaim loop in zswap, but it is not used yet because all
three driver implementations are marked as zpool_evictable.
The following three commits modify each zpool driver to not be
zpool_evictable, allowing the use of the reclaim loop in zswap.
As the drivers removed their shrink functions, the zpool interface is
then trimmed by removing zpool_evictable, zpool_ops, and zpool_shrink.
Finally, the code in zswap is further cleaned up by simplifying the
writeback function and removing the now unnecessary zswap_header.
This patch (of 7):
Each zpool driver (zbud, z3fold and zsmalloc) implements its own shrink
function, which is called from zpool_shrink. However, with this commit, a
unified shrink function is added to zswap. The ultimate goal is to
eliminate the need for zpool_shrink once all zpool implementations have
dropped their shrink code.
To ensure the functionality of each commit, this change focuses solely on
adding the mechanism itself. No modifications are made to the backends,
meaning that functionally, there are no immediate changes. The zswap
mechanism will only come into effect once the backends have removed their
shrink code. The subsequent commits will address the modifications needed
in the backends.
Link: https://lkml.kernel.org/r/20230612093815.133504-1-cerasuolodomenico@gmail.com
Link: https://lkml.kernel.org/r/20230612093815.133504-2-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 09:38:09 +00:00
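As a simplified sketch of the unified reclaim loop this series moves into zswap, the code below uses an illustrative intrusive list (struct entry_model, lru_tail_model) rather than the kernel's list_lru machinery: entries sit on a single LRU owned by zswap, and each shrink step writes back the entry at the cold end.

#include <stdbool.h>
#include <stddef.h>

/* Illustrative doubly linked LRU of compressed entries. */
struct entry_model {
	struct entry_model *prev, *next;
};

static struct entry_model *lru_head_model;	/* most recently used  */
static struct entry_model *lru_tail_model;	/* least recently used */

/* Unlink an entry once it has been written back to the swap device. */
static void lru_del_model(struct entry_model *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		lru_head_model = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		lru_tail_model = e->prev;
}

/* One shrink step: reclaim from the cold end of the LRU. */
static bool shrink_one_model(void)
{
	struct entry_model *victim = lru_tail_model;

	if (!victim)
		return false;	/* nothing left to reclaim */
	/* decompress + write back would happen here */
	lru_del_model(victim);
	return true;
}

Keeping the LRU in zswap rather than in each zpool backend is what lets writeback pick victims in true LRU order instead of whatever order the backend happened to link its backing pages.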
|
|
|
}
|
2013-07-10 23:05:03 +00:00
|
|
|
|
|
|
|
/* update stats */
|
|
|
|
atomic_inc(&zswap_stored_pages);
|
2022-05-19 21:08:53 +00:00
|
|
|
count_vm_event(ZSWPOUT);
|
2013-07-10 23:05:03 +00:00
|
|
|
|
2023-07-17 16:02:27 +00:00
|
|
|
return true;
|
2013-07-10 23:05:03 +00:00
|
|
|
|
2024-03-26 18:35:43 +00:00
|
|
|
store_failed:
|
|
|
|
if (!entry->length)
|
|
|
|
atomic_dec(&zswap_same_filled_pages);
|
|
|
|
else {
|
|
|
|
zpool_free(zswap_find_zpool(entry), entry->handle);
|
2023-11-30 19:40:20 +00:00
|
|
|
put_pool:
|
2024-03-26 18:35:43 +00:00
|
|
|
zswap_pool_put(entry->pool);
|
|
|
|
}
|
2015-09-09 22:35:19 +00:00
|
|
|
freepage:
|
2013-07-10 23:05:03 +00:00
|
|
|
zswap_entry_cache_free(entry);
|
|
|
|
reject:
|
2024-03-16 01:58:03 +00:00
|
|
|
obj_cgroup_put(objcg);
|
2024-04-13 02:24:04 +00:00
|
|
|
if (zswap_pool_reached_full)
|
|
|
|
queue_work(shrink_wq, &zswap_shrink_work);
|
2024-02-09 04:41:12 +00:00
|
|
|
check_old:
|
|
|
|
/*
|
|
|
|
* If the zswap store fails or zswap is disabled, we must invalidate the
|
|
|
|
* possibly stale entry which was previously stored at this offset.
|
|
|
|
* Otherwise, writeback could overwrite the new data in the swapfile.
|
|
|
|
*/
|
2024-03-26 18:35:43 +00:00
|
|
|
entry = xa_erase(tree, offset);
|
2024-02-09 04:41:12 +00:00
|
|
|
if (entry)
|
2024-03-26 18:35:43 +00:00
|
|
|
zswap_entry_free(entry);
|
2023-07-17 16:02:27 +00:00
|
|
|
return false;
|
2013-07-10 23:05:03 +00:00
|
|
|
}
|
|
|
|
|
2023-07-15 04:23:43 +00:00
|
|
|
bool zswap_load(struct folio *folio)
|
2013-07-10 23:05:03 +00:00
|
|
|
{
|
2023-08-21 16:08:48 +00:00
|
|
|
swp_entry_t swp = folio->swap;
|
2023-07-17 16:02:27 +00:00
|
|
|
pgoff_t offset = swp_offset(swp);
|
2023-07-15 04:23:43 +00:00
|
|
|
struct page *page = &folio->page;
|
2024-03-24 21:04:47 +00:00
|
|
|
bool swapcache = folio_test_swapcache(folio);
|
2024-03-26 18:35:43 +00:00
|
|
|
struct xarray *tree = swap_zswap_tree(swp);
|
2013-07-10 23:05:03 +00:00
|
|
|
struct zswap_entry *entry;
|
2023-12-28 09:45:43 +00:00
|
|
|
u8 *dst;
|
2023-07-17 16:02:27 +00:00
|
|
|
|
2023-07-15 04:23:43 +00:00
|
|
|
VM_WARN_ON_ONCE(!folio_test_locked(folio));
|
2013-07-10 23:05:03 +00:00
|
|
|
|
2024-03-24 21:04:47 +00:00
|
|
|
/*
|
|
|
|
* When reading into the swapcache, invalidate our entry. The
|
|
|
|
* swapcache can be the authoritative owner of the page and
|
|
|
|
* its mappings, and the pressure that results from having two
|
|
|
|
* in-memory copies outweighs any benefits of caching the
|
|
|
|
* compression work.
|
|
|
|
*
|
|
|
|
* (Most swapins go through the swapcache. The notable
|
|
|
|
* exception is the singleton fault on SWP_SYNCHRONOUS_IO
|
|
|
|
* files, which reads into a private page and may free it if
|
|
|
|
* the fault fails. We remain the primary owner of the entry.)
|
|
|
|
*/
|
|
|
|
if (swapcache)
|
2024-03-26 18:35:43 +00:00
|
|
|
entry = xa_erase(tree, offset);
|
|
|
|
else
|
|
|
|
entry = xa_load(tree, offset);
|
|
|
|
|
|
|
|
if (!entry)
|
|
|
|
return false;
|
2013-07-10 23:05:03 +00:00
|
|
|
|
2023-12-28 09:45:44 +00:00
|
|
|
if (entry->length)
|
2024-01-30 01:36:42 +00:00
|
|
|
zswap_decompress(entry, page);
|
2023-12-28 09:45:44 +00:00
|
|
|
else {
|
2023-11-27 15:55:21 +00:00
|
|
|
dst = kmap_local_page(page);
|
2018-02-01 00:15:59 +00:00
|
|
|
zswap_fill_page(dst, entry->value);
|
2023-11-27 15:55:21 +00:00
|
|
|
kunmap_local(dst);
|
zswap: same-filled pages handling
Zswap is a cache which compresses the pages that are being swapped out
and stores them into a dynamically allocated RAM-based memory pool.
Experiments have shown that around 10-20% of pages stored in zswap are
same-filled pages (i.e. contents of the page are all same), but these
pages are handled as normal pages by compressing and allocating memory
in the pool.
This patch adds a check in zswap_frontswap_store() to identify
same-filled page before compression of the page. If the page is a
same-filled page, set zswap_entry.length to zero, save the same-filled
value and skip the compression of the page and alloction of memory in
zpool. In zswap_frontswap_load(), check if value of zswap_entry.length
is zero corresponding to the page to be loaded. If zswap_entry.length
is zero, fill the page with same-filled value. This saves the
decompression time during load.
On a ARM Quad Core 32-bit device with 1.5GB RAM by launching and
relaunching different applications, out of ~64000 pages stored in zswap,
~11000 pages were same-value filled pages (including zero-filled pages)
and ~9000 pages were zero-filled pages.
An average of 17% of pages(including zero-filled pages) in zswap are
same-value filled pages and 14% pages are zero-filled pages. An average
of 3% of pages are same-filled non-zero pages.
The below table shows the execution time profiling with the patch.
Baseline With patch % Improvement
-----------------------------------------------------------------
*Zswap Store Time 26.5ms 18ms 32%
(of same value pages)
*Zswap Load Time
(of same value pages) 25.5ms 13ms 49%
-----------------------------------------------------------------
On an Ubuntu PC with 2GB RAM, while executing a kernel build and other test
scripts and running multimedia applications, out of 360000 pages stored
in zswap, 78000 (~22%) were found to be same-value-filled pages
(including zero-filled pages) and 64000 (~17%) were zero-filled pages. So
on average 5% of pages were same-filled non-zero pages.
The table below shows the execution time profiling with the patch.
                              Baseline    With patch    % Improvement
 -----------------------------------------------------------------
 *Zswap Store Time                91ms       74ms            19%
  (of same value pages)
 *Zswap Load Time                 50ms      7.5ms            85%
  (of same value pages)
 -----------------------------------------------------------------
*The execution times may vary with the test device used.
Dan said:
: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[srividya.dr@samsung.com: add memset_l instead of for loop]
Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
Cc: SHARAN ALLUR <sharan.allur@samsung.com>
Cc: RAJIB BASU <rajib.basu@samsung.com>
Cc: JUHUN KIM <juhunkim@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Timofey Titovets <nefelim4ag@gmail.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01 00:15:59 +00:00
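As a rough illustration of the mechanism described in the commit message above, here is a minimal, hedged sketch in kernel-style C of the two halves: a store-side scan that checks whether every word of the page holds the same value, and a load-side fill that rebuilds such a page with memset_l(). The _sketch helpers are illustrative names, not the file's actual functions; only zswap_fill_page(), used in the code above, is taken from the file itself.

#include <linux/mm.h>		/* PAGE_SIZE */
#include <linux/string.h>	/* memset_l() */

/* Store side: report the fill value if every word of the page matches. */
static bool zswap_is_page_same_filled_sketch(void *ptr, unsigned long *value)
{
	unsigned long *word = ptr;
	unsigned long val = word[0];
	unsigned int pos;

	for (pos = 1; pos < PAGE_SIZE / sizeof(*word); pos++) {
		if (word[pos] != val)
			return false;	/* mixed content: fall back to compression */
	}
	*value = val;			/* caller stores only this value, length = 0 */
	return true;
}

/* Load side: rebuild the page from the saved value, no decompression needed. */
static void zswap_fill_page_sketch(void *ptr, unsigned long value)
{
	memset_l(ptr, value, PAGE_SIZE / sizeof(unsigned long));
}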
|
|
|
}
|
|
|
|
|
2022-05-19 21:08:53 +00:00
|
|
|
count_vm_event(ZSWPIN);
|
2022-05-19 21:08:53 +00:00
|
|
|
if (entry->objcg)
|
|
|
|
count_objcg_event(entry->objcg, ZSWPIN);
|
2023-12-28 09:45:42 +00:00
|
|
|
|
2024-03-24 21:04:47 +00:00
|
|
|
if (swapcache) {
|
|
|
|
zswap_entry_free(entry);
|
|
|
|
folio_mark_dirty(folio);
|
|
|
|
}
|
2024-02-04 03:06:03 +00:00
|
|
|
|
2023-12-28 09:45:44 +00:00
|
|
|
return true;
|
2013-07-10 23:05:03 +00:00
|
|
|
}
|
|
|
|
|
2024-02-04 03:06:00 +00:00
|
|
|
void zswap_invalidate(swp_entry_t swp)
|
2013-07-10 23:05:03 +00:00
|
|
|
{
|
2024-02-04 03:06:00 +00:00
|
|
|
pgoff_t offset = swp_offset(swp);
|
2024-03-26 18:35:43 +00:00
|
|
|
struct xarray *tree = swap_zswap_tree(swp);
|
2013-07-10 23:05:03 +00:00
|
|
|
struct zswap_entry *entry;
|
|
|
|
|
2024-03-26 18:35:43 +00:00
|
|
|
entry = xa_erase(tree, offset);
|
2024-01-30 01:36:45 +00:00
|
|
|
if (entry)
|
2024-03-26 18:35:43 +00:00
|
|
|
zswap_entry_free(entry);
|
2013-07-10 23:05:03 +00:00
|
|
|
}
|
|
|
|
|
2024-01-19 11:22:23 +00:00
|
|
|
int zswap_swapon(int type, unsigned long nr_pages)
|
2023-07-17 16:02:27 +00:00
|
|
|
{
|
2024-03-26 18:35:43 +00:00
|
|
|
struct xarray *trees, *tree;
|
2024-01-19 11:22:23 +00:00
|
|
|
unsigned int nr, i;
|
2023-07-17 16:02:27 +00:00
|
|
|
|
2024-01-19 11:22:23 +00:00
|
|
|
nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
|
|
|
|
trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
|
|
|
|
if (!trees) {
|
2023-07-17 16:02:27 +00:00
|
|
|
pr_err("alloc failed, zswap disabled for swap type %d\n", type);
|
mm/zswap: make sure each swapfile always have zswap rb-tree
Patch series "mm/zswap: optimize the scalability of zswap rb-tree", v2.
When testing zswap performance with a kernel build -j32 in a tmpfs
directory, I found that the scalability of the zswap rb-tree, which is
protected by a single spinlock, is not good. That causes heavy lock
contention when multiple tasks zswap_store/load concurrently.
So a simple solution is to split the single zswap rb-tree into multiple
rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea
is from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks").
Although this method can't solve the spinlock contention completely, it
can mitigate much of that contention. Below are the results of a kernel
build in tmpfs with the zswap shrinker enabled:
              linux-next    zswap-lock-optimize
 real           1m9.181s          1m3.820s
 user          17m44.036s        17m40.100s
 sys            7m37.297s         4m54.622s
So there are clear improvements. This is also complementary to the
ongoing zswap xarray conversion by Chris, so I think we can merge this
first; I have just refreshed and resent it for further discussion.
This patch (of 2):
Not all zswap interfaces can handle the absence of the zswap rb-tree;
currently only zswap_store() handles it.
To keep things simple, make sure each swapfile always has its zswap
rb-tree prepared before being enabled and used. The preparation is
unlikely to fail in practice; this patch just makes that explicit.
Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-0-b5cc55479090@bytedance.com
Link: https://lkml.kernel.org/r/20240117-b4-zswap-lock-optimize-v2-1-b5cc55479090@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Chris Li <chriscli@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-19 11:22:22 +00:00
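A hedged sketch of the lookup implied by the split described above: each tree covers SWAP_ADDRESS_SPACE_PAGES worth of swap slots, so the tree for an entry is picked by shifting the swap offset down by SWAP_ADDRESS_SPACE_SHIFT. swap_zswap_tree() is the helper actually used in this file; the body shown here is an assumption for illustration.

#include <linux/swapops.h>	/* swp_type(), swp_offset() */
#include <linux/xarray.h>

/* Pick the per-64MB tree that covers this swap entry (illustrative body). */
static struct xarray *swap_zswap_tree_sketch(swp_entry_t swp)
{
	return &zswap_trees[swp_type(swp)]
			   [swp_offset(swp) >> SWAP_ADDRESS_SPACE_SHIFT];
}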
|
|
|
return -ENOMEM;
|
2023-07-17 16:02:27 +00:00
|
|
|
}
|
|
|
|
|
2024-03-26 18:35:43 +00:00
|
|
|
for (i = 0; i < nr; i++)
|
|
|
|
xa_init(trees + i);
|
2024-01-19 11:22:23 +00:00
|
|
|
|
|
|
|
nr_zswap_trees[type] = nr;
|
|
|
|
zswap_trees[type] = trees;
|
2024-01-19 11:22:22 +00:00
|
|
|
return 0;
|
2023-07-17 16:02:27 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void zswap_swapoff(int type)
|
2013-07-10 23:05:03 +00:00
|
|
|
{
|
2024-03-26 18:35:43 +00:00
|
|
|
struct xarray *trees = zswap_trees[type];
|
2024-01-19 11:22:23 +00:00
|
|
|
unsigned int i;
|
2013-07-10 23:05:03 +00:00
|
|
|
|
2024-01-19 11:22:23 +00:00
|
|
|
if (!trees)
|
2013-07-10 23:05:03 +00:00
|
|
|
return;
|
|
|
|
|
2024-01-24 04:51:12 +00:00
|
|
|
/* try_to_unuse() invalidated all the entries already */
|
|
|
|
for (i = 0; i < nr_zswap_trees[type]; i++)
|
2024-03-26 18:35:43 +00:00
|
|
|
WARN_ON_ONCE(!xa_empty(trees + i));
|
2024-01-19 11:22:23 +00:00
|
|
|
|
|
|
|
kvfree(trees);
|
|
|
|
nr_zswap_trees[type] = 0;
|
2013-10-16 20:46:54 +00:00
|
|
|
zswap_trees[type] = NULL;
|
2013-07-10 23:05:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*********************************
|
|
|
|
* debugfs functions
|
|
|
|
**********************************/
|
|
|
|
#ifdef CONFIG_DEBUG_FS
|
|
|
|
#include <linux/debugfs.h>
|
|
|
|
|
|
|
|
static struct dentry *zswap_debugfs_root;
|
|
|
|
|
mm: zswap: optimize zswap pool size tracking
Profiling the munmap() of a zswapped memory region shows 60% of the total
cycles currently going into updating the zswap_pool_total_size.
There are three consumers of this counter:
- store, to enforce the globally configured pool limit
- meminfo & debugfs, to report the size to the user
- shrink, to determine the batch size for each cycle
Instead of aggregating every time an entry enters or exits the zswap
pool, aggregate the value from the zpools on-demand:
- Stores aggregate the counter anyway upon success. Aggregating to
check the limit instead is the same amount of work.
- Meminfo & debugfs might benefit somewhat from a pre-aggregated
counter, but aren't exactly hotpaths.
- Shrinking can aggregate once for every cycle instead of doing it for
every freed entry. As the shrinker might work on tens or hundreds of
objects per scan cycle, this is a large reduction in aggregations.
The paths that benefit dramatically are swapin, swapoff, and unmaps.
There could be millions of pages being processed until somebody asks for
the pool size again. This eliminates the pool size updates from those
paths entirely.
Top profile entries for a 24G range munmap(), before:
38.54% zswap-unmap [kernel.kallsyms] [k] zs_zpool_total_size
12.51% zswap-unmap [kernel.kallsyms] [k] zpool_get_total_size
9.10% zswap-unmap [kernel.kallsyms] [k] zswap_update_total_size
2.95% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
2.88% zswap-unmap [kernel.kallsyms] [k] __slab_free
2.86% zswap-unmap [kernel.kallsyms] [k] xas_store
and after:
7.70% zswap-unmap [kernel.kallsyms] [k] __slab_free
7.16% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
6.74% zswap-unmap [kernel.kallsyms] [k] xas_store
It was also briefly considered to move to a single atomic in zswap
that is updated by the backends, since zswap only cares about the sum
of all pools anyway. However, zram directly needs per-pool information
out of zsmalloc. To keep the backend from having to update two atomics
every time, I opted for the lazy aggregation instead for now.
Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-12 15:34:11 +00:00
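A hedged sketch of the on-demand aggregation described above: rather than maintaining zswap_pool_total_size on every store and free, walk the pools and sum their zpool usage only when a consumer asks (the store-time limit check, meminfo/debugfs, or shrinker batch sizing). ZSWAP_NR_ZPOOLS and the exact zpool accessor are assumptions here; debugfs_get_total_size() below simply multiplies the aggregated page count by PAGE_SIZE.

/* Sum zpool usage across all pools, in pages (illustrative only). */
static unsigned long zswap_total_pages_sketch(void)
{
	struct zswap_pool *pool;
	unsigned long total = 0;
	int i;

	rcu_read_lock();
	list_for_each_entry_rcu(pool, &zswap_pools, list) {
		for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
			total += zpool_get_total_size(pool->zpools[i]) >> PAGE_SHIFT;
	}
	rcu_read_unlock();

	return total;
}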
|
|
|
static int debugfs_get_total_size(void *data, u64 *val)
|
|
|
|
{
|
|
|
|
*val = zswap_total_pages() * PAGE_SIZE;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu\n");
|
|
|
|
|
2023-04-03 12:13:18 +00:00
|
|
|
static int zswap_debugfs_init(void)
|
2013-07-10 23:05:03 +00:00
|
|
|
{
|
|
|
|
if (!debugfs_initialized())
|
|
|
|
return -ENODEV;
|
|
|
|
|
|
|
|
zswap_debugfs_root = debugfs_create_dir("zswap", NULL);
|
|
|
|
|
2018-06-14 22:27:58 +00:00
|
|
|
debugfs_create_u64("pool_limit_hit", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_pool_limit_hit);
|
|
|
|
debugfs_create_u64("reject_reclaim_fail", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_reclaim_fail);
|
|
|
|
debugfs_create_u64("reject_alloc_fail", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_alloc_fail);
|
|
|
|
debugfs_create_u64("reject_kmemcache_fail", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_kmemcache_fail);
|
2023-10-24 23:45:09 +00:00
|
|
|
debugfs_create_u64("reject_compress_fail", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_compress_fail);
|
2018-06-14 22:27:58 +00:00
|
|
|
debugfs_create_u64("reject_compress_poor", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_reject_compress_poor);
|
|
|
|
debugfs_create_u64("written_back_pages", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_written_back_pages);
|
2024-03-12 15:34:11 +00:00
|
|
|
debugfs_create_file("pool_total_size", 0444,
|
|
|
|
zswap_debugfs_root, NULL, &total_size_fops);
|
2018-06-14 22:27:58 +00:00
|
|
|
debugfs_create_atomic_t("stored_pages", 0444,
|
|
|
|
zswap_debugfs_root, &zswap_stored_pages);
|
2018-02-01 00:15:59 +00:00
|
|
|
debugfs_create_atomic_t("same_filled_pages", 0444,
|
2018-06-14 22:27:58 +00:00
|
|
|
zswap_debugfs_root, &zswap_same_filled_pages);
|
2013-07-10 23:05:03 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#else
|
2023-04-03 12:13:18 +00:00
|
|
|
static int zswap_debugfs_init(void)
|
2013-07-10 23:05:03 +00:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*********************************
|
|
|
|
* module init and exit
|
|
|
|
**********************************/
|
2023-04-03 12:13:18 +00:00
|
|
|
static int zswap_setup(void)
|
2013-07-10 23:05:03 +00:00
|
|
|
{
|
2015-09-09 22:35:19 +00:00
|
|
|
struct zswap_pool *pool;
|
2016-11-26 23:13:39 +00:00
|
|
|
int ret;
|
2014-04-07 22:38:27 +00:00
|
|
|
|
2023-04-03 12:13:16 +00:00
|
|
|
zswap_entry_cache = KMEM_CACHE(zswap_entry, 0);
|
|
|
|
if (!zswap_entry_cache) {
|
2013-07-10 23:05:03 +00:00
|
|
|
pr_err("entry cache creation failed\n");
|
2015-09-09 22:35:19 +00:00
|
|
|
goto cache_fail;
|
2013-07-10 23:05:03 +00:00
|
|
|
}
|
2015-09-09 22:35:19 +00:00
|
|
|
|
2016-11-26 23:13:40 +00:00
|
|
|
ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
|
|
|
|
"mm/zswap_pool:prepare",
|
|
|
|
zswap_cpu_comp_prepare,
|
|
|
|
zswap_cpu_comp_dead);
|
|
|
|
if (ret)
|
|
|
|
goto hp_fail;
|
|
|
|
|
mm/zswap: global lru and shrinker shared by all zswap_pools
Patch series "mm/zswap: optimize for dynamic zswap_pools", v3.
Dynamic pool creation has been supported for a long time, though it is
maybe not used much in practice. But with the per-memcg lru merged, the
current structure of per-pool lru and shrinker has become less optimal.
In the current structure, each zswap_pool has its own lru, shrinker and
shrink_work, but only the latest zswap_pool is actually used.
1. When there is memory pressure, the shrinkers of all zswap_pools try to
   shrink their own lru lists, with no ordering between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
   tries to shrink its own lru, which is inefficient.
A more natural way is to have a global zswap lru shared between all
zswap_pools, and likewise for the shrinker. The code becomes much simpler too.
Another optimization is changing the zswap_pool kref to a percpu_ref, which
is referenced by every zswap entry, so scalability is better.
Testing kernel build (32 threads) in tmpfs with memory.max=2GB (zswap
shrinker and writeback enabled with one 50GB swapfile, on a 128-CPU
x86-64 machine; below is the average of 5 runs):
            mm-unstable    zswap-global-lru
 real           63.20            63.12
 user         1061.75          1062.95
 sys           268.74           264.44
This patch (of 3):
Dynamic zswap_pool creation may create/reuse multiple zswap_pools in a
list, of which only the first is actually used.
Each zswap_pool has its own lru and shrinker, which is unnecessary and
has its problems:
1. When there is memory pressure, the shrinkers of all zswap_pools try
   to shrink their own lrus, with no ordering between them.
2. When the zswap limit is hit, only the last zswap_pool's shrink_work
   tries to shrink its lru list. The rationale here was to
   try to empty the old pool first so that we can completely
   drop it. However, since we only support exclusive loads now,
   the LRU ordering should be entirely decided by the order of
   stores, so the oldest entries on the LRU will naturally be
   from the oldest pool.
Anyway, having a global lru and shrinker shared by all zswap_pools is
better and more efficient.
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-0-200495333595@bytedance.com
Link: https://lkml.kernel.org/r/20240210-zswap-global-lru-v3-1-200495333595@bytedance.com
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-16 08:55:04 +00:00
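To illustrate the "one global LRU, one shrinker" point above, here is a hedged sketch of what the shrinker's count callback can look like once every pool feeds the single zswap_list_lru: one list_lru count covers all pools, instead of each pool's shrinker walking its own list. The callback name is made up; the real callbacks are installed by zswap_alloc_shrinker(), used in the setup code below.

/* Count reclaimable zswap entries across all pools via the shared LRU. */
static unsigned long zswap_shrinker_count_sketch(struct shrinker *shrinker,
						 struct shrink_control *sc)
{
	return list_lru_shrink_count(&zswap_list_lru, sc);
}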
|
|
|
shrink_wq = alloc_workqueue("zswap-shrink",
|
|
|
|
WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
|
|
|
|
if (!shrink_wq)
|
|
|
|
goto shrink_wq_fail;
|
|
|
|
|
2024-03-05 07:53:45 +00:00
|
|
|
zswap_shrinker = zswap_alloc_shrinker();
|
|
|
|
if (!zswap_shrinker)
|
2024-02-16 08:55:04 +00:00
|
|
|
goto shrinker_fail;
|
2024-03-05 07:53:45 +00:00
|
|
|
if (list_lru_init_memcg(&zswap_list_lru, zswap_shrinker))
|
2024-02-16 08:55:04 +00:00
|
|
|
goto lru_fail;
|
2024-03-05 07:53:45 +00:00
|
|
|
shrinker_register(zswap_shrinker);
|
2024-02-16 08:55:04 +00:00
|
|
|
|
2024-03-05 07:53:45 +00:00
|
|
|
INIT_WORK(&zswap_shrink_work, shrink_worker);
|
2024-02-16 08:55:04 +00:00
|
|
|
|
2015-09-09 22:35:19 +00:00
|
|
|
pool = __zswap_pool_create_fallback();
|
2017-02-27 22:26:47 +00:00
|
|
|
if (pool) {
|
|
|
|
pr_info("loaded using pool %s/%s\n", pool->tfm_name,
|
2023-06-20 19:46:44 +00:00
|
|
|
zpool_get_type(pool->zpools[0]));
|
2017-02-27 22:26:47 +00:00
|
|
|
list_add(&pool->list, &zswap_pools);
|
|
|
|
zswap_has_pool = true;
|
|
|
|
} else {
|
2015-09-09 22:35:19 +00:00
|
|
|
pr_err("pool creation failed\n");
|
2017-02-27 22:26:47 +00:00
|
|
|
zswap_enabled = false;
|
2013-07-10 23:05:03 +00:00
|
|
|
}
|
2014-04-07 22:38:27 +00:00
|
|
|
|
2013-07-10 23:05:03 +00:00
|
|
|
if (zswap_debugfs_init())
|
|
|
|
pr_warn("debugfs initialization failed\n");
|
2023-04-03 12:13:17 +00:00
|
|
|
zswap_init_state = ZSWAP_INIT_SUCCEED;
|
2013-07-10 23:05:03 +00:00
|
|
|
return 0;
|
2015-09-09 22:35:19 +00:00
|
|
|
|
2024-02-16 08:55:04 +00:00
|
|
|
lru_fail:
|
2024-03-05 07:53:45 +00:00
|
|
|
shrinker_free(zswap_shrinker);
|
2024-02-16 08:55:04 +00:00
|
|
|
shrinker_fail:
|
|
|
|
destroy_workqueue(shrink_wq);
|
|
|
|
shrink_wq_fail:
|
|
|
|
cpuhp_remove_multi_state(CPUHP_MM_ZSWP_POOL_PREPARE);
|
2016-11-26 23:13:40 +00:00
|
|
|
hp_fail:
|
2023-04-03 12:13:16 +00:00
|
|
|
kmem_cache_destroy(zswap_entry_cache);
|
2015-09-09 22:35:19 +00:00
|
|
|
cache_fail:
|
zswap: disable changing params if init fails
Add a zswap_init_failed bool that prevents changing any of the module
params if init_zswap() fails, and set zswap_enabled to false. Change the
'enabled' param to a callback, and check zswap_init_failed before
allowing any change to the 'enabled', 'zpool', or 'compressor' params.
Any driver that is built in to the kernel will not be unloaded if its
init function returns an error, and its module params remain accessible
for users to change via sysfs. Since zswap uses param callbacks, which
assume that zswap has been initialized, changing the zswap params after
a failed initialization will result in a WARNING because the param
callbacks expect a pool to already exist. This patch prevents that by
exiting any of the param callbacks immediately if initialization failed.
This was reported here:
https://marc.info/?l=linux-mm&m=147004228125528&w=4
And fixes this WARNING:
[ 429.723476] WARNING: CPU: 0 PID: 5140 at mm/zswap.c:503 __zswap_pool_current+0x56/0x60
The warning is just noise, and not serious. However, when init fails,
zswap frees all its percpu dstmem pages and its kmem cache. The kmem
cache might be serious, if kmem_cache_alloc(NULL, gfp) has problems; but
the percpu dstmem pages are definitely a problem, as they're used as a
temporary buffer for compressed pages before they are copied into place
in the zpool.
If the user does get zswap enabled after an init failure, then zswap
will likely Oops on the first page it tries to compress (or worse, start
corrupting memory).
Fixes: 90b0fc26d5db ("zswap: change zpool/compressor at runtime")
Link: http://lkml.kernel.org/r/20170124200259.16191-2-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Marcin Miroslaw <marcin@mejor.pl>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-03 21:13:09 +00:00
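A hedged sketch of the guarded parameter callback described above: once initialization has failed (zswap_init_state == ZSWAP_INIT_FAILED, as set just below), refuse to change the parameter; otherwise fall through to the stock bool handling. The _sketch names are illustrative; the real zswap 'enabled' callback also kicks off setup on first enable.

static int zswap_enabled_param_set_sketch(const char *val,
					  const struct kernel_param *kp)
{
	/* Refuse any change once init has failed; see the comment below. */
	if (zswap_init_state == ZSWAP_INIT_FAILED) {
		pr_err("can't enable, initialization failed\n");
		return -ENODEV;
	}
	return param_set_bool(val, kp);
}

static const struct kernel_param_ops zswap_enabled_param_ops_sketch = {
	.set = zswap_enabled_param_set_sketch,
	.get = param_get_bool,
};

/* Wired up with something like:
 * module_param_cb(enabled, &zswap_enabled_param_ops_sketch, &zswap_enabled, 0644);
 */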
|
|
|
/* if built-in, we aren't unloaded on failure; don't allow use */
|
2023-04-03 12:13:17 +00:00
|
|
|
zswap_init_state = ZSWAP_INIT_FAILED;
|
2017-02-03 21:13:09 +00:00
|
|
|
zswap_enabled = false;
|
2013-07-10 23:05:03 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2023-04-03 12:13:18 +00:00
|
|
|
|
|
|
|
static int __init zswap_init(void)
|
|
|
|
{
|
|
|
|
if (!zswap_enabled)
|
|
|
|
return 0;
|
|
|
|
return zswap_setup();
|
|
|
|
}
|
2013-07-10 23:05:03 +00:00
|
|
|
/* must be late so crypto has time to come up */
|
2023-04-03 12:13:18 +00:00
|
|
|
late_initcall(zswap_init);
|
2013-07-10 23:05:03 +00:00
|
|
|
|
2014-11-13 03:08:46 +00:00
|
|
|
MODULE_AUTHOR("Seth Jennings <sjennings@variantweb.net>");
|
2013-07-10 23:05:03 +00:00
|
|
|
MODULE_DESCRIPTION("Compressed cache for swap pages");
|