As GDSCs are turned on and off some associated clocks are momentarily
enabled for house keeping purposes. For this, and similar, purposes the
"shared RCGs" will park the RCG on a source clock which is known to be
available.
When the RCG is parked, a safe clock source will be selected and
committed, then the original source would be written back and upon enable
the change back to the unparked source would be committed.
But starting with SM8350 this fails, as the value in CFG is committed by
the GDSC handshake and without a ticking parent the GDSC enablement will
time out.
This becomes a concrete problem if the runtime supended state of a
device includes disabling such rcg's parent clock. As the device
attempts to power up the domain again the rcg will fail to enable and
hence the GDSC enablement will fail, preventing the device from
returning from the suspended state.
This can be seen in e.g. the display stack during probe on SM8350.
To avoid this problem, the software needs to ensure that the RCG is
configured to a active parent clock while it is disabled. This is done
by caching the CFG register content while the shared RCG is parked on
this safe source.
Writes to M, N and D registers are committed as they are requested. New
helpers for get_parent() and recalc_rate() are extracted from their
previous implementations and __clk_rcg2_configure() is modified to allow
it to operate on the cached value.
Fixes: 7ef6f11887 ("clk: qcom: Configure the RCGs to a safe source as needed")
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Link: https://lore.kernel.org/r/20220426212136.1543984-1-bjorn.andersson@linaro.org
FAT supports creation time but not change time, and there was no
corresponding timestamp for creation time in previous VFS. The original
implementation took the compromise of saving the in-memory change time
into the on-disk creation time field, but this would lead to compatibility
issues with non-linux systems.
To address this issue, this patch changes the behavior of ctime. It will
no longer be loaded and stored from the creation time on disk. Instead of
that, it'll be consistent with the in-memory mtime and share the same
on-disk field. All updates to mtime will also be applied to ctime in
memory, while all updates to ctime will be ignored.
Link: https://lkml.kernel.org/r/20220503152536.2503003-2-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With gcc version 12.0.1 20220401 (Red Hat 12.0.1-0), building with
defconfig results in the following compilation error:
| CC mm/swapfile.o
| mm/swapfile.c: In function `setup_swap_info':
| mm/swapfile.c:2291:47: error: array subscript -1 is below array bounds
| of `struct plist_node[]' [-Werror=array-bounds]
| 2291 | p->avail_lists[i].prio = 1;
| | ~~~~~~~~~~~~~~^~~
| In file included from mm/swapfile.c:16:
| ./include/linux/swap.h:292:27: note: while referencing `avail_lists'
| 292 | struct plist_node avail_lists[]; /*
| | ^~~~~~~~~~~
This is due to the compiler detecting that the mask in
node_states[__state] could theoretically be zero, which would lead to
first_node() returning -1 through find_first_bit.
I believe that the warning/error is legitimate. I first tried adding a
test to check that the node mask is not emtpy, since a similar test exists
in the case where MAX_NUMNODES == 1.
However, adding the if statement causes other warnings to appear in
for_each_cpu_node_but, because it introduces a dangling else ambiguity.
And unfortunately, GCC is not smart enough to detect that the added test
makes the case where (node) == -1 impossible, so it still complains with
the same message.
This is why I settled on replacing that with a harmless, but relatively
useless (node) >= 0 test. Based on the warning for the dangling else, I
also decided to fix the case where MAX_NUMNODES == 1 by moving the
condition inside the for loop. It will still only be tested once. This
ensures that the meaning of an else following for_each_node_mask or
derivatives would not silently have a different meaning depending on the
configuration.
Link: https://lkml.kernel.org/r/20220414150855.2407137-3-dinechin@redhat.com
Signed-off-by: Christophe de Dinechin <christophe@dinechin.org>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ben Segall <bsegall@google.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, trace point mm_page_alloc_zone_locked() doesn't show correct
information.
First, when alloc_flag has ALLOC_HARDER/ALLOC_CMA, page can be allocated
from MIGRATE_HIGHATOMIC/MIGRATE_CMA. Nevertheless, tracepoint use
requested migration type not MIGRATE_HIGHATOMIC and MIGRATE_CMA.
Second, after commit 44042b4498 ("mm/page_alloc: allow high-order pages
to be stored on the per-cpu lists") percpu-list can store high order
pages. But trace point determine whether it is a refiil of percpu-list by
comparing requested order and 0.
To handle these problems, make mm_page_alloc_zone_locked() only be called
by __rmqueue_smallest with correct migration type. With a new argument
called percpu_refill, it can show roughly whether it is a refill of
percpu-list.
Link: https://lkml.kernel.org/r/20220512025307.57924-1-vvghjk1234@gmail.com
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Baik Song An <bsahn@etri.re.kr>
Cc: Hong Yeon Kim <kimhy@etri.re.kr>
Cc: Taeung Song <taeung@reallinux.co.kr>
Cc: <linuxgeek@linuxgeek.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended
under memory pressure if processes keep working on their vmas(e.g., fork,
mmap, munmap). It makes reclaim path stuck. In our real workload traces,
we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it
makes other processes entering direct reclaim, which were also stuck on
the lock.
This patch makes lru aging path try_lock mode like shink_page_list so the
reclaim context will keep working with next lru pages without being stuck.
if it found the rmap lock contended, it rotates the page back to head of
lru in both active/inactive lrus to make them consistent behavior, which
is basic starting point rather than adding more heristic.
Since this patch introduces a new "contended" field as out-param along
with try_lock in-param in rmap_walk_control, it's not immutable any longer
if the try_lock is set so remove const keywords on rmap related functions.
Since rmap walking is already expensive operation, I doubt the const
would help sizable benefit( And we didn't have it until 5.17).
In a heavy app workload in Android, trace shows following statistics. It
almost removes rmap lock contention from reclaim path.
Martin Liu reported:
Before:
max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read
601 0 601 145.544681 28817 198 rmap_walk_file
After:
max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function
NaN NaN NaN NaN NaN 0.0 NaN
0 0 0 0.127645 1 12 rmap_walk_file
[minchan@kernel.org: add comment, per Matthew]
Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Applications can currently escape their cgroup memory containment when
zswap is enabled. This patch adds per-cgroup tracking and limiting of
zswap backend memory to rectify this.
The existing cgroup2 memory.stat file is extended to show zswap statistics
analogous to what's in meminfo and vmstat. Furthermore, two new control
files, memory.zswap.current and memory.zswap.max, are added to allow
tuning zswap usage on a per-workload basis. This is important since not
all workloads benefit from zswap equally; some even suffer compared to
disk swap when memory contents don't compress well. The optimal size of
the zswap pool, and the threshold for writeback, also depends on the size
of the workload's warm set.
The implementation doesn't use a traditional page_counter transaction.
zswap is unconventional as a memory consumer in that we only know the
amount of memory to charge once expensive compression has occurred. If
zwap is disabled or the limit is already exceeded we obviously don't want
to compress page upon page only to reject them all. Instead, the limit is
checked against current usage, then we compress and charge. This allows
some limit overrun, but not enough to matter in practice.
[hannes@cmpxchg.org: fix for CONFIG_SLOB builds]
Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org
[hannes@cmpxchg.org: opt out of cgroups v1]
Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org
Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- CONFIG_ZRAM: Zram is a user-facing feature, whereas zsmalloc is
not. Don't make the user chase down a technical dependency like
that, just select it in automatically when zram is requested. The
CONFIG_CRYPTO dependency is redundant due to more specific deps.
- CONFIG_ZPOOL: This is not a user-facing feature. Hide the symbol and
have it selected in as needed.
- CONFIG_ZSWAP: Select CRYPTO instead of depend. Common pattern.
- Make the ZSWAP suboptions and their descriptions (compression,
allocation backend) a bit more straight-forward for the user.
Link: https://lkml.kernel.org/r/20220510152847.230957-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zswap: accounting & cgroup control", v2.
Zswap can consume nearly a quarter of RAM in the default configuration,
yet it's neither listed in /proc/meminfo, nor is it accounted and
manageable on a per-cgroup basis.
This makes reasoning about the memory situation on a host in general
rather difficult. On shared/cgrouped hosts, the consequences are worse.
First, workloads can escape memory containment and cause resource priority
inversions: a lo-pri group can fill the global zswap pool and force a
hi-pri group out to disk. Second, not all workloads benefit from zswap
equally. Some even suffer when memory contents compress poorly, and are
better off going to disk swap directly. On a host with mixed workloads,
it's currently not possible to enable zswap for one workload but not for
the other.
This series implements the missing global accounting as well as cgroup
tracking & control for zswap backing memory:
- Patch 1 refreshes the very out-of-date meminfo documentation in
Documentation/filesystems/proc.rst.
- Patches 2-4 clean up related and adjacent options in Kconfig. Not
actual dependencies, just things I noticed during development.
- Patch 5 adds meminfo and vmstat coverage for zswap consumption and
activity.
- Patch 6 implements per-cgroup tracking & control of zswap memory.
This patch (of 6):
Add new entries. Minor corrections and cleanups.
[hannes@cmpxchg.org: fix htmldocs warnings]
Link: https://lkml.kernel.org/r/Ynve8dg4zJyhH2gW@cmpxchg.org
[hannes@cmpxchg.org: change `Unevictable' wording, per David]
Link: https://lkml.kernel.org/r/YnwFraZlVWQoCjz3@cmpxchg.org
Link: https://lkml.kernel.org/r/20220510152847.230957-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20220510152847.230957-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The readonly FS THP relies on khugepaged to collapse THP for suitable
vmas. But the behavior is inconsistent for "always" mode
(https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/).
The "always" mode means THP allocation should be tried all the time and
khugepaged should try to collapse THP all the time. Of course the
allocation and collapse may fail due to other factors and conditions.
Currently file THP may not be collapsed by khugepaged even though all the
conditions are met. That does break the semantics of "always" mode.
So make sure readonly FS vmas are registered to khugepaged to fix the
break.
Register suitable vmas in common mmap path, that could cover both readonly
FS vmas and shmem vmas, so remove the khugepaged calls in shmem.c.
Still need to keep the khugepaged call in vma_merge() since vma_merge() is
called in a lot of places, for example, madvise, mprotect, etc.
Link: https://lkml.kernel.org/r/20220510203222.24246-9-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastmil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Song Liu <song@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>