This syscall can be used to set a home node for the MPOL_BIND and
MPOL_PREFERRED_MANY memory policy. Users should use this syscall after
setting up a memory policy for the specified range as shown below.
mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
new_nodes->size + 1, 0);
sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
home_node, 0);
The syscall allows specifying a home node/preferred node from which
kernel will fulfill memory allocation requests first.
For address range with MPOL_BIND memory policy, if nodemask specifies
more than one node, page allocations will come from the node in the
nodemask with sufficient free memory that is closest to the home
node/preferred node.
For MPOL_PREFERRED_MANY if the nodemask specifies more than one node,
page allocation will come from the node in the nodemask with sufficient
free memory that is closest to the home node/preferred node. If there
is not enough memory in all the nodes specified in the nodemask, the
allocation will be attempted from the closest numa node to the home node
in the system.
This helps applications to hint at a memory allocation preference node
and fallback to _only_ a set of nodes if the memory is not available on
the preferred node. Fallback allocation is attempted from the node
which is nearest to the preferred node.
This helps applications to have control on memory allocation numa nodes
and avoids default fallback to slow memory NUMA nodes. For example a
system with NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of
slow memory
new_nodes = numa_bitmask_alloc(nr_nodes);
numa_bitmask_setbit(new_nodes, 1);
numa_bitmask_setbit(new_nodes, 2);
numa_bitmask_setbit(new_nodes, 3);
p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0);
sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);
This will allocate from nodes closer to node 2 and will make sure the
kernel will only allocate from nodes 1, 2, and 3. Memory will not be
allocated from slow memory nodes 10, 11, and 12. This differs from
default MPOL_BIND behavior in that with default MPOL_BIND the allocation
will be attempted from node closer to the local node. One of the
reasons to specify a home node is to allow allocations from cpu less
NUMA node and its nearby NUMA nodes.
With MPOL_PREFERRED_MANY on the other hand will first try to allocate
from the closest node to node 2 from the node list 1, 2 and 3. If those
nodes don't have enough memory, kernel will allocate from slow memory
node 10, 11 and 12 which ever is closer to node 2.
Link: https://lkml.kernel.org/r/20211202123810.267175-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: add new syscall set_mempolicy_home_node", v6.
This patch (of 3):
A followup patch will enable setting a home node with
MPOL_PREFERRED_MANY memory policy. To facilitate that switch to using
policy_node helper. There is no functional change in this patch.
Link: https://lkml.kernel.org/r/20211202123810.267175-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20211202123810.267175-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In unset_migratetype_isolate(), we can bypass the call to
move_freepages_block() for non-buddy pages.
It will save a few cpu cycles for some situations such as cma and
hugetlb when allocating continue pages, in these situation function
alloc_contig_pages will be called.
alloc_contig_pages
__alloc_contig_migrate_range
isolate_freepages_range ==> pages has been remove from buddy
undo_isolate_page_range
unset_migratetype_isolate ==> can directly set migratetype
[osalvador@suse.de: changelog tweak]
Link: https://lkml.kernel.org/r/20211229033649.2760586-1-chenwandun@huawei.com
Fixes: 3c605096d3 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Wang Kefeng <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drop_slab_node is only used in drop_slab. So remove it's declaration
from header file and add keyword static for it's definition.
Link: https://lkml.kernel.org/r/20211111062445.5236-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are interfaces to adjust max_ptes_none, max_ptes_swap,
max_ptes_shared values, see
/sys/kernel/mm/transparent_hugepage/khugepaged/.
But system administrator may not know which value is the best. So Add
those events to support adjusting max_ptes_* to suitable values.
For example, if default max_ptes_swap value causes too much failures,
and system uses zram whose IO is fast, administrator could increase
max_ptes_swap until THP_SCAN_EXCEED_SWAP_PTE not increase anymore.
Link: https://lkml.kernel.org/r/20211225094036.574157-1-yang.yang29@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Saravanan D <saravanand@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For hugetlb backed jobs/VMs it's critical to understand the numa
information for the memory backing these jobs to deliver optimal
performance.
Currently this technically can be queried from /proc/self/numa_maps, but
there are significant issues with that. Namely:
1. Memory can be mapped or unmapped.
2. numa_maps are per process and need to be aggregated across all
processes in the cgroup. For shared memory this is more involved as
the userspace needs to make sure it doesn't double count shared
mappings.
3. I believe querying numa_maps needs to hold the mmap_lock which adds
to the contention on this lock.
For these reasons I propose simply adding hugetlb.*.numa_stat file,
which shows the numa information of the cgroup similarly to
memory.numa_stat.
On cgroup-v2:
cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
total=2097152 N0=2097152 N1=0
On cgroup-v1:
cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
total=2097152 N0=2097152 N1=0
hierarichal_total=2097152 N0=2097152 N1=0
This patch was tested manually by allocating hugetlb memory and querying
the hugetlb.*.numa_stat file of the cgroup and its parents.
[colin.i.king@googlemail.com: fix spelling mistake "hierarichal" -> "hierarchical"]
Link: https://lkml.kernel.org/r/20211125090635.23508-1-colin.i.king@gmail.com
[keescook@chromium.org: fix copy/paste array assignment]
Link: https://lkml.kernel.org/r/20211203065647.2819707-1-keescook@chromium.org
Link: https://lkml.kernel.org/r/20211123001020.4083653-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jue Wang <juew@google.com>
Cc: Yang Yao <ygyao@google.com>
Cc: Joanna Li <joannali@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In kdump kernel of x86_64, page allocation failure is observed:
kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
Workqueue: events_unbound async_run_entry_fn
Call Trace:
<TASK>
dump_stack_lvl+0x48/0x5e
warn_alloc.cold+0x72/0xd6
__alloc_pages_slowpath.constprop.0+0xc69/0xcd0
__alloc_pages+0x1df/0x210
new_slab+0x389/0x4d0
___slab_alloc+0x58f/0x770
__slab_alloc.constprop.0+0x4a/0x80
kmem_cache_alloc_trace+0x24b/0x2c0
sr_probe+0x1db/0x620
......
device_add+0x405/0x920
......
__scsi_add_device+0xe5/0x100
ata_scsi_scan_host+0x97/0x1d0
async_run_entry_fn+0x30/0x130
process_one_work+0x1e8/0x3c0
worker_thread+0x50/0x3b0
? rescuer_thread+0x350/0x350
kthread+0x16b/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Mem-Info:
......
The above failure happened when calling kmalloc() to allocate buffer with
GFP_DMA. It requests to allocate slab page from DMA zone while no managed
pages at all in there.
sr_probe()
--> get_capabilities()
--> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
Because in the current kernel, dma-kmalloc will be created as long as
CONFIG_ZONE_DMA is enabled. However, kdump kernel of x86_64 doesn't have
managed pages on DMA zone since commit 6f599d8423 ("x86/kdump: Always
reserve the low 1M when the crashkernel option is specified"). The
failure can be always reproduced.
For now, let's mute the warning of allocation failure if requesting pages
from DMA zone while no managed pages.
[akpm@linux-foundation.org: fix warning]
Link: https://lkml.kernel.org/r/20211223094435.248523-4-bhe@redhat.com
Fixes: 6f599d8423 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Handle warning of allocation failure on DMA zone w/o
managed pages", v4.
**Problem observed:
On x86_64, when crash is triggered and entering into kdump kernel, page
allocation failure can always be seen.
---------------------------------
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 1 Comm: swapper/0
Call Trace:
dump_stack+0x7f/0xa1
warn_alloc.cold+0x72/0xd6
......
__alloc_pages+0x24d/0x2c0
......
dma_atomic_pool_init+0xdb/0x176
do_one_initcall+0x67/0x320
? rcu_read_lock_sched_held+0x3f/0x80
kernel_init_freeable+0x290/0x2dc
? rest_init+0x24f/0x24f
kernel_init+0xa/0x111
ret_from_fork+0x22/0x30
Mem-Info:
------------------------------------
***Root cause:
In the current kernel, it assumes that DMA zone must have managed pages
and try to request pages if CONFIG_ZONE_DMA is enabled. While this is not
always true. E.g in kdump kernel of x86_64, only low 1M is presented and
locked down at very early stage of boot, so that this low 1M won't be
added into buddy allocator to become managed pages of DMA zone. This
exception will always cause page allocation failure if page is requested
from DMA zone.
***Investigation:
This failure happens since below commit merged into linus's tree.
1a6a9044b9 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
23721c8e92 x86/crash: Remove crash_reserve_low_1M()
f1d4d47c58 x86/setup: Always reserve the first 1M of RAM
7c321eb2b8 x86/kdump: Remove the backup region handling
6f599d8423 x86/kdump: Always reserve the low 1M when the crashkernel option is specified
Before them, on x86_64, the low 640K area will be reused by kdump kernel.
So in kdump kernel, the content of low 640K area is copied into a backup
region for dumping before jumping into kdump. Then except of those firmware
reserved region in [0, 640K], the left area will be added into buddy
allocator to become available managed pages of DMA zone.
However, after above commits applied, in kdump kernel of x86_64, the low
1M is reserved by memblock, but not released to buddy allocator. So any
later page allocation requested from DMA zone will fail.
At the beginning, if crashkernel is reserved, the low 1M need be locked
down because AMD SME encrypts memory making the old backup region
mechanims impossible when switching into kdump kernel.
Later, it was also observed that there are BIOSes corrupting memory
under 1M. To solve this, in commit f1d4d47c58, the entire region of
low 1M is always reserved after the real mode trampoline is allocated.
Besides, recently, Intel engineer mentioned their TDX (Trusted domain
extensions) which is under development in kernel also needs to lock down
the low 1M. So we can't simply revert above commits to fix the page allocation
failure from DMA zone as someone suggested.
***Solution:
Currently, only DMA atomic pool and dma-kmalloc will initialize and
request page allocation with GFP_DMA during bootup.
So only initializ DMA atomic pool when DMA zone has available managed
pages, otherwise just skip the initialization.
For dma-kmalloc(), for the time being, let's mute the warning of
allocation failure if requesting pages from DMA zone while no manged
pages. Meanwhile, change code to use dma_alloc_xx/dma_map_xx API to
replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling kmalloc() if
not necessary. Christoph is posting patches to fix those under
drivers/scsi/. Finally, we can remove the need of dma-kmalloc() as people
suggested.
This patch (of 3):
In some places of the current kernel, it assumes that dma zone must have
managed pages if CONFIG_ZONE_DMA is enabled. While this is not always
true. E.g in kdump kernel of x86_64, only low 1M is presented and locked
down at very early stage of boot, so that there's no managed pages at all
in DMA zone. This exception will always cause page allocation failure if
page is requested from DMA zone.
Here add function has_managed_dma() and the relevant helper functions to
check if there's DMA zone with managed pages. It will be used in later
patches.
Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
Fixes: 6f599d8423 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: John Donnelly <john.p.donnelly@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clarify that the alloc_contig_pages() allocated range will always be
aligned to the requested nr_pages.
Link: https://lkml.kernel.org/r/1639545478-12160-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_pages_vma is meant to allocate a page with a vma specific memory
policy. The initial node parameter is always a local node so it is
pointless to waste a function argument for this. Drop the parameter.
Link: https://lkml.kernel.org/r/YaSnlv4QpryEpesG@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arthur Marsh reported we would hit the error below when building kernel
with gcc-12:
CC mm/page_alloc.o
mm/page_alloc.c: In function `mem_init_print_info':
mm/page_alloc.c:8173:27: error: comparison between two arrays [-Werror=array-compare]
8173 | if (start <= pos && pos < end && size > adj) \
|
In C++20, the comparision between arrays should be warned.
Link: https://lkml.kernel.org/r/20211125130928.32465-1-sxwjean@me.com
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sl?b and vmalloc allocators reduce the given gfp mask for their internal
needs. For that they use GFP_RECLAIM_MASK to preserve the reclaim
behavior and constrains.
__GFP_NOLOCKDEP is not a part of that mask because it doesn't really
control the reclaim behavior strictly speaking. On the other hand it
tells the underlying page allocator to disable reclaim recursion
detection so arguably it should be part of the mask.
Having __GFP_NOLOCKDEP in the mask will not alter the behavior in any
form so this change is safe pretty much by definition. It also adds a
support for this flag to SL?B and vmalloc allocators which will in turn
allow its use to kvmalloc as well. A lack of the support has been
noticed recently in
http://lkml.kernel.org/r/20211119225435.GZ449541@dread.disaster.area
Link: https://lkml.kernel.org/r/YZ9XtLY4AEjVuiEI@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by
previous patches so we can allow the support for kvmalloc. This will
allow some external users to simplify or completely remove their
helpers.
GFP_NOWAIT semantic hasn't been supported so far but it hasn't been
explicitly documented so let's add a note about that.
ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc.
Link: https://lkml.kernel.org/r/20211122153233.9924-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b7d90e7a5e ("mm/vmalloc: be more explicit about supported gfp
flags") has been merged prematurely without the rest of the series and
without addressed review feedback from Neil. Fix that up now. Only
wording is changed slightly.
Link: https://lkml.kernel.org/r/20211122153233.9924-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.
The large part of the vmalloc implementation already complies with the
given gfp flags so there is no work for those to be done. The area and
page table allocations are an exception to that. Implement a retry loop
for those.
Add a short sleep before retrying. 1 jiffy is a completely random
timeout. Ideally the retry would wait for an explicit event - e.g. a
change to the vmalloc space change if the failure was caused by the
space fragmentation or depletion. But there are multiple different
reasons to retry and this could become much more complex. Keep the
retry simple for now and just sleep to prevent from hogging CPUs.
Link: https://lkml.kernel.org/r/20211122153233.9924-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "extend vmalloc support for constrained allocations", v2.
Based on a recent discussion with Dave and Neil [1] I have tried to
implement NOFS, NOIO, NOFAIL support for the vmalloc to make life of
kvmalloc users easier.
A requirement for NOFAIL support for kvmalloc was new to me but this
seems to be really needed by the xfs code.
NOFS/NOIO was a known and a long term problem which was hoped to be
handled by the scope API. Those scope should have been used at the
reclaim recursion boundaries both to document them and also to remove
the necessity of NOFS/NOIO constrains for all allocations within that
scope. Instead workarounds were developed to wrap a single allocation
instead (like ceph_kvmalloc).
First patch implements NOFS/NOIO support for vmalloc. The second one
adds NOFAIL support and the third one bundles all together into kvmalloc
and drops ceph_kvmalloc which can use kvmalloc directly now.
[1] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown
This patch (of 4):
vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
page table allocations do not support externally provided gfp mask and
performed GFP_KERNEL like allocations.
Since few years we have scope (memalloc_no{fs,io}_{save,restore}) APIs
to enforce NOFS and NOIO constrains implicitly to all allocators within
the scope. There was a hope that those scopes would be defined on a
higher level when the reclaim recursion boundary starts/stops (e.g.
when a lock required during the memory reclaim is required etc.). It
seems that not all NOFS/NOIO users have adopted this approach and
instead they have taken a workaround approach to wrap a single
[k]vmalloc allocation by a scope API.
These workarounds do not serve the purpose of a better reclaim recursion
documentation and reduction of explicit GFP_NO{FS,IO} usege so let's
just provide them with the semantic they are asking for without a need
for workarounds.
Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal
allocations already comply with the given gfp_mask. The only current
exception is vmap_pages_range which maps kernel page tables. Infer the
proper scope API based on the given gfp mask.
[sfr@canb.auug.org.au: mm/vmalloc.c needs linux/sched/mm.h]
Link: https://lkml.kernel.org/r/20211217232641.0148710c@canb.auug.org.au
Link: https://lkml.kernel.org/r/20211122153233.9924-1-mhocko@kernel.org
Link: https://lkml.kernel.org/r/20211122153233.9924-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 2618c60b8b ("dma: make dma pool to use
kmalloc_node").
While working myself into the dmapool code I've found this little odd
kmalloc_node().
What basically happens here is that we allocate the housekeeping
structure on the numa node where the device is attached to. Since the
device is never doing DMA to or from that memory this doesn't seem to
make sense at all.
So while this doesn't seem to cause much harm it's probably cleaner to
revert the change for consistency.
Link: https://lkml.kernel.org/r/20211221110724.97664-1-christian.koenig@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
Cc: Yinghai Lu <yinghai.lu@sun.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All callers pass NULL, so we can stop calculating the value we would
store in it.
Link: https://lkml.kernel.org/r/20211220205943.456187-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we don't report it to the caller of reuse_swap_page(), we don't
need to request it from page_trans_huge_map_swapcount().
Link: https://lkml.kernel.org/r/20211220205943.456187-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
None of the callers care about the total_map_swapcount() any more.
Link: https://lkml.kernel.org/r/20211220205943.456187-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check user page table entries at the time they are added and removed.
Allows to synchronously catch memory corruption issues related to double
mapping.
When a pte for an anonymous page is added into page table, we verify
that this pte does not already point to a file backed page, and vice
versa if this is a file backed page that is being added we verify that
this page does not have an anonymous mapping
We also enforce that read-only sharing for anonymous pages is allowed
(i.e. cow after fork). All other sharing must be for file pages.
Page table check allows to protect and debug cases where "struct page"
metadata became corrupted for some reason. For example, when refcnt or
mapcount become invalid.
Link: https://lkml.kernel.org/r/20211221154650.1047963-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have ptep_get_and_clear() and ptep_get_and_clear_full() helpers to
clear PTE from user page tables, but there is no variant for simple
clear of a present PTE from user page tables without using a low level
pte_clear() which can be either native or para-virtualised.
Add a new ptep_clear() that can be used in common code to clear PTEs
from page table. We will need this call later in order to add a hook
for page table check.
Link: https://lkml.kernel.org/r/20211221154650.1047963-3-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "page table check", v3.
Ensure that some memory corruptions are prevented by checking at the
time of insertion of entries into user page tables that there is no
illegal sharing.
We have recently found a problem [1] that existed in kernel since 4.14.
The problem was caused by broken page ref count and led to memory
leaking from one process into another. The problem was accidentally
detected by studying a dump of one process and noticing that one page
contains memory that should not belong to this process.
There are some other page->_refcount related problems that were recently
fixed: [2], [3] which potentially could also lead to illegal sharing.
In addition to hardening refcount [4] itself, this work is an attempt to
prevent this class of memory corruption issues.
It uses a simple state machine that is independent from regular MM logic
to check for illegal sharing at time pages are inserted and removed from
page tables.
[1] https://lore.kernel.org/all/xr9335nxwc5y.fsf@gthelen2.svl.corp.google.com
[2] https://lore.kernel.org/all/1582661774-30925-2-git-send-email-akaher@vmware.com
[3] https://lore.kernel.org/all/20210622021423.154662-3-mike.kravetz@oracle.com
[4] https://lore.kernel.org/all/20211221150140.988298-1-pasha.tatashin@soleen.com
This patch (of 4):
There are a few places where we first update the entry in the user page
table, and later change the struct page to indicate that this is
anonymous or file page.
In most places, however, we first configure the page metadata and then
insert entries into the page table. Page table check, will use the
information from struct page to verify the type of entry is inserted.
Change the order in all places to first update struct page, and later to
update page table.
This means that we first do calls that may change the type of page (anon
or file):
page_move_anon_rmap
page_add_anon_rmap
do_page_add_anon_rmap
page_add_new_anon_rmap
page_add_file_rmap
hugepage_add_anon_rmap
hugepage_add_new_anon_rmap
And after that do calls that add entries to the page table:
set_huge_pte_at
set_pte_at
Link: https://lkml.kernel.org/r/20211221154650.1047963-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20211221154650.1047963-2-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Turner <pjt@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With exit_mmap holding mmap_write_lock during free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destrying pagetables while __oom_reap_task_mm is
walking the VMA tree. The change prevents process_mrelease from calling
the last mmput, which can lead to waiting for IO completion in exit_aio.
Link: https://lkml.kernel.org/r/20211209191325.3069345-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian@brauner.io>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom-reaper and process_mrelease system call should protect against races
with exit_mmap which can destroy page tables while they walk the VMA
tree. oom-reaper protects from that race by setting MMF_OOM_VICTIM and
by relying on exit_mmap to set MMF_OOM_SKIP before taking and releasing
mmap_write_lock. process_mrelease has to elevate mm->mm_users to
prevent such race.
Both oom-reaper and process_mrelease hold mmap_read_lock when walking
the VMA tree. The locking rules and mechanisms could be simpler if
exit_mmap takes mmap_write_lock while executing destructive operations
such as free_pgtables.
Change exit_mmap to hold the mmap_write_lock when calling unlock_range,
free_pgtables and remove_vma. Note also that because oom-reaper checks
VM_LOCKED flag, unlock_range() should not be allowed to race with it.
Before this patch, remove_vma used to be called with no locks held,
however with fput being executed asynchronously and vm_ops->close not
being allowed to hold mmap_lock (it is called from __split_vma with
mmap_sem held for write), changing that should be fine.
In most cases this lock should be uncontended. Previously, Kirill
reported ~4% regression caused by a similar change [1]. We reran the
same test and although the individual results are quite noisy, the
percentiles show lower regression with 1.6% being the worst case [2].
The change allows oom-reaper and process_mrelease to execute safely
under mmap_read_lock without worries that exit_mmap might destroy page
tables from under them.
[1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20211209191325.3069345-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Tim Murray <timmurray@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
linux/mm_types.h should only define structure definitions, to make it
cheap to include elsewhere. The atomic_t helper function definitions
are particularly large, so it's better to move the helpers using those
into the existing linux/mm_inline.h and only include that where needed.
As a follow-up, we may want to go through all the indirect includes in
mm_types.h and reduce them as much as possible.
Link: https://lkml.kernel.org/r/20211207125710.2503446-2-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Colin Cross <ccross@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch to add anonymous vma names causes a build failure in some
configurations:
include/linux/mm_types.h: In function 'is_same_vma_anon_name':
include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
924 | return name && vma_name && !strcmp(name, vma_name);
| ^~~~~~
include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
This should not really be part of linux/mm_types.h in the first place,
as that header is meant to only contain structure defintions and need a
minimum set of indirect includes itself.
While the header clearly includes more than it should at this point,
let's not make it worse by including string.h as well, which would pull
in the expensive (compile-speed wise) fortify-string logic.
Move the new functions into a separate header that only needs to be
included in a couple of locations.
Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
Fixes: "mm: add a field to store names for private anonymous memory"
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@google.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While forking a process with high number (64K) of named anonymous vmas
the overhead caused by strdup() is noticeable. Experiments with ARM64
Android device show up to 40% performance regression when forking a
process with 64k unpopulated anonymous vmas using the max name lengths
vs the same process with the same number of anonymous vmas having no
name.
Introduce anon_vma_name refcounted structure to avoid the overhead of
copying vma names during fork() and when splitting named anonymous vmas.
When a vma is duplicated, instead of copying the name we increment the
refcount of this structure. Multiple vmas can point to the same
anon_vma_name as long as they increment the refcount. The name member
of anon_vma_name structure is assigned at structure allocation time and
is never changed. If vma name changes then the refcount of the original
structure is dropped, a new anon_vma_name structure is allocated to hold
the new name and the vma pointer is updated to point to the new
structure.
With this approach the fork() performance regressions is reduced 3-4x
times and with usecases using more reasonable number of VMAs (a few
thousand) the regressions is not measurable.
Link: https://lkml.kernel.org/r/20211019215511.3771969-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: rearrange madvise code to allow for reuse", v11.
Avoid performance regression of the new anon vma name field refcounting it.
I checked the image sizes with allnoconfig builds:
unpatched Linus' ToT
text data bss dec hex filename
1324759 32 73928 1398719 1557bf vmlinux
After the first patch is applied (madvise refactoring)
text data bss dec hex filename
1322346 32 73928 1396306 154e52 vmlinux
>>> 2413 bytes decrease vs ToT <<<
After all patches applied with CONFIG_ANON_VMA_NAME=n
text data bss dec hex filename
1322337 32 73928 1396297 154e49 vmlinux
>>> 2422 bytes decrease vs ToT <<<
After all patches applied with CONFIG_ANON_VMA_NAME=y
text data bss dec hex filename
1325228 32 73928 1399188 155994 vmlinux
>>> 469 bytes increase vs ToT <<<
This patch (of 3):
Refactor the madvise syscall to allow for parts of it to be reused by a
prctl syscall that affects vmas.
Move the code that walks vmas in a virtual address range into a function
that takes a function pointer as a parameter. The only caller for now
is sys_madvise, which uses it to call madvise_vma_behavior on each vma,
but the next patch will add an additional caller.
Move handling all vma behaviors inside madvise_behavior, and rename it
to madvise_vma_behavior.
Move the code that updates the flags on a vma, including splitting or
merging the vma as necessary, into a new function called
madvise_update_vma. The next patch will add support for updating a new
anon_name field as well.
Link: https://lkml.kernel.org/r/20211019215511.3771969-1-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Rob Landley <rob@landley.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kvmalloc* allocation functions can fallback to vmalloc allocations
and more often on long running machines. In addition the kernel does
have __GFP_ACCOUNT kvmalloc* calls. So, often on long running machines,
the memory.stat does not tell the complete picture which type of memory
is charged to the memcg. So add a per-memcg vmalloc stat.
[shakeelb@google.com: page_memcg() within rcu lock, per Muchun]
Link: https://lkml.kernel.org/r/20211222052457.1960701-1-shakeelb@google.com
[akpm@linux-foundation.org: remove cast, per Muchun]
[shakeelb@google.com: remove area->page[0] checks and move to page by page accounting per Michal]
Link: https://lkml.kernel.org/r/20220104222341.3972772-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20211221215336.1922823-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows that,
in the worst scenario, could lead to heap overflows.
Link: https://github.com/KSPP/linux/issues/160
Link: https://lkml.kernel.org/r/20211216022024.127375-1-wangweiyang2@huawei.com
Signed-off-by: Wang Weiyang <wangweiyang2@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 11192d9c12 ("memcg: flush stats only if updated") added
tracking of memcg stats updates which is used by the readers to flush
only if the updates are over a certain threshold. However each
individual update can correspond to a large value change for a given
stat. For example adding or removing a hugepage to an LRU changes the
stat by thp_nr_pages (512 on x86_64).
Treating the update related to THP as one can keep the stat off, in
theory, by (thp_nr_pages * nr_cpus * CHARGE_BATCH) before flush.
To handle such scenarios, this patch adds consideration of the stat
update value as well instead of just the update event. In addition let
the asyn flusher unconditionally flush the stats to put time limit on
the stats skew and hopefully a lot less readers would need to flush.
Link: https://lkml.kernel.org/r/20211118065350.697046-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Our container agent wants to know when a container exits if it was OOM
killed or not to report to the user. We use memory.oom.group = 1 to
ensure that OOM kills within the container's cgroup kill everything.
Existing memory.events are insufficient for knowing if this triggered:
1) Our current approach reads memory.events oom_kill and reports the
container was killed if the value is non-zero. This is erroneous in
some cases where containers create their children cgroups with
memory.oom.group=1 as such OOM kills will get counted against the
container cgroup's oom_kill counter despite not actually OOM killing
the entire container.
2) Reading memory.events.local will fail to identify OOM kills in leaf
cgroups (that don't set memory.oom.group) within the container
cgroup.
This patch adds a new oom_group_kill event when memory.oom.group
triggers to allow userspace to cleanly identify when an entire cgroup is
oom killed.
[schatzberg.dan@gmail.com: changes from Johannes and Chris]
Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com
Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
propagate_protected_usage() is called to propagate the usage change in
the page_counter structure. But there is a call to this function from
page_counter_try_charge() when there is actually no usage change. Hence
this call should be removed.
Link: https://lkml.kernel.org/r/20211118181125.3918222-1-dqiao@redhat.com
Signed-off-by: Donghai Qiao <dqiao@redhat.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 494c1dfe85 ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches") makes cgroup_memory_nokmem global, however, it is unnecessary
because there is already a function mem_cgroup_kmem_disabled() which
exports it.
Just make it static and replace it with mem_cgroup_kmem_disabled() in
mm/slab_common.c.
Link: https://lkml.kernel.org/r/20211109065418.21693-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The 'a' and 'b' bitmaps are local to this function, so no concurrent
access can occur. So the non-atomic '__set_bit()' can be used to save a
few cycles.
Link: https://lkml.kernel.org/r/e52476da5cee57151745c5c3c934a69798dc6fa4.1638132190.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a data race in commit 779750d20b ("shmem: split huge pages beyond
i_size under memory pressure").
Here are call traces causing race:
Call Trace 1:
shmem_unused_huge_shrink+0x3ae/0x410
? __list_lru_walk_one.isra.5+0x33/0x160
super_cache_scan+0x17c/0x190
shrink_slab.part.55+0x1ef/0x3f0
shrink_node+0x10e/0x330
kswapd+0x380/0x740
kthread+0xfc/0x130
? mem_cgroup_shrink_node+0x170/0x170
? kthread_create_on_node+0x70/0x70
ret_from_fork+0x1f/0x30
Call Trace 2:
shmem_evict_inode+0xd8/0x190
evict+0xbe/0x1c0
do_unlinkat+0x137/0x330
do_syscall_64+0x76/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
A simple explanation:
Image there are 3 items in the local list (@list). In the first
traversal, A is not deleted from @list.
1) A->B->C
^
|
pos (leave)
In the second traversal, B is deleted from @list. Concurrently, A is
deleted from @list through shmem_evict_inode() since last reference
counter of inode is dropped by other thread. Then the @list is corrupted.
2) A->B->C
^ ^
| |
evict pos (drop)
We should make sure the inode is either on the global list or deleted from
any local list before iput().
Fixed by moving inodes back to global list before we put them.
[akpm@linux-foundation.org: coding style fixes]
Link: https://lkml.kernel.org/r/20211125064502.99983-1-ligang.bdlg@bytedance.com
Fixes: 779750d20b ("shmem: split huge pages beyond i_size under memory pressure")
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current behavior of memory failure is to truncate the page cache
regardless of dirty or clean. If the page is dirty the later access
will get the obsolete data from disk without any notification to the
users. This may cause silent data loss. It is even worse for shmem
since shmem is in-memory filesystem, truncating page cache means
discarding data blocks. The later read would return all zero.
The right approach is to keep the corrupted page in page cache, any
later access would return error for syscalls or SIGBUS for page fault,
until the file is truncated, hole punched or removed. The regular
storage backed filesystems would be more complicated so this patch is
focused on shmem. This also unblock the support for soft offlining
shmem THP.
[akpm@linux-foundation.org: coding style fixes]
[arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org
[Fix invalid pointer dereference in shmem_read_mapping_page_gfp() with a
slight different implementation from what Ajay Garg <ajaygargnsit@gmail.com>
and Muchun Song <songmuchun@bytedance.com> proposed and reworked the
error handling of shmem_write_begin() suggested by Linus]
Link: https://lore.kernel.org/linux-mm/20211111084617.6746-1-ajaygargnsit@gmail.com/
Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20211116193247.21102-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ajay Garg <ajaygargnsit@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Andy Lavr <andy.lavr@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When BUG_ON check for THP migration entry, the existing code only check
thp_migration_supported case, but not for !thp_migration_supported case.
If !thp_migration_supported() and !pmd_present(), the original code may
dead loop in theory. To make the BUG_ON check consistent, we need catch
both cases.
Move the BUG_ON check one step earlier, because if the bug happen we
should know it instead of depend on FOLL_MIGRATION been used by caller.
Because pmdval instead of *pmd is read by the is_pmd_migration_entry()
check, the existing code don't help to avoid useless locking within
pmd_migration_entry_wait(), so remove that check.
Link: https://lkml.kernel.org/r/20211217062559.737063-1-lixinhai.lxh@gmail.com
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fault_in_readable() and fault_in_writeable() perform __get_user() and
__put_user() in a loop, implying multiple user access locking/unlocking.
To avoid that, use user access blocks.
Link: https://lkml.kernel.org/r/720dcf79314acca1a78fae56d478cc851952149d.1637084492.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Return value directly instead of taking this in another redundant
variable.
Link: https://lkml.kernel.org/r/20211207083222.401594-1-chi.minghao@zte.com.cn
Signed-off-by: chiminghao <chi.minghao@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cm>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_mapping() is a big chunk of dump_page(), and it'd be handy to be
able to call it when we don't have a struct page. Split it out and move
it to fs/inode.c. Take the opportunity to simplify some of the debug
messages a little.
Link: https://lkml.kernel.org/r/20211121121056.2870061-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KASAN's quarantine might save its metadata inside freed objects. As
this happens after the memory is zeroed by the slab allocator when
init_on_free is enabled, the memory coming out of quarantine is not
properly zeroed.
This causes lib/test_meminit.c tests to fail with Generic KASAN.
Zero the metadata when the object is removed from quarantine.
Link: https://lkml.kernel.org/r/2805da5df4b57138fdacd671f5d227d58950ba54.1640037083.git.andreyknvl@google.com
Fixes: 6471384af2 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because mm/slab_common.c is not instrumented with software KASAN modes,
it is not possible to detect use-after-free of the kmem_cache passed
into kmem_cache_destroy(). In particular, because of the s->refcount--
and subsequent early return if non-zero, KASAN would never be able to
see the double-free via kmem_cache_free(kmem_cache, s). To be able to
detect a double-kmem_cache_destroy(), check accessibility of the
kmem_cache, and in case of failure return early.
While KASAN_HW_TAGS is able to detect such bugs, by checking
accessibility and returning early we fail more gracefully and also avoid
corrupting reused objects (where tags mismatch).
A recent case of a double-kmem_cache_destroy() was detected by KFENCE:
https://lkml.kernel.org/r/0000000000003f654905c168b09d@google.com, which
was not detectable by software KASAN modes.
Link: https://lkml.kernel.org/r/20211119142219.1519617-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new @vmemmap_shift property for struct dev_pagemap which specifies
that a devmap is composed of a set of compound pages of order
@vmemmap_shift, instead of base pages. When a compound page devmap is
requested, all but the first page are initialised as tail pages instead
of order-0 pages.
For certain ZONE_DEVICE users like device-dax which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
treating it the same way as THP or hugetlb pages.
Additionally, commit 7118fc2906 ("hugetlb: address ref count racing in
prep_compound_gigantic_page") removed set_page_count() because the
setting of page ref count to zero was redundant. devmap pages don't
come from page allocator though and only head page refcount is used for
compound pages, hence initialize tail page count to zero.
Link: https://lkml.kernel.org/r/20211202204422.26777-5-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move struct page init to an helper function __init_zone_device_page().
This is in preparation for sharing the storage for compound page
metadata.
Link: https://lkml.kernel.org/r/20211202204422.26777-4-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm, device-dax: Introduce compound pages in devmap", v7.
This series converts device-dax to use compound pages, and moves away
from the 'struct page per basepage on PMD/PUD' that is done today.
Doing so
1) unlocks a few noticeable improvements on unpin_user_pages() and
makes device-dax+altmap case 4x times faster in pinning (numbers
below and in last patch)
2) as mentioned in various other threads it's one important step
towards cleaning up ZONE_DEVICE refcounting.
I've split the compound pages on devmap part from the rest based on
recent discussions on devmap pending and future work planned[5][6].
There is consensus that device-dax should be using compound pages to
represent its PMD/PUDs just like HugeTLB and THP, and that leads to less
specialization of the dax parts. I will pursue the rest of the work in
parallel once this part is merged, particular the GUP-{slow,fast}
improvements [7] and the tail struct page deduplication memory savings
part[8].
To summarize what the series does:
Patch 1: Prepare hwpoisoning to work with dax compound pages.
Patches 2-3: Split the current utility function of prep_compound_page()
into head and tail and use those two helpers where appropriate to take
advantage of caches being warm after __init_single_page(). This is used
when initializing zone device when we bring up device-dax namespaces.
Patches 4-10: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initialize its metadata as compound pages, and
it introduces a new devmap property known as vmemmap_shift which
outlines how the vmemmap is structured (defaults to base pages as done
today). The property describe the page order of the metadata
essentially. While at it do a few cleanups in device-dax in patches
5-9. Finally enable device-dax usage of devmap @vmemmap_shift to a
value based on its own @align property. @vmemmap_shift returns 0 by
default (which is today's case of base pages in devmap, like fsdax or
the others) and the usage of compound devmap is optional. Starting with
device-dax (*not* fsdax) we enable it by default. There are a few
pinning improvements particular on the unpinning case and altmap, as
well as unpin_user_page_range_dirty_lock() being just as effective as
THP/hugetlb[0] pages.
$ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
[altmap]
(pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
$ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
[altmap with -m 127004]
(pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
Tested on x86 with 1Tb+ of pmem (alongside registering it with RDMA with
and without altmap), alongside gup_test selftests with dynamic dax
regions and static dax regions. Coupled with ndctl unit tests for
dynamic dax devices that exercise all of this. Note, for dynamic dax
regions I had to revert commit 8aa83e6395 ("x86/setup: Call
early_reserve_memory() earlier"), it is a known issue that this commit
broke efi_fake_mem=.
This patch (of 11):
Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.
This is in preparation for sharing the storage for compound page
metadata.
Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN
enabled(without KASAN_VMALLOC) on x86[1].
When the module area allocates memory, it's kmemleak_object is created
successfully, but the KASAN shadow memory of module allocation is not
ready, so when kmemleak scan the module's pointer, it will panic due to
no shadow memory with KASAN check.
module_alloc
__vmalloc_node_range
kmemleak_vmalloc
kmemleak_scan
update_checksum
kasan_module_alloc
kmemleak_ignore
Note, there is no problem if KASAN_VMALLOC enabled, the modules area
entire shadow memory is preallocated. Thus, the bug only exits on ARCH
which supports dynamic allocation of module area per module load, for
now, only x86/arm64/s390 are involved.
Add a VM_DEFER_KMEMLEAK flags, defer vmalloc'ed object register of
kmemleak in module_alloc() to fix this issue.
[1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/
[wangkefeng.wang@huawei.com: fix build]
Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com
[akpm@linux-foundation.org: simplify ifdefs, per Andrey]
Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com
Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com
Fixes: 793213a82d ("s390/kasan: dynamic shadow mem allocation for modules")
Fixes: 39d114ddc6 ("arm64: add KASAN support")
Fixes: bebf56a1b1 ("kasan: enable instrumentation of global variables")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With HW tag-based kasan enable, We will get the warning when we free
object whose address starts with 0xFF.
It is because kmemleak rbtree stores tagged object and this freeing
object's tag does not match with rbtree object.
In the example below, kmemleak rbtree stores the tagged object in the
kmalloc(), and kfree() gets the pointer with 0xFF tag.
Call sequence:
ptr = kmalloc(size, GFP_KERNEL);
page = virt_to_page(ptr);
offset = offset_in_page(ptr);
kfree(page_address(page) + offset);
ptr = kmalloc(size, GFP_KERNEL);
A sequence like that may cause the warning as following:
1) Freeing unknown object:
In kfree(), we will get free unknown object warning in
kmemleak_free(). Because object(0xFx) in kmemleak rbtree and
pointer(0xFF) in kfree() have different tag.
2) Overlap existing:
When we allocate that object with the same hw-tag again, we will
find the overlap in the kmemleak rbtree and kmemleak thread will be
killed.
kmemleak: Freeing unknown object at 0xffff000003f88000
CPU: 5 PID: 177 Comm: cat Not tainted 5.16.0-rc1-dirty #21
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x1ac
show_stack+0x1c/0x30
dump_stack_lvl+0x68/0x84
dump_stack+0x1c/0x38
kmemleak_free+0x6c/0x70
slab_free_freelist_hook+0x104/0x200
kmem_cache_free+0xa8/0x3d4
test_version_show+0x270/0x3a0
module_attr_show+0x28/0x40
sysfs_kf_seq_show+0xb0/0x130
kernfs_seq_show+0x30/0x40
seq_read_iter+0x1bc/0x4b0
seq_read_iter+0x1bc/0x4b0
kernfs_fop_read_iter+0x144/0x1c0
generic_file_splice_read+0xd0/0x184
do_splice_to+0x90/0xe0
splice_direct_to_actor+0xb8/0x250
do_splice_direct+0x88/0xd4
do_sendfile+0x2b0/0x344
__arm64_sys_sendfile64+0x164/0x16c
invoke_syscall+0x48/0x114
el0_svc_common.constprop.0+0x44/0xec
do_el0_svc+0x74/0x90
el0_svc+0x20/0x80
el0t_64_sync_handler+0x1a8/0x1b0
el0t_64_sync+0x1ac/0x1b0
...
kmemleak: Cannot insert 0xf2ff000003f88000 into the object search tree (overlaps existing)
CPU: 5 PID: 178 Comm: cat Not tainted 5.16.0-rc1-dirty #21
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x1ac
show_stack+0x1c/0x30
dump_stack_lvl+0x68/0x84
dump_stack+0x1c/0x38
create_object.isra.0+0x2d8/0x2fc
kmemleak_alloc+0x34/0x40
kmem_cache_alloc+0x23c/0x2f0
test_version_show+0x1fc/0x3a0
module_attr_show+0x28/0x40
sysfs_kf_seq_show+0xb0/0x130
kernfs_seq_show+0x30/0x40
seq_read_iter+0x1bc/0x4b0
kernfs_fop_read_iter+0x144/0x1c0
generic_file_splice_read+0xd0/0x184
do_splice_to+0x90/0xe0
splice_direct_to_actor+0xb8/0x250
do_splice_direct+0x88/0xd4
do_sendfile+0x2b0/0x344
__arm64_sys_sendfile64+0x164/0x16c
invoke_syscall+0x48/0x114
el0_svc_common.constprop.0+0x44/0xec
do_el0_svc+0x74/0x90
el0_svc+0x20/0x80
el0t_64_sync_handler+0x1a8/0x1b0
el0t_64_sync+0x1ac/0x1b0
kmemleak: Kernel memory leak detector disabled
kmemleak: Object 0xf2ff000003f88000 (size 128):
kmemleak: comm "cat", pid 177, jiffies 4294921177
kmemleak: min_count = 1
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
kmem_cache_alloc+0x23c/0x2f0
test_version_show+0x1fc/0x3a0
module_attr_show+0x28/0x40
sysfs_kf_seq_show+0xb0/0x130
kernfs_seq_show+0x30/0x40
seq_read_iter+0x1bc/0x4b0
kernfs_fop_read_iter+0x144/0x1c0
generic_file_splice_read+0xd0/0x184
do_splice_to+0x90/0xe0
splice_direct_to_actor+0xb8/0x250
do_splice_direct+0x88/0xd4
do_sendfile+0x2b0/0x344
__arm64_sys_sendfile64+0x164/0x16c
invoke_syscall+0x48/0x114
el0_svc_common.constprop.0+0x44/0xec
do_el0_svc+0x74/0x90
kmemleak: Automatic memory scanning thread ended
[akpm@linux-foundation.org: whitespace tweak]
Link: https://lkml.kernel.org/r/20211118054426.4123-1-Kuan-Ying.Lee@mediatek.com
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no external users of slab_start/next/stop(), so make them
static. And the memory.kmem.slabinfo is deprecated, which outputs
nothing now, so move memcg_slab_show() into mm/memcontrol.c and rename
it to mem_cgroup_slab_show to be consistent with other function names.
Link: https://lkml.kernel.org/r/20211109133359.32881-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling kmem_cache_destroy() while the cache still has objects allocated
is a kernel bug, and will usually result in the entire cache being
leaked. While the message in kmem_cache_destroy() resembles a warning,
it is currently not implemented using a real WARN().
This is problematic for infrastructure testing the kernel, all of which
rely on the specific format of WARN()s to pick up on bugs.
Some 13 years ago this used to be a simple WARN_ON() in slub, but commit
d629d81957 ("slub: improve kmem_cache_destroy() error message")
changed it into an open-coded warning to avoid confusion with a bug in
slub itself.
Instead, turn the open-coded warning into a real WARN() with the message
preserved, so that test systems can actually identify these issues, and
we get all the other benefits of using a normal WARN(). The warning
message is extended with "when called from <caller-ip>" to make it even
clearer where the fault lies.
For most configurations this is only a cosmetic change, however, note
that WARN() here will now also respect panic_on_warn.
Link: https://lkml.kernel.org/r/20211102170733.648216-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Simplify the dax_operations API
- Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining
and applying a partition offset to all its DAX iomap operations.
- Remove wrappers and device-mapper stacked callbacks for
->copy_from_iter() and ->copy_to_iter() in favor of moving
block_device relative offset responsibility to the
dax_direct_access() caller.
- Remove the need for an @bdev in filesystem-DAX infrastructure
- Remove unused uio helpers copy_from_iter_flushcache() and
copy_mc_to_iter() as only the non-check_copy_size() versions are
used for DAX.
- Prepare XFS for the pending (next merge window) DAX+reflink support
- Remove deprecated DEV_DAX_PMEM_COMPAT support
- Cleanup a straggling misuse of the GUID api
Tags offered after the branch was cut:
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/Ydb/3P+8nvjCjYfO@redhat.com
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYd3dTAAKCRDfioYZHlFs
Z//UAP9zetoTE+O7zJG7CXja4jSopSadbdbh6QKSXaqfKBPvQQD+N4US3wA2bGv8
f/qCY62j2Hj3hUTGHs9RvTyw3JsSYAA=
=QvDs
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax and libnvdimm updates from Dan Williams:
"The bulk of this is a rework of the dax_operations API after
discovering the obstacles it posed to the work-in-progress DAX+reflink
support for XFS and other copy-on-write filesystem mechanics.
Primarily the need to plumb a block_device through the API to handle
partition offsets was a sticking point and Christoph untangled that
dependency in addition to other cleanups to make landing the
DAX+reflink support easier.
The DAX_PMEM_COMPAT option has been around for 4 years and not only
are distributions shipping userspace that understand the current
configuration API, but some are not even bothering to turn this option
on anymore, so it seems a good time to remove it per the deprecation
schedule. Recall that this was added after the device-dax subsystem
moved from /sys/class/dax to /sys/bus/dax for its sysfs organization.
All recent functionality depends on /sys/bus/dax.
Some other miscellaneous cleanups and reflink prep patches are
included as well.
Summary:
- Simplify the dax_operations API:
- Eliminate bdev_dax_pgoff() in favor of the filesystem
maintaining and applying a partition offset to all its DAX iomap
operations.
- Remove wrappers and device-mapper stacked callbacks for
->copy_from_iter() and ->copy_to_iter() in favor of moving
block_device relative offset responsibility to the
dax_direct_access() caller.
- Remove the need for an @bdev in filesystem-DAX infrastructure
- Remove unused uio helpers copy_from_iter_flushcache() and
copy_mc_to_iter() as only the non-check_copy_size() versions are
used for DAX.
- Prepare XFS for the pending (next merge window) DAX+reflink support
- Remove deprecated DEV_DAX_PMEM_COMPAT support
- Cleanup a straggling misuse of the GUID api"
* tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits)
iomap: Fix error handling in iomap_zero_iter()
ACPI: NFIT: Import GUID before use
dax: remove the copy_from_iter and copy_to_iter methods
dax: remove the DAXDEV_F_SYNC flag
dax: simplify dax_synchronous and set_dax_synchronous
uio: remove copy_from_iter_flushcache() and copy_mc_to_iter()
iomap: turn the byte variable in iomap_zero_iter into a ssize_t
memremap: remove support for external pgmap refcounts
fsdax: don't require CONFIG_BLOCK
iomap: build the block based code conditionally
dax: fix up some of the block device related ifdefs
fsdax: shift partition offset handling into the file systems
dax: return the partition offset from fs_dax_get_by_bdev
iomap: add a IOMAP_DAX flag
xfs: pass the mapping flags to xfs_bmbt_to_iomap
xfs: use xfs_direct_write_iomap_ops for DAX zeroing
xfs: move dax device handling into xfs_{alloc,free}_buftarg
ext4: cleanup the dax handling in ext4_fill_super
ext2: cleanup the dax handling in ext2_fill_super
fsdax: decouple zeroing from the iomap buffered I/O code
...
This patchset stops just short of actually enabling large folios.
It converts everything that I noticed needs to be converted, but there may
still be places I've overlooked which still have page size assumptions.
The big change here is using large entries in the page cache XArray
instead of many small entries. That only affects shmem for now, but
it's a pretty big change for shmem since it changes where memory needs
to be allocated (at split time instead of insertion).
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmHcraoACgkQDpNsjXcp
gj7C3wgAl0cjtdVzTpkLmbnInsicW1m3thnbkSXYbpqRccFjpu2kEBGj31PT+oGz
dzgXP7SNZ/VkFT+qWtmHSRF/J41B6f9bFojO81B2aQdpRiziU+5QbSbXbfUjwVhE
GJF0WGSJtVqySKynXP/iYTEt2zj6BiVperAwIqzhZpPY7gNoyDgeRD34Xy5bQqdD
ey6/Uwkh7oFHLEDcgxsEnyF0tUR3q+gpe5XZW1fb79p3crWw44xATc3UvKv8qCLC
Rd4oHmKkOj4MvdiUxJEfXI+XxgrkQ8XRO70B+p6ZljhDaoDZYw7ullxA0gvlSpNX
6pnjSQlKA1VQXsi6PMSt+9vf26XxaQ==
=KeYZ
-----END PGP SIGNATURE-----
Merge tag 'folio-5.17' of git://git.infradead.org/users/willy/pagecache
Pull folio conversion updates from Matthew Wilcox:
"Convert much of the page cache to use folios
This stops just short of actually enabling large folios. It converts
everything that I noticed needs to be converted, but there may still
be places I've overlooked which still have page size assumptions.
The big change here is using large entries in the page cache XArray
instead of many small entries. That only affects shmem for now, but
it's a pretty big change for shmem since it changes where memory needs
to be allocated (at split time instead of insertion)"
* tag 'folio-5.17' of git://git.infradead.org/users/willy/pagecache: (49 commits)
mm: Use multi-index entries in the page cache
XArray: Add xas_advance()
truncate,shmem: Handle truncates that split large folios
truncate: Convert invalidate_inode_pages2_range to folios
fs: Convert vfs_dedupe_file_range_compare to folios
mm: Remove pagevec_remove_exceptionals()
mm: Convert find_lock_entries() to use a folio_batch
filemap: Return only folios from find_get_entries()
filemap: Convert filemap_get_read_batch() to use a folio_batch
filemap: Convert filemap_read() to use a folio
truncate: Add invalidate_complete_folio2()
truncate: Convert invalidate_inode_pages2_range() to use a folio
truncate: Skip known-truncated indices
truncate,shmem: Add truncate_inode_folio()
shmem: Convert part of shmem_undo_range() to use a folio
mm: Add unmap_mapping_folio()
truncate: Add truncate_cleanup_folio()
filemap: Add filemap_release_folio()
filemap: Use a folio in filemap_page_mkwrite
filemap: Use a folio in filemap_map_pages
...
This series provides KCSAN fixes and also the ability to take memory
barriers into account for weakly-ordered systems. This last can increase
the probability of detecting certain types of data races.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmHbuRwTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jKDPEACWuzYnd/u/02AHyRd3PIF3Px9uFKlK
TFwaXX95oYSFCXcrmO42YtDUlZm4QcbwNb85KMCu1DvckRtIsNw0rkBU7BGyqv3Z
ZoJEfMNpmC0x9+IFBOeseBHySPVT0x7GmYus05MSh0OLfkbCfyImmxRzgoKJGL+A
ADF9EQb4z2feWjmVEoN8uRaarCAD4f77rSXiX6oTCNDuKrHarqMld/TmoXFrJbu2
QtfwHeyvraKBnZdUoYfVbGVenyKb1vMv4bUlvrOcuJEgIi/J/th4FupR3XCGYulI
aWJTl2TQTGnMoE8VnFHgI27I841w3k5PVL+Y1hr/S4uN1/rIoQQuBzCtlnFeCksa
BiBXsHIchN8N0Dwh8zD8NMd2uxV4t3fmpxXTDAwaOm7vs5hA8AJ0XNu6Sz94Lyjv
wk2CxX41WWUNJVo3gh6SrS4mL6lC8+VvHF1hbIap++jrevj58gj1jAR1fdx4ANlT
e2qA00EeoMngEogDNZH42/Fxs3H9zxrBta2ZbkPkwzIqTHH+4pIQDCy2xO3T3oxc
twdGPYpjYdkf79EGsG4I4R/VA/IfcS09VIWTce8xSDeSnqkgFhcG37r1orJe8hTB
tH+ODkNOsz5HaEoa8OoAL4ko2l0fL99p2AtMPpuQfHjRj7aorF+dJIrqCizASxwx
37PjQgOmHeDHgQ==
=Q5fg
-----END PGP SIGNATURE-----
Merge tag 'kcsan.2022.01.09a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull KCSAN updates from Paul McKenney:
"This provides KCSAN fixes and also the ability to take memory barriers
into account for weakly-ordered systems. This last can increase the
probability of detecting certain types of data races"
* tag 'kcsan.2022.01.09a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits)
kcsan: Only test clear_bit_unlock_is_negative_byte if arch defines it
kcsan: Avoid nested contexts reading inconsistent reorder_access
kcsan: Turn barrier instrumentation into macros
kcsan: Make barrier tests compatible with lockdep
kcsan: Support WEAK_MEMORY with Clang where no objtool support exists
compiler_attributes.h: Add __disable_sanitizer_instrumentation
objtool, kcsan: Remove memory barrier instrumentation from noinstr
objtool, kcsan: Add memory barrier instrumentation to whitelist
sched, kcsan: Enable memory barrier instrumentation
mm, kcsan: Enable barrier instrumentation
x86/qspinlock, kcsan: Instrument barrier of pv_queued_spin_unlock()
x86/barriers, kcsan: Use generic instrumentation for non-smp barriers
asm-generic/bitops, kcsan: Add instrumentation for barriers
locking/atomics, kcsan: Add instrumentation for barriers
locking/barriers, kcsan: Support generic instrumentation
locking/barriers, kcsan: Add instrumentation for barriers
kcsan: selftest: Add test case to check memory barrier instrumentation
kcsan: Ignore GCC 11+ warnings about TSan runtime support
kcsan: test: Add test cases for memory barrier instrumentation
kcsan: test: Match reordered or normal accesses
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmHYFIIACgkQ4CHKc/GJ
qRBXqwf+JrWc3PCRF4xKeYmi367RgSX9D8kFCcAry1F+iuq1ssqlDBy/vEp1KtXE
t2Xyn6PILgzGcYdK1/CVNigwAom2NRcb8fHamjjopqYk8wor9m46I564Z6ItVg2I
SCcWhHEuD7M66tmBS+oex3n+LOZ4jPUPhkn5KH04/LSTrR5dzn1op6CnFbpOUZn1
Uy9qB6EbjuyhsONHnO/CdoRUU07K+KqEkzolXFCqpI2Vqf+VBvAwi+RpDLfKkr6l
Vp4PT03ixVsOWhGaJcf7hijKCRyfhsLp7Zyg33pzwpXyngqrowwUPVDMKPyqBy6O
ktehRk+cOQiAi7KnpECljof+NR15Qg==
=/Nyj
-----END PGP SIGNATURE-----
Merge tag 'slab-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- Separate struct slab from struct page - an offshot of the page folio
work.
Struct page fields used by slab allocators are moved from struct page
to a new struct slab, that uses the same physical storage. Similar to
struct folio, it always is a head page. This brings better type
safety, separation of large kmalloc allocations from true slabs, and
cleanup of related objcg code.
- A SLAB_MERGE_DEFAULT config optimization.
* tag 'slab-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (33 commits)
mm/slob: Remove unnecessary page_mapcount_reset() function call
bootmem: Use page->index instead of page->freelist
zsmalloc: Stop using slab fields in struct page
mm/slub: Define struct slab fields for CONFIG_SLUB_CPU_PARTIAL only when enabled
mm/slub: Simplify struct slab slabs field definition
mm/sl*b: Differentiate struct slab fields by sl*b implementations
mm/kfence: Convert kfence_guarded_alloc() to struct slab
mm/kasan: Convert to struct folio and struct slab
mm/slob: Convert SLOB to use struct slab and struct folio
mm/memcg: Convert slab objcgs from struct page to struct slab
mm: Convert struct page to struct slab in functions used by other subsystems
mm/slab: Finish struct page to struct slab conversion
mm/slab: Convert most struct page to struct slab by spatch
mm/slab: Convert kmem_getpages() and kmem_freepages() to struct slab
mm/slub: Finish struct page to struct slab conversion
mm/slub: Convert most struct page to struct slab by spatch
mm/slub: Convert pfmemalloc_match() to take a struct slab
mm/slub: Convert __free_slab() to use struct slab
mm/slub: Convert alloc_slab_page() to return a struct slab
mm/slub: Convert print_page_info() to print_slab_info()
...
from poison memory and error injection into SGX pages
- A bunch of changes to the SGX selftests to simplify and allow of SGX
features testing without the need of a whole SGX software stack
- Add a sysfs attribute which is supposed to show the amount of SGX
memory in a NUMA node, similar to what /proc/meminfo is to normal
memory
- The usual bunch of fixes and cleanups too
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcDQMACgkQEsHwGGHe
VUq42xAAjWM0AFpIxgUBpbE0swV3ZMulnndl3/vA5XN+9Yn7Q52+AFyPRE0s7Zam
Ap+cInh2Il7d/sv54rZ4x/j7+TH4i7s8fWPVU/XiPALQuOuw0/B1wJJ+jmMiPFiU
3jr7DkUPyWjWTHduMY/tk+xMOpkx1XsxJheYnKvsKVW+fjJ0vPuftAZtfu2z2VOh
3JLcp5cAXPxW0UK9gdoF5bCBQhBu0NRguTbhHhbByAixQO2GyVSKLSRovUdj0a+y
QRrQ6hgcvpTOsVHJoWJ7yIX4SBzQTe9Bg6dT9DghOxE4Sc2GH89hu7wRztGawBJO
nLyzWgiW9ttjQutDpBvZANNVcFAPAdtDWczrzZpREbrGKkzT+kOBnIIL1LWITWOy
2YWTO3ytW0KNIK85GzMjSVOKRMgaHJeBaGuYZ7Z0kb3GuUPJ9zRlaRxNapKQFuzA
0PGoA4IDT+2Afy7VYBBNUA2d/WverFQuXKusSxK6b5zJ173o5/DXL2q0d3gn/j8Z
hhxJUJyVOsfRXSG4NKrj4se4FiA0n/RL4oyUZR9iJ8kWzzZTd0eZTAn468bpGIp5
yiOlPOLgsmu0xzVmAtG1+4d2+S2x+Ec5YE0sP1V/JLNciYk3Ebp7UyfnS3tn33Xc
cpdWjELvD1LJVpMEURnbjRrwU6OiiAekYJCP/9lmK9zfOGpwRHc=
=vFTM
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Borislav Petkov:
- Add support for handling hw errors in SGX pages: poisoning,
recovering from poison memory and error injection into SGX pages
- A bunch of changes to the SGX selftests to simplify and allow of SGX
features testing without the need of a whole SGX software stack
- Add a sysfs attribute which is supposed to show the amount of SGX
memory in a NUMA node, similar to what /proc/meminfo is to normal
memory
- The usual bunch of fixes and cleanups too
* tag 'x86_sgx_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/sgx: Fix NULL pointer dereference on non-SGX systems
selftests/sgx: Fix corrupted cpuid macro invocation
x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node
x86/sgx: Fix minor documentation issues
selftests/sgx: Add test for multiple TCS entry
selftests/sgx: Enable multiple thread support
selftests/sgx: Add page permission and exception test
selftests/sgx: Rename test properties in preparation for more enclave tests
selftests/sgx: Provide per-op parameter structs for the test enclave
selftests/sgx: Add a new kselftest: Unclobbered_vdso_oversubscribed
selftests/sgx: Move setup_test_encl() to each TEST_F()
selftests/sgx: Encpsulate the test enclave creation
selftests/sgx: Dump segments and /proc/self/maps only on failure
selftests/sgx: Create a heap for the test enclave
selftests/sgx: Make data measurement for an enclave segment optional
selftests/sgx: Assign source for each segment
selftests/sgx: Fix a benign linker warning
x86/sgx: Add check for SGX pages to ghes_do_memory_failure()
x86/sgx: Add hook to error injection address validation
x86/sgx: Hook arch_memory_failure() into mainline code
...
When I say remove I mean remove. All profile_task_exit and
profile_munmap do is call a blocking notifier chain. The helpers
profile_task_register and profile_task_unregister are not called
anywhere in the tree. Which means this is all dead code.
So remove the dead code and make it easier to read do_exit.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20220103213312.9144-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change
__task_will_free_mem to test signal->core_state instead of the flag
SIGNAL_GROUP_COREDUMP.
Both fields are protected by siglock and both live in signal_struct so
there are no real tradeoffs here, just a change to which field is
being tested.
Link: https://lkml.kernel.org/r/20211213225350.27481-3-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We currently store large folios as 2^N consecutive entries. While this
consumes rather more memory than necessary, it also turns out to be buggy.
A writeback operation which starts within a tail page of a dirty folio will
not write back the folio as the xarray's dirty bit is only set on the
head index. With multi-index entries, the dirty bit will be found no
matter where in the folio the operation starts.
This does end up simplifying the page cache slightly, although not as
much as I had hoped.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Handle folio splitting in the parts of the truncation functions which
already handle partial pages. Factor all that code out into a new
function called truncate_inode_partial_folio().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
If we're going to unmap a folio, we have to be sure to unmap the entire
folio, not just the part of it which lies after the search index.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
All of its callers now call folio_batch_remove_exceptionals().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
find_lock_entries() already only returned the head page of folios, so
convert it to return a folio_batch instead of a pagevec. That cascades
through converting truncate_inode_pages_range() to
delete_from_page_cache_batch() and page_cache_delete_batch().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
The callers have all been converted to work on folios, so convert
find_get_entries() to return a batch of folios instead of pages.
We also now return multiple large folios in a single call.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This change ripples all the way through the filemap_read() call chain and
removes a lot of messing about converting folios to pages and back again.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
We know the pagevec always contains folios, but use page_folio() anyway
instead of casting. Removes a few calls to legacy functions.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Convert invalidate_complete_page2() to invalidate_complete_folio2().
Use filemap_free_folio() to free the page instead of calling ->freepage
manually.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
If we're going to unmap a folio, we have to be sure to unmap the entire
folio, not just the part of it which lies after the search index.
We cannot yet remove the struct page from invalidate_inode_pages2_range()
because the page pointer in the pvec might be a shadow/dax/swap entry
instead of actually a page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
If we've truncated an entire folio, we can skip over all the indices
covered by this folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Convert all callers of truncate_inode_page() to call
truncate_inode_folio() instead, and move the declaration to mm/internal.h.
Move the assertion that the caller is not passing in a tail page to
generic_error_remove_page(). We can't entirely remove the struct page
from the callers yet because the page pointer in the pvec might be a
shadow/dax/swap entry instead of actually a page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
find_lock_entries() never returns tail pages. We cannot use page_folio()
here as the pagevec may also contain swap entries, so simply cast for
now. This is an intermediate step which will be fully removed by the
end of this series.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Convert both callers of unmap_mapping_page() to call unmap_mapping_folio()
instead. Also move zap_details from linux/mm.h to mm/memory.c
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
All members of struct slab can now be removed from struct page.
This shrinks the definition of struct page by 30 LOC, making
it easier to understand.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
After commit 401fb12c68 ("mm/sl*b: Differentiate struct slab fields by
sl*b implementations"), we can reorder fields of struct slab depending
on slab allocator.
For now, page_mapcount_reset() is called because page->_mapcount and
slab->units have same offset. But this is not necessary for struct slab.
Use unused field for units instead.
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20211212065241.GA886691@odroid
page->freelist is for the use of slab. Using page->index is the same
set of bits as page->freelist, and by using an integer instead of a
pointer, we can avoid casts.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <x86@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
The ->freelist and ->units members of struct page are for the use of
slab only. I'm not particularly familiar with zsmalloc, so generate the
same code by using page->index to store 'page' (page->index and
page->freelist are at the same offset in struct page). This should be
cleaned up properly at some point by somebody who is familiar with
zsmalloc.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
The fields 'next' and 'slabs' are only used when CONFIG_SLUB_CPU_PARTIAL
is enabled. We can put their definition to #ifdef to prevent accidental
use when disabled.
Currenlty show_slab_objects() and slabs_cpu_partial_show() contain code
accessing the slabs field that's effectively dead with
CONFIG_SLUB_CPU_PARTIAL=n through the wrappers slub_percpu_partial() and
slub_percpu_partial_read_once(), but to prevent a compile error, we need
to hide all this code behind #ifdef.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Before commit b47291ef02 ("mm, slub: change percpu partial accounting
from objects to pages") we had to fit two integer fields into a native
word size, so we used short int on 32-bit and int on 64-bit via #ifdef.
After that commit there is only one integer field, so we can simply
define it as int everywhere.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
With a struct slab definition separate from struct page, we can go
further and define only fields that the chosen sl*b implementation uses.
This means everything between __page_flags and __page_refcount
placeholders now depends on the chosen CONFIG_SL*B. Some fields exist in
all implementations (slab_list) but can be part of a union in some, so
it's simpler to repeat them than complicate the definition with ifdefs
even more.
The patch doesn't change physical offsets of the fields, although it
could be done later - for example it's now clear that tighter packing in
SLOB could be possible.
This should also prevent accidental use of fields that don't exist in
given implementation. Before this patch virt_to_cache() and
cache_from_obj() were visible for SLOB (albeit not used), although they
rely on the slab_cache field that isn't set by SLOB. With this patch
it's now a compile error, so these functions are now hidden behind
an #ifndef CONFIG_SLOB.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Marco Elver <elver@google.com> # kfence
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <kasan-dev@googlegroups.com>
The function sets some fields that are being moved from struct page to
struct slab so it needs to be converted.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <kasan-dev@googlegroups.com>
KASAN accesses some slab related struct page fields so we need to
convert it to struct slab. Some places are a bit simplified thanks to
kasan_addr_to_slab() encapsulating the PageSlab flag check through
virt_to_slab(). When resolving object address to either a real slab or
a large kmalloc, use struct folio as the intermediate type for testing
the slab flag to avoid unnecessary implicit compound_head().
[ vbabka@suse.cz: use struct folio, adjust to differences in previous
patches ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Hyeongogn Yoo <42.hyeyoo@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <kasan-dev@googlegroups.com>
Use struct slab throughout the slob allocator. Where non-slab page can
appear use struct folio instead of struct page.
[ vbabka@suse.cz: don't introduce wrappers for PageSlobFree in mm/slab.h
just for the single callers being wrappers in mm/slob.c ]
[ Hyeonggon Yoo <42.hyeyoo@gmail.com>: fix NULL pointer deference ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
page->memcg_data is used with MEMCG_DATA_OBJCGS flag only for slab pages
so convert all the related infrastructure to struct slab. Also use
struct folio instead of struct page when resolving object pointers.
This is not just mechanistic changing of types and names. Now in
mem_cgroup_from_obj() we use folio_test_slab() to decide if we interpret
the folio as a real slab instead of a large kmalloc, instead of relying
on MEMCG_DATA_OBJCGS bit that used to be checked in page_objcgs_check().
Similarly in memcg_slab_free_hook() where we can encounter
kmalloc_large() pages (here the folio slab flag check is implied by
virt_to_slab()). As a result, page_objcgs_check() can be dropped instead
of converted.
To avoid include cycles, move the inline definition of slab_objcgs()
from memcontrol.h to mm/slab.h.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <cgroups@vger.kernel.org>
Change cache_free_alien() to use slab_nid(virt_to_slab()). Otherwise
just update of comments and some remaining variable names.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
The majority of conversion from struct page to struct slab in SLAB
internals can be delegated to a coccinelle semantic patch. This includes
renaming of variables with 'page' in name to 'slab', and similar.
Big thanks to Julia Lawall and Luis Chamberlain for help with
coccinelle.
// Options: --include-headers --no-includes --smpl-spacing mm/slab.c
// Note: needs coccinelle 1.1.1 to avoid breaking whitespace, and ocaml for the
// embedded script
// build list of functions for applying the next rule
@initialize:ocaml@
@@
let ok_function p =
not (List.mem (List.hd p).current_element ["kmem_getpages";"kmem_freepages"])
// convert the type in selected functions
@@
position p : script:ocaml() { ok_function p };
@@
- struct page@p
+ struct slab
@@
@@
-PageSlabPfmemalloc(page)
+slab_test_pfmemalloc(slab)
@@
@@
-ClearPageSlabPfmemalloc(page)
+slab_clear_pfmemalloc(slab)
@@
@@
obj_to_index(
...,
- page
+ slab_page(slab)
,...)
// for all functions, change any "struct slab *page" parameter to "struct slab
// *slab" in the signature, and generally all occurences of "page" to "slab" in
// the body - with some special cases.
@@
identifier fn;
expression E;
@@
fn(...,
- struct slab *page
+ struct slab *slab
,...)
{
<...
(
- int page_node;
+ int slab_node;
|
- page_node
+ slab_node
|
- page_slab(page)
+ slab
|
- page_address(page)
+ slab_address(slab)
|
- page_size(page)
+ slab_size(slab)
|
- page_to_nid(page)
+ slab_nid(slab)
|
- virt_to_head_page(E)
+ virt_to_slab(E)
|
- page
+ slab
)
...>
}
// rename a function parameter
@@
identifier fn;
expression E;
@@
fn(...,
- int page_node
+ int slab_node
,...)
{
<...
- page_node
+ slab_node
...>
}
// functions converted by previous rules that were temporarily called using
// slab_page(E) so we want to remove the wrapper now that they accept struct
// slab ptr directly
@@
identifier fn =~ "index_to_obj";
expression E;
@@
fn(...,
- slab_page(E)
+ E
,...)
// functions that were returning struct page ptr and now will return struct
// slab ptr, including slab_page() wrapper removal
@@
identifier fn =~ "cache_grow_begin|get_valid_first_slab|get_first_slab";
expression E;
@@
fn(...)
{
<...
- slab_page(E)
+ E
...>
}
// rename any former struct page * declarations
@@
@@
struct slab *
-page
+slab
;
// all functions (with exceptions) with a local "struct slab *page" variable
// that will be renamed to "struct slab *slab"
@@
identifier fn !~ "kmem_getpages|kmem_freepages";
expression E;
@@
fn(...)
{
<...
(
- page_slab(page)
+ slab
|
- page_to_nid(page)
+ slab_nid(slab)
|
- kasan_poison_slab(page)
+ kasan_poison_slab(slab_page(slab))
|
- page_address(page)
+ slab_address(slab)
|
- page_size(page)
+ slab_size(slab)
|
- page->pages
+ slab->slabs
|
- page = virt_to_head_page(E)
+ slab = virt_to_slab(E)
|
- virt_to_head_page(E)
+ virt_to_slab(E)
|
- page
+ slab
)
...>
}
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Luis Chamberlain <mcgrof@kernel.org>
These functions sit at the boundary to page allocator. Also use folio
internally to avoid extra compound_head() when dealing with page flags.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Update comments mentioning pages to mention slabs where appropriate.
Also some goto labels.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
The majority of conversion from struct page to struct slab in SLUB
internals can be delegated to a coccinelle semantic patch. This includes
renaming of variables with 'page' in name to 'slab', and similar.
Big thanks to Julia Lawall and Luis Chamberlain for help with
coccinelle.
// Options: --include-headers --no-includes --smpl-spacing include/linux/slub_def.h mm/slub.c
// Note: needs coccinelle 1.1.1 to avoid breaking whitespace, and ocaml for the
// embedded script
// build list of functions to exclude from applying the next rule
@initialize:ocaml@
@@
let ok_function p =
not (List.mem (List.hd p).current_element ["nearest_obj";"obj_to_index";"objs_per_slab_page";"__slab_lock";"__slab_unlock";"free_nonslab_page";"kmalloc_large_node"])
// convert the type from struct page to struct page in all functions except the
// list from previous rule
// this also affects struct kmem_cache_cpu, but that's ok
@@
position p : script:ocaml() { ok_function p };
@@
- struct page@p
+ struct slab
// in struct kmem_cache_cpu, change the name from page to slab
// the type was already converted by the previous rule
@@
@@
struct kmem_cache_cpu {
...
-struct slab *page;
+struct slab *slab;
...
}
// there are many places that use c->page which is now c->slab after the
// previous rule
@@
struct kmem_cache_cpu *c;
@@
-c->page
+c->slab
@@
@@
struct kmem_cache {
...
- unsigned int cpu_partial_pages;
+ unsigned int cpu_partial_slabs;
...
}
@@
struct kmem_cache *s;
@@
- s->cpu_partial_pages
+ s->cpu_partial_slabs
@@
@@
static void
- setup_page_debug(
+ setup_slab_debug(
...)
{...}
@@
@@
- setup_page_debug(
+ setup_slab_debug(
...);
// for all functions (with exceptions), change any "struct slab *page"
// parameter to "struct slab *slab" in the signature, and generally all
// occurences of "page" to "slab" in the body - with some special cases.
@@
identifier fn !~ "free_nonslab_page|obj_to_index|objs_per_slab_page|nearest_obj";
@@
fn(...,
- struct slab *page
+ struct slab *slab
,...)
{
<...
- page
+ slab
...>
}
// similar to previous but the param is called partial_page
@@
identifier fn;
@@
fn(...,
- struct slab *partial_page
+ struct slab *partial_slab
,...)
{
<...
- partial_page
+ partial_slab
...>
}
// similar to previous but for functions that take pointer to struct page ptr
@@
identifier fn;
@@
fn(...,
- struct slab **ret_page
+ struct slab **ret_slab
,...)
{
<...
- ret_page
+ ret_slab
...>
}
// functions converted by previous rules that were temporarily called using
// slab_page(E) so we want to remove the wrapper now that they accept struct
// slab ptr directly
@@
identifier fn =~ "slab_free|do_slab_free";
expression E;
@@
fn(...,
- slab_page(E)
+ E
,...)
// similar to previous but for another pattern
@@
identifier fn =~ "slab_pad_check|check_object";
@@
fn(...,
- folio_page(folio, 0)
+ slab
,...)
// functions that were returning struct page ptr and now will return struct
// slab ptr, including slab_page() wrapper removal
@@
identifier fn =~ "allocate_slab|new_slab";
expression E;
@@
static
-struct slab *
+struct slab *
fn(...)
{
<...
- slab_page(E)
+ E
...>
}
// rename any former struct page * declarations
@@
@@
struct slab *
(
- page
+ slab
|
- partial_page
+ partial_slab
|
- oldpage
+ oldslab
)
;
// this has to be separate from previous rule as page and page2 appear at the
// same line
@@
@@
struct slab *
-page2
+slab2
;
// similar but with initial assignment
@@
expression E;
@@
struct slab *
(
- page
+ slab
|
- flush_page
+ flush_slab
|
- discard_page
+ slab_to_discard
|
- page_to_unfreeze
+ slab_to_unfreeze
)
= E;
// convert most of struct page to struct slab usage inside functions (with
// exceptions), including specific variable renames
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
expression E;
@@
fn(...)
{
<...
(
- int pages;
+ int slabs;
|
- int pages = E;
+ int slabs = E;
|
- page
+ slab
|
- flush_page
+ flush_slab
|
- partial_page
+ partial_slab
|
- oldpage->pages
+ oldslab->slabs
|
- oldpage
+ oldslab
|
- unsigned int nr_pages;
+ unsigned int nr_slabs;
|
- nr_pages
+ nr_slabs
|
- unsigned int partial_pages = E;
+ unsigned int partial_slabs = E;
|
- partial_pages
+ partial_slabs
)
...>
}
// this has to be split out from the previous rule so that lines containing
// multiple matching changes will be fully converted
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
@@
fn(...)
{
<...
(
- slab->pages
+ slab->slabs
|
- pages
+ slabs
|
- page2
+ slab2
|
- discard_page
+ slab_to_discard
|
- page_to_unfreeze
+ slab_to_unfreeze
)
...>
}
// after we simply changed all occurences of page to slab, some usages need
// adjustment for slab-specific functions, or use slab_page() wrapper
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
@@
fn(...)
{
<...
(
- page_slab(slab)
+ slab
|
- kasan_poison_slab(slab)
+ kasan_poison_slab(slab_page(slab))
|
- page_address(slab)
+ slab_address(slab)
|
- page_size(slab)
+ slab_size(slab)
|
- PageSlab(slab)
+ folio_test_slab(slab_folio(slab))
|
- page_to_nid(slab)
+ slab_nid(slab)
|
- compound_order(slab)
+ slab_order(slab)
)
...>
}
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Preparatory for mass conversion. Use the new slab_test_pfmemalloc()
helper. As it doesn't do VM_BUG_ON(!PageSlab()) we no longer need the
pfmemalloc_match_unsafe() variant.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
__free_slab() is on the boundary of distinguishing struct slab and
struct page so start with struct slab but convert to folio for working
with flags and folio_page() to call functions that require struct page.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Preparatory, callers convert back to struct page for now.
Also move setting page flags to alloc_slab_page() where we still operate
on a struct page. This means the page->slab_cache pointer is now set
later than the PageSlab flag, which could theoretically confuse some pfn
walker assuming PageSlab means there would be a valid cache pointer. But
as the code had no barriers and used __set_bit() anyway, it could have
happened already, so there shouldn't be such a walker.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Improve the type safety and prepare for further conversion. For flags
access, convert to folio internally.
[ vbabka@suse.cz: access flags via folio_flags() ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
These functions operate on the PG_locked page flag, but make them accept
struct slab to encapsulate this implementation detail.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Convert kfree(), kmem_cache_free() and ___cache_free() to resolve object
addresses to struct slab, using folio as intermediate step where needed.
Keep passing the result as struct page for now in preparation for mass
conversion of internal functions.
[ vbabka@suse.cz: Use folio as intermediate step when checking for
large kmalloc pages, and when freeing them - rename
free_nonslab_page() to free_large_kmalloc() that takes struct folio ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
This gives us a little bit of extra typesafety as we know that nobody
called virt_to_page() instead of virt_to_head_page().
[ vbabka@suse.cz: Use folio as intermediate step when filtering out
large kmalloc pages ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Ensure that we're not seeing a tail page inside __check_heap_object() by
converting to a slab instead of a page. Take the opportunity to mark
the slab as const since we're not modifying it. Also move the
declaration of __check_heap_object() to mm/slab.h so it's not available
to the wider kernel.
[ vbabka@suse.cz: in check_heap_object() only convert to struct slab for
actual PageSlab pages; use folio as intermediate step instead of page ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>