linux/mm
Peter Xu 5abfd71d93 mm: don't skip swap entry even if zap_details specified
Patch series "mm: Rework zap ptes on swap entries", v5.

Patch 1 should fix a long standing bug for zap_pte_range() on
zap_details usage.  The risk is we could have some swap entries skipped
while we should have zapped them.

Migration entries are not the major concern because file backed memory
always zap in the pattern that "first time without page lock, then
re-zap with page lock" hence the 2nd zap will always make sure all
migration entries are already recovered.

However there can be issues with real swap entries got skipped
errornoously.  There's a reproducer provided in commit message of patch
1 for that.

Patch 2-4 are cleanups that are based on patch 1.  After the whole
patchset applied, we should have a very clean view of zap_pte_range().

Only patch 1 needs to be backported to stable if necessary.

This patch (of 4):

The "details" pointer shouldn't be the token to decide whether we should
skip swap entries.

For example, when the callers specified details->zap_mapping==NULL, it
means the user wants to zap all the pages (including COWed pages), then
we need to look into swap entries because there can be private COWed
pages that was swapped out.

Skipping some swap entries when details is non-NULL may lead to wrongly
leaving some of the swap entries while we should have zapped them.

A reproducer of the problem:

===8<===
        #define _GNU_SOURCE         /* See feature_test_macros(7) */
        #include <stdio.h>
        #include <assert.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/types.h>

        int page_size;
        int shmem_fd;
        char *buffer;

        void main(void)
        {
                int ret;
                char val;

                page_size = getpagesize();
                shmem_fd = memfd_create("test", 0);
                assert(shmem_fd >= 0);

                ret = ftruncate(shmem_fd, page_size * 2);
                assert(ret == 0);

                buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE, shmem_fd, 0);
                assert(buffer != MAP_FAILED);

                /* Write private page, swap it out */
                buffer[page_size] = 1;
                madvise(buffer, page_size * 2, MADV_PAGEOUT);

                /* This should drop private buffer[page_size] already */
                ret = ftruncate(shmem_fd, page_size);
                assert(ret == 0);
                /* Recover the size */
                ret = ftruncate(shmem_fd, page_size * 2);
                assert(ret == 0);

                /* Re-read the data, it should be all zero */
                val = buffer[page_size];
                if (val == 0)
                        printf("Good\n");
                else
                        printf("BUG\n");
        }
===8<===

We don't need to touch up the pmd path, because pmd never had a issue with
swap entries.  For example, shmem pmd migration will always be split into
pte level, and same to swapping on anonymous.

Add another helper should_zap_cows() so that we can also check whether we
should zap private mappings when there's no page pointer specified.

This patch drops that trick, so we handle swap ptes coherently.  Meanwhile
we should do the same check upon migration entry, hwpoison entry and
genuine swap entries too.

To be explicit, we should still remember to keep the private entries if
even_cows==false, and always zap them when even_cows==true.

The issue seems to exist starting from the initial commit of git.

[peterx@redhat.com: comment tweaks]
  Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com

Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:04 -07:00
..
damon mm/damon: hide kernel pointer from tracepoint event 2022-01-15 16:30:33 +02:00
kasan lib/stackdepot: always do filter_irq_stacks() in stack_depot_save() 2022-01-22 08:33:38 +02:00
kfence kfence: make test case compatible with run time set sample interval 2022-02-11 17:55:00 -08:00
backing-dev.c remove congestion tracking framework 2022-03-22 15:57:01 -07:00
balloon_compaction.c mm: fix typos in comments 2021-05-07 00:26:35 -07:00
bootmem_info.c bootmem: Use page->index instead of page->freelist 2022-01-06 12:27:03 +01:00
cma_debug.c mm/cma: change cma mutex to irq safe spinlock 2021-05-05 11:27:21 -07:00
cma_sysfs.c mm: cma: support sysfs 2021-05-05 11:27:24 -07:00
cma.c memblock: rename memblock_free to memblock_phys_free 2021-11-06 13:30:41 -07:00
cma.h mm: cma: support sysfs 2021-05-05 11:27:24 -07:00
compaction.c mm: compaction: fix the migration stats in trace_mm_compaction_migratepages() 2022-01-15 16:30:30 +02:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: remove pte entry from the page table 2022-02-04 09:25:04 -08:00
debug.c mm,fs: split dump_mapping() out from dump_page() 2022-01-15 16:30:26 +02:00
dmapool.c mm/dmapool.c: revert "make dma pool to use kmalloc_node" 2022-01-15 16:30:28 +02:00
early_ioremap.c mm/early_ioremap.c: remove redundant early_ioremap_shutdown() 2021-09-08 11:50:24 -07:00
fadvise.c remove inode_congested() 2022-03-22 15:57:01 -07:00
failslab.c
filemap.c tmpfs: do not allocate pages on read 2022-03-22 15:57:02 -07:00
folio-compat.c filemap: Add filemap_release_folio() 2022-01-04 13:15:34 -05:00
frontswap.c frontswap: remove support for multiple ops 2022-01-22 08:33:38 +02:00
gup_test.c selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages 2021-05-05 11:27:26 -07:00
gup_test.h selftests/vm: gup_test: fix test flag 2021-05-05 11:27:26 -07:00
gup.c mm/gup: remove unused get_user_pages_locked() 2022-03-22 15:57:01 -07:00
highmem.c Fixes for 5.16 folios: 2021-11-25 10:13:56 -08:00
hmm.c mm/hmm.c: allow VM_MIXEDMAP to work with hmm_range_fault 2022-01-15 16:30:31 +02:00
huge_memory.c mm: thp: fix wrong cache flush in remove_migration_pmd() 2022-03-22 15:57:04 -07:00
hugetlb_cgroup.c hugetlb: add hugetlb.*.numa_stat file 2022-01-15 16:30:29 +02:00
hugetlb_vmemmap.c mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON 2021-06-30 20:47:26 -07:00
hugetlb_vmemmap.h mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate 2021-06-30 20:47:25 -07:00
hugetlb.c mm: hugetlb: fix missing cache flush in hugetlb_mcopy_atomic_pte() 2022-03-22 15:57:04 -07:00
hwpoison-inject.c mm: hwpoison: don't drop slab caches for offlining non-LRU page 2021-09-03 09:58:15 -07:00
init-mm.c mm: add setup_initial_init_mm() helper 2021-07-08 11:48:21 -07:00
internal.h Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
interval_tree.c mm/interval_tree: add comments to improve code readability 2021-04-30 11:20:38 -07:00
io-mapping.c mm: add a io_mapping_map_user helper 2021-04-30 11:20:39 -07:00
ioremap.c mm: move ioremap_page_range to vmalloc.c 2021-09-08 11:50:24 -07:00
Kconfig mm: hide the FRONTSWAP Kconfig symbol 2022-01-22 08:33:38 +02:00
Kconfig.debug mm: page table check 2022-01-15 16:30:28 +02:00
khugepaged.c mm/page_table_check: check entries at pmd levels 2022-02-04 09:25:04 -08:00
kmemleak.c mm/kmemleak: avoid scanning potential huge holes 2022-02-04 09:25:05 -08:00
ksm.c mm: ksm: fix use-after-free kasan report in ksm_might_need_to_copy 2022-01-15 16:30:31 +02:00
list_lru.c mm: memcontrol: rename memcg_cache_id to memcg_kmem_id 2022-03-22 15:57:04 -07:00
maccess.c ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault 2021-08-20 11:39:25 +01:00
madvise.c mm: fix use-after-free when anon vma name is used after vma is freed 2022-03-05 11:08:32 -08:00
Makefile mm: remove cleancache 2022-01-22 08:33:38 +02:00
mapping_dirty_helpers.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
memblock.c memblock: use kfree() to release kmalloced memblock regions 2022-02-20 08:45:39 +02:00
memcontrol.c mm: memcontrol: fix cannot alloc the maximum memcg ID 2022-03-22 15:57:03 -07:00
memfd.c memfd: fix F_SEAL_WRITE after shmem huge page allocated 2022-03-05 11:08:32 -08:00
memory_hotplug.c treewide: Add missing includes masked by cgroup -> bpf dependency 2021-12-03 10:58:13 -08:00
memory-failure.c memory-failure: fetch compound_head after pgmap_pfn_valid() 2022-01-30 09:56:58 +02:00
memory.c mm: don't skip swap entry even if zap_details specified 2022-03-22 15:57:04 -07:00
mempolicy.c mm: change lookup_node() to use get_user_pages_fast() 2022-03-22 15:57:01 -07:00
mempool.c mm: remove spurious blkdev.h includes 2021-10-18 06:17:01 -06:00
memremap.c mm/memremap: avoid calling kasan_remove_zero_shadow() for device private memory 2022-03-22 15:57:01 -07:00
memtest.c
migrate.c mm: replace multiple dcache flush with flush_dcache_folio() 2022-03-22 15:57:04 -07:00
mincore.c
mlock.c mm: refactor vm_area_struct::anon_vma_name usage code 2022-03-05 11:08:32 -08:00
mm_init.c include/linux/page-flags-layout.h: cleanups 2021-04-30 11:20:42 -07:00
mmap_lock.c mm: mmap_lock: fix disabling preemption directly 2021-07-23 17:43:28 -07:00
mmap.c mm: refactor vm_area_struct::anon_vma_name usage code 2022-03-05 11:08:32 -08:00
mmu_gather.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
mmu_notifier.c mm/mmu_notifiers: ensure range_end() is paired with range_start() 2021-03-25 09:22:55 -07:00
mmzone.c
mprotect.c mm: refactor vm_area_struct::anon_vma_name usage code 2022-03-05 11:08:32 -08:00
mremap.c mm, hugepages: add mremap() support for hugepage backed vma 2021-11-06 13:30:39 -07:00
msync.c mm/msync: exit early when the flags is an MS_ASYNC and start < vm_start 2021-04-30 11:20:37 -07:00
nommu.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
oom_kill.c Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
page_alloc.c Merge branch 'akpm' (patches from Andrew) 2022-01-20 10:41:01 +02:00
page_counter.c mm/page_counter: remove an incorrect call to propagate_protected_usage() 2022-01-15 16:30:27 +02:00
page_ext.c mm: make some vars and functions static or __init 2022-01-15 16:30:31 +02:00
page_idle.c mm/idle_page_tracking: make PG_idle reusable 2021-09-08 11:50:24 -07:00
page_io.c delayacct: support swapin delay accounting for swapping without blkio 2022-01-20 08:52:55 +02:00
page_isolation.c Revert "mm/page_isolation: unset migratetype directly for non Buddy page" 2022-02-04 09:25:04 -08:00
page_owner.c lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() 2022-01-22 08:33:37 +02:00
page_poison.c mm: page_poison: print page info when corruption is caught 2021-04-30 11:20:36 -07:00
page_reporting.c mm/page_reporting: allow driver to specify reporting order 2021-06-29 10:53:47 -07:00
page_reporting.h mm/page_reporting: export reporting order as module parameter 2021-06-29 10:53:47 -07:00
page_table_check.c mm/page_table_check: check entries at pmd levels 2022-02-04 09:25:04 -08:00
page_vma_mapped.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
page-writeback.c mm/writeback: minor clean up for highmem_dirtyable_memory 2022-03-22 15:57:01 -07:00
pagewalk.c mm: pagewalk: fix walk for hugepage tables 2021-06-29 10:53:49 -07:00
percpu-internal.h mm: memcg/percpu: account extra objcg space to memory cgroups 2022-01-15 16:30:31 +02:00
percpu-km.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu-stats.c percpu: rework memcg accounting 2021-06-05 20:43:15 +00:00
percpu-vm.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu.c bitmap patches for 5.17-rc1 2022-01-23 06:20:44 +02:00
pgalloc-track.h mm: fix typos in comments 2021-05-07 00:26:35 -07:00
pgtable-generic.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
process_vm_access.c mm/process_vm_access.c: remove duplicate include 2021-05-05 11:27:27 -07:00
ptdump.c mm: ptdump: fix build failure 2021-04-16 16:10:37 -07:00
readahead.c remove inode_congested() 2022-03-22 15:57:01 -07:00
rmap.c mm/rmap: fix potential batched TLB flush race 2022-01-15 16:30:31 +02:00
rodata_test.c
secretmem.c mm/secretmem: avoid letting secretmem_users drop to zero 2021-10-28 17:18:55 -07:00
shmem.c mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte() 2022-03-22 15:57:04 -07:00
shuffle.c mm: eliminate "expecting prototype" kernel-doc warnings 2021-04-16 16:10:36 -07:00
shuffle.h mm/shuffle: fix section mismatch warning 2021-05-22 15:09:07 -10:00
slab_common.c Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
slab.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slab.h mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slob.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slub.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
sparse-vmemmap.c mm: remove redundant smp_wmb() 2021-11-06 13:30:36 -07:00
sparse.c bootmem: Use page->index instead of page->freelist 2022-01-06 12:27:03 +01:00
swap_cgroup.c
swap_slots.c treewide: Add missing includes masked by cgroup -> bpf dependency 2021-12-03 10:58:13 -08:00
swap_state.c mm: swap: get rid of livelock in swapin readahead 2022-03-17 11:02:13 -07:00
swap.c mm/swap: fix confusing comment in folio_mark_accessed 2022-03-22 15:57:01 -07:00
swapfile.c mm: mark swap_lock and swap_active_head static 2022-01-22 08:33:38 +02:00
truncate.c mm: remove cleancache 2022-01-22 08:33:38 +02:00
usercopy.c mm: Convert check_heap_object() to use struct slab 2022-01-06 12:25:51 +01:00
userfaultfd.c mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic() 2022-03-22 15:57:04 -07:00
util.c mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls 2022-03-04 10:00:37 -08:00
vmacache.c
vmalloc.c mm: merge pte_mkhuge() call into arch_make_huge_pte() 2022-03-22 15:57:04 -07:00
vmpressure.c mm/vmpressure: fix data-race with memcg->socket_pressure 2021-11-06 13:30:40 -07:00
vmscan.c remove bdi_congested() and wb_congested() and related functions 2022-03-22 15:57:01 -07:00
vmstat.c mm/vmstat: add events for THP max_ptes_* exceeds 2022-01-15 16:30:29 +02:00
workingset.c xarray: use kmem_cache_alloc_lru to allocate xa_node 2022-03-22 15:57:03 -07:00
z3fold.c mm/z3fold: add kerneldoc fields for z3fold_pool 2021-07-01 11:06:03 -07:00
zbud.c mm/zbud: add kerneldoc fields for zbud_pool 2021-07-01 11:06:03 -07:00
zpool.c zpool: remove the list of pools_head 2022-01-15 16:30:31 +02:00
zsmalloc.c zsmalloc: replace get_cpu_var with local_lock 2022-01-22 08:33:37 +02:00
zswap.c frontswap: remove support for multiple ops 2022-01-22 08:33:38 +02:00