With the new hugetlb vma lock in place, it can also be used to handle page
fault races with file truncation. The lock is taken at the beginning of
the code fault path in read mode. During truncation, it is taken in write
mode for each vma which has the file mapped. The file's size (i_size) is
modified before taking the vma lock to unmap.
How are races handled?
The page fault code checks i_size early in processing after taking the vma
lock. If the fault is beyond i_size, the fault is aborted. If the fault
is not beyond i_size the fault will continue and a new page will be added
to the file. It could be that truncation code modifies i_size after the
check in fault code. That is OK, as truncation code will soon remove the
page. The truncation code will wait until the fault is finished, as it
must obtain the vma lock in write mode.
This patch cleans up/removes late checks in the fault paths that try to
back out pages racing with truncation. As noted above, we just let the
truncation code remove the pages.
[mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The new hugetlb vma lock is used to address this race:
Faulting thread Unsharing thread
... ...
ptep = huge_pte_offset()
or
ptep = huge_pte_alloc()
...
i_mmap_lock_write
lock page table
ptep invalid <------------------------ huge_pmd_unshare()
Could be in a previously unlock_page_table
sharing process or worse i_mmap_unlock_write
...
The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before
doing a page table lock and allocation (huge_pte_alloc). The lock is
held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
called.
Lock ordering issues come into play when unmapping a page from all
vmas mapping the page. The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare. This is done today in:
- try_to_migrate_one and try_to_unmap_ for page migration and memory
error handling. In these routines we 'try' to obtain the vma lock and
fail to unmap if unsuccessful. Calling routines already deal with the
failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch. This routine
also tries to acquire the vma lock. If it fails, it skips the
unmapping. However, we can not have file truncation or hole punch
fail because of contention. After hugetlb_vmdelete_list, truncation
and hole punch call remove_inode_hugepages. remove_inode_hugepages
checks for mapped pages and call hugetlb_unmap_file_page to unmap them.
hugetlb_unmap_file_page is designed to drop locks and reacquire in the
correct order to guarantee unmap success.
Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio. This is refactored code from hugetlb_vmdelete_list. It is
modified to do locking within the routine itself and check whether the
page is mapped within a specific vma before unmapping.
This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.
Link: https://lkml.kernel.org/r/20220914221810.95771-8-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Create the new routine remove_inode_single_folio that will remove a single
folio from a file. This is refactored code from remove_inode_hugepages.
It checks for the uncommon case in which the folio is still mapped and
unmaps.
No functional change. This refactoring will be put to use and expanded
upon in a subsequent patches.
Link: https://lkml.kernel.org/r/20220914221810.95771-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
remove_huge_page removes a hugetlb page from the page cache. Change to
hugetlb_delete_from_page_cache as it is a more descriptive name.
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages. For consistency and clarity, rename to hugetlb_add_to_page_cache.
Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing. However, this has been shown to cause
performance/scaling issues. Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.
Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.
NOTE: Reverting this code does expose the following race condition.
Faulting thread Unsharing thread
... ...
ptep = huge_pte_offset()
or
ptep = huge_pte_alloc()
...
i_mmap_lock_write
lock page table
ptep invalid <------------------------ huge_pmd_unshare()
Could be in a previously unlock_page_table
sharing process or worse i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)
It is unknown if the above race was ever experienced by a user. It was
discovered via code inspection when initially addressed.
In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.
Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", v2.
hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7. At that time, a proposal to
address the regression was suggested [3] but went nowhere.
The regression and benefit of this patch series is not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running,
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"
48 sample Avg
next-20220913 next-20220913 next-20220913
unmodified revert i_mmap_sema locking vma sema locking, this series
-----------------------------------------------------------------------------
498150 KB/s 501934 KB/s 504793 KB/s
The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings. To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory. The shared
mapping size was 250GB. For comparison, a single instance of the program
was run. Then, multiple instances were run in parallel to introduce
lock contention. Changing the locking scheme results in a significant
performance benefit.
test instances unmodified revert vma
--------------------------------------------------------------------------
faults per sec 1 393043 395680 389932
faults per sec 24 71405 81191 79048
forks per sec 1 2802 2747 2725
forks per sec 24 439 536 500
Combined faults 24 1621 68070 53662
Combined forks 24 358 67 142
Combined test is when running both faulting program and forking program
simultaneously.
Patches 1 and 2 of this series revert c0d0381ade and 87bf91d39b which
depends on c0d0381ade. Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share. With c0d0381ade reverted, this race is exposed:
Faulting thread Unsharing thread
... ...
ptep = huge_pte_offset()
or
ptep = huge_pte_alloc()
...
i_mmap_lock_write
lock page table
ptep invalid <------------------------ huge_pmd_unshare()
Could be in a previously unlock_page_table
sharing process or worse i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)
Reverting 87bf91d39b exposes races in page fault/file truncation. When
the new vma lock is put to use in patch 8, this will handle the fault/file
truncation races. This is explained in patch 9 where code associated with
these races is cleaned up.
Patches 3 - 5 restructure existing code in preparation for using the new
vma lock (rw semaphore) for pmd sharing synchronization. The idea is that
this semaphore will be held in read mode for the duration of fault
processing, and held in write mode for unmap operations which may call
huge_pmd_unshare. Acquiring i_mmap_rwsem is also still required to
synchronize huge pmd sharing. However it is only required in the fault
path when setting up sharing, and will be acquired in huge_pmd_share().
Patch 6 adds the new vma lock and all supporting routines, but does not
actually change code to use the new lock.
Patch 7 refactors code in preparation for using the new lock. And, patch
8 finally adds code to make use of this new vma lock. Unfortunately, the
fault code and truncate/hole punch code would naturally take locks in the
opposite order which could lead to deadlock. Since the performance of
page faults is more important, the truncation/hole punch code is modified
to back out and take locks in the correct order if necessary.
[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
This patch (of 9):
Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing. The use of i_mmap_rwsem to prevent
fault/truncate races depends on this. However, this has been shown to
cause performance/scaling issues. As a result, that code will be
reverted. Since the use i_mmap_rwsem to address page fault/truncate races
depends on this, it must also be reverted.
In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.
Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
NFSv4 mandates a change attribute to avoid problems with timestamp
granularity, which Linux implements using the i_version counter. This is
particularly important when the underlying filesystem is fast.
Give tmpfs an i_version counter. Since it doesn't have to be persistent,
we can just turn on SB_I_VERSION and sprinkle some inode_inc_iversion
calls in the right places.
Also, while there is no formal spec for xattrs, most implementations
update the ctime on setxattr. Fix shmem_xattr_handler_set to update the
ctime and bump the i_version appropriately.
Link: https://lkml.kernel.org/r/20220909130031.15477-1-jlayton@kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "ksm: count allocated rmap_items and update documentation",
v5.
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.
To determine how beneficial the ksm-policy (like madvise), they are using
brings, so we add a new interface /proc/<pid>/ksm_stat for each process
The value "ksm_rmap_items" in it indicates the total allocated ksm
rmap_items of this process.
The detailed description can be seen in the following patches' commit
message.
This patch (of 2):
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information. Some of these pages may be merged,
but some may not be abled to be merged after being checked several times,
which are unprofitable memory consumed.
The information about whether KSM save memory or consume memory in
system-wide range can be determined by the comprehensive calculation of
pages_sharing, pages_shared, pages_unshared and pages_volatile. A simple
approximate calculation:
profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
where all_rmap_items equals to the sum of pages_sharing, pages_shared,
pages_unshared and pages_volatile.
But we cannot calculate this kind of ksm profit inner single-process wide
because the information of ksm rmap_item's number of a process is lacked.
For user applications, if this kind of information could be obtained, it
helps upper users know how beneficial the ksm-policy (like madvise) they
are using brings, and then optimize their app code. For example, one
application madvise 1000 pages as MERGEABLE, while only a few pages are
really merged, then it's not cost-efficient.
So we add a new interface /proc/<pid>/ksm_stat for each process in which
the value of ksm_rmap_itmes is only shown now and so more values can be
added in future.
So similarly, we can calculate the ksm profit approximately for a single
process by:
profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
sizeof(rmap_item);
where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
ksm_rmap_items is shown in /proc/<pid>/ksm_stat.
Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Don't use the mm_struct linked list or the vma->vm_next in prep for
removal.
Link: https://lkml.kernel.org/r/20220906194824.2110408-45-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove references to mm_struct linked list and highest_vm_end for when
they are removed
Link: https://lkml.kernel.org/r/20220906194824.2110408-44-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the vma iterator instead of a for loop across the linked list. The
link list of vmas will be removed in this patch set.
Link: https://lkml.kernel.org/r/20220906194824.2110408-43-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove a use of the vm_next list by doing the initial lookup with the VMA
iterator and then using it to find the next entry.
Link: https://lkml.kernel.org/r/20220906194824.2110408-42-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the Maple Tree iterator instead. This is too complicated for the VMA
iterator to handle, so let's open-code it for now. If this turns out to
be a common pattern, we can migrate it to common code.
Link: https://lkml.kernel.org/r/20220906194824.2110408-41-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code. Remove the vmacache
to reduce the work in keeping it up to date and code complexity.
Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
These users of the rbtree should probably have been walks of the linked
list, but convert them to use walks of the maple tree.
Link: https://lkml.kernel.org/r/20220906194824.2110408-17-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages. A kill switch will be
added in the next patch to disable this behavior. When disabled, the
aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated. Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently. Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim. Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[8, 10]%
Ops/sec KB/sec
patch1-7: 1147696.57 44640.29
patch1-8: 1245274.91 48435.66
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
48.16% lzo1x_1_do_compress (real work)
8.20% page_vma_mapped_walk (overhead)
7.06% _raw_spin_unlock_irq
2.92% ptep_clear_flush
2.53% __zram_bvec_write
2.11% do_raw_spin_lock
2.02% memmove
1.93% lru_gen_look_around
1.56% free_unref_page_list
1.40% memset
patch1-8
49.44% lzo1x_1_do_compress (real work)
6.19% page_vma_mapped_walk (overhead)
5.97% _raw_spin_unlock_irq
3.13% get_pfn_folio
2.85% ptep_clear_flush
2.42% __zram_bvec_write
2.08% do_raw_spin_lock
1.92% memmove
1.44% alloc_zspage
1.36% memset
Configurations:
no change
Thanks to the following developers for their efforts [3].
kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim. These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking rendering
threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].
The next patch will address the "outlying refaults". Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction. The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then. This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We've got a bunch of special swap entries that stores PFN inside the swap
offset fields. To fetch the PFN, normally the user just calls
swp_offset() assuming that'll be the PFN.
Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
max possible length of a PFN on the host, meanwhile doing proper check
with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the
PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
One reason to do so is we never tried to sanitize whether swap offset can
really fit for storing PFN. At the meantime, this patch also prepares us
with the future possibility to store more information inside the swp
offset field, so assuming "swp_offset(entry)" to be the PFN will not stand
any more very soon.
Replace many of the swp_offset() callers to use swp_offset_pfn() where
proper. Note that many of the existing users are not candidates for the
replacement, e.g.:
(1) When the swap entry is not a pfn swap entry at all, or,
(2) when we wanna keep the whole swp_offset but only change the swp type.
For the latter, it can happen when fork() triggered on a write-migration
swap entry pte, we may want to only change the migration type from
write->read but keep the rest, so it's not "fetching PFN" but "changing
swap type only". They're left aside so that when there're more
information within the swp offset they'll be carried over naturally in
those cases.
Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
the new swp_offset_pfn() is about.
Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If CONFIG_SYSFS and CONFIG_SYSCTL are both undefined, hugetlb doesn't work
now as there's no way to set max huge pages. Make sure at least one of the
above configs is defined to make hugetlb works as expected.
Link: https://lkml.kernel.org/r/20220901120030.63318-11-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bh_submit_read() has no user anymore, just remove it.
Link: https://lkml.kernel.org/r/20220901133505.2510834-15-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bh_submit_read() and the uptodate check logic in bh_uptodate_or_lock()
has been integrated in bh_read() helper, so switch to use it directly.
Link: https://lkml.kernel.org/r/20220901133505.2510834-14-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that all ll_rw_block() users has been replaced to new safe helpers,
we just remove it here.
Link: https://lkml.kernel.org/r/20220901133505.2510834-13-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that submitting read IO if the buffer has been locked. We
could get false positive EIO after wait_on_buffer() if the buffer has
been locked by others. So stop using ll_rw_block() in ufs.
Link: https://lkml.kernel.org/r/20220901133505.2510834-12-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that submitting read IO if the buffer has been locked. We
could get false positive EIO after wait_on_buffer() if the buffer has
been locked by others. So stop using ll_rw_block(). We also switch to
new bh_readahead_batch() helper for the buffer array readahead path.
Link: https://lkml.kernel.org/r/20220901133505.2510834-11-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync read/write path because it cannot
guarantee that submitting read/write IO if the buffer has been locked.
We could get false positive EIO after wait_on_buffer() in read path if
the buffer has been locked by others. So stop using ll_rw_block() in
reiserfs. We also switch to new bh_readahead_batch() helper for the
buffer array readahead path.
Link: https://lkml.kernel.org/r/20220901133505.2510834-10-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that submitting read IO if the buffer has been locked. We
could get false positive EIO after wait_on_buffer() if the buffer has
been locked by others. So stop using ll_rw_block() in ocfs2.
Link: https://lkml.kernel.org/r/20220901133505.2510834-9-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that submitting read IO if the buffer has been locked. We
could get false positive EIO after wait_on_buffer() if the buffer has
been locked by others. So stop using ll_rw_block() in
ntfs_get_block_vbo().
Link: https://lkml.kernel.org/r/20220901133505.2510834-8-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that submitting read IO if the buffer has been locked. We
could get false positive EIO after wait_on_buffer() if the buffer has
been locked by others. So stop using ll_rw_block() in
journal_get_superblock(). We also switch to new bh_readahead_batch()
for the buffer array readahead path.
Link: https://lkml.kernel.org/r/20220901133505.2510834-7-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that submitting read IO if the buffer has been locked. We
could get false positive EIO return from zisofs_uncompress_block() if
he buffer has been locked by others. So stop using ll_rw_block(),
switch to sync helper instead.
Link: https://lkml.kernel.org/r/20220901133505.2510834-6-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that always submitting read IO if the buffer has been locked,
so stop using it. We also switch to new bh_readahead() helper for the
readahead path.
Link: https://lkml.kernel.org/r/20220901133505.2510834-5-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ll_rw_block() is not safe for the sync IO path because it skip buffers
which has been locked by others, it could lead to false positive EIO
when submitting read IO. So stop using ll_rw_block(), switch to use new
helpers which could guarantee buffer locked and submit IO if needed.
Link: https://lkml.kernel.org/r/20220901133505.2510834-4-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Current ll_rw_block() helper is fragile because it assumes that locked
buffer means it's under IO which is submitted by some other who holds
the lock, it skip buffer if it failed to get the lock, so it's only
safe on the readahead path. Unfortunately, now that most filesystems
still use this helper mistakenly on the sync metadata read path. There
is no guarantee that the one who holds the buffer lock always submit IO
(e.g. buffer_migrate_folio_norefs() after commit 88dbcbb3a4 ("blkdev:
avoid migration stalls for blkdev pages"), it could lead to false
positive -EIO when submitting reading IO.
This patch add some friendly buffer read helpers to prepare replacing
ll_rw_block() and similar calls. We can only call bh_readahead_[]
helpers for the readahead paths.
Link: https://lkml.kernel.org/r/20220901133505.2510834-3-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "fs/buffer: remove ll_rw_block()", v2.
ll_rw_block() will skip locked buffer before submitting IO, it assumes
that locked buffer means it is under IO. This assumption is not always
true because we cannot guarantee every buffer lock path would submit IO.
After commit 88dbcbb3a4 ("blkdev: avoid migration stalls for blkdev
pages"), buffer_migrate_folio_norefs() becomes one exceptional case, and
there may be others. So ll_rw_block() is not safe on the sync read path,
we could get false positive EIO return value when filesystem reading
metadata. It seems that it could be only used on the readahead path.
Unfortunately, many filesystem misuse the ll_rw_block() on the sync read
path. This patch set just remove ll_rw_block() and add new friendly
helpers, which could prevent false positive EIO on the read metadata path.
Thanks for the suggestion from Jan, the original discussion is at [1].
patch 1: remove unused helpers in fs/buffer.c
patch 2: add new bh_read_[*] helpers
patch 3-11: remove all ll_rw_block() calls in filesystems
patch 12-14: do some leftover cleanups.
[1]. https://lore.kernel.org/linux-mm/20220825080146.2021641-1-chengzhihao1@huawei.com/
This patch (of 14):
No one use __breadahead_gfp() and sb_breadahead_unmovable() any more,
remove them.
Link: https://lkml.kernel.org/r/20220901133505.2510834-1-yi.zhang@huawei.com
Link: https://lkml.kernel.org/r/20220901133505.2510834-2-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Heming Zhao <ocfs2-devel@oss.oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert to use folios throughout. This is in preparation for the removal
for find_get_pages_contig(). Now also supports large folios.
The initial version of this function set the page_address to be returned
after finishing all the checks. Since folio_batches have a maximum of 15
folios, the function had to be modified to support getting and checking up
to lpages, 15 pages at a time while still returning the initial page
address. Now the function sets ret as soon as the first batch arrives,
and updates it only if a check fails.
The physical adjacency check utilizes the page frame numbers. The page
frame number of each folio must be nr_pages away from the first folio.
Link: https://lkml.kernel.org/r/20220824004023.77310-7-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_contig(). Now also supports large folios.
Also clean up an unnecessary if statement - pvec.pages[0]->index > index
will always evaluate to false, and filemap_get_folios_contig() returns 0
if there is no folio found at index.
Link: https://lkml.kernel.org/r/20220824004023.77310-6-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Converted function to use folios throughout. This is in preparation for
the removal of find_get_pages_contig(). Now also supports large folios.
Since we may receive more than nr_pages pages, nr_pages may underflow.
Since nr_pages > 0 is equivalent to index <= end_index, we replaced it
with this check instead.
Also minor comment renaming for consistency in subpage.
Link: https://lkml.kernel.org/r/20220824004023.77310-5-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Converted function to use folios throughout. This is in preparation for
the removal of find_get_pages_contig(). Now also supports large folios.
Since we may receive more than nr_pages pages, nr_pages may underflow.
Since nr_pages > 0 is equivalent to index <= end_index, we replaced it
with this check instead.
Also this function does not care about the pages being contiguous so we
can just use filemap_get_folios() to be more efficient.
Link: https://lkml.kernel.org/r/20220824004023.77310-4-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert to use folios throughout. This is in preparation for the removal
of find_get_pages_contig(). Now also supports large folios.
Since we may receive more than nr_pages pages, nr_pages may underflow.
Since nr_pages > 0 is equivalent to index <= end_index, we replaced it
with this check instead.
Link: https://lkml.kernel.org/r/20220824004023.77310-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterb@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
[axelrasmussen@google.com: Handle misc_register() failure properly]
Link: https://lkml.kernel.org/r/20220819205201.658693-3-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220808175614.3885028-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
hugepage_vma_check() is the authority on determining if a VMA is eligible
for THP allocation/collapse, and currently enforces the sysfs THP
settings. Add a flag to disable these checks. For now, only apply this
arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled.
We can expand this to shmem, which uses
/sys/kernel/transparent_hugepage/shmem_enabled, later.
Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
VM_HUGEPAGE check in "madvise" THP mode. Prior to "mm: khugepaged: check
THP flag in hugepage_vma_check()", this check also didn't check "never"
THP mode. As such, this restores the previous behavior of
collapse_pte_mapped_thp() where sysfs THP settings are ignored. See
comment in code for justification why this is OK.
[1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The SB_BORN flag is stored in the vfs superblock, not xfs_sb.
Link: https://lkml.kernel.org/r/166153428094.2758201.7936572520826540019.stgit@dwillia2-xfh.jf.intel.com
Fixes: 6f643c57d5 ("xfs: implement ->notify_failure() for XFS")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm, xfs, dax: Fixes for memory_failure() handling".
I failed to run the memory error injection section of the ndctl test suite
on linux-next prior to the merge window and as a result some bugs were
missed. While the new enabling targeted reflink enabled XFS filesystems
the bugs cropped up in the surrounding cases of DAX error injection on
ext4-fsdax and device-dax.
One new assumption / clarification in this set is the notion that if a
filesystem's ->notify_failure() handler returns -EOPNOTSUPP, then it must
be the case that the fsdax usage of page->index and page->mapping are
valid. I am fairly certain this is true for xfs_dax_notify_failure(), but
would appreciate another set of eyes.
This patch (of 4):
XFS always registers dax_holder_operations regardless of whether the
filesystem is capable of handling the notifications. The expectation is
that if the notify_failure handler cannot run then there are no scenarios
where it needs to run. In other words the expected semantic is that
page->index and page->mapping are valid for memory_failure() when the
conditions that cause -EOPNOTSUPP in xfs_dax_notify_failure() are present.
A fallback to the generic memory_failure() path is expected so do not warn
when that happens.
Link: https://lkml.kernel.org/r/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com
Link: https://lkml.kernel.org/r/166153427440.2758201.6709480562966161512.stgit@dwillia2-xfh.jf.intel.com
Fixes: 6f643c57d5 ("xfs: implement ->notify_failure() for XFS")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cc:stable, addressing pre-6.0 issues.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYwvgrAAKCRDdBJ7gKXxA
jlweAQC9dzE08Elxl4F7Uvxe+62JWVeflBRrT7sJ6jU1Gu3QcQEAhhI1Xit3/MGq
pRytDBObGADxlA67c9eNq6J5pCT/7gE=
=pD67
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more hotfixes from Andrew Morton:
"Seventeen hotfixes. Mostly memory management things.
Ten patches are cc:stable, addressing pre-6.0 issues"
* tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
.mailmap: update Luca Ceresoli's e-mail address
mm/mprotect: only reference swap pfn page if type match
squashfs: don't call kmalloc in decompressors
mm/damon/dbgfs: avoid duplicate context directory creation
mailmap: update email address for Colin King
asm-generic: sections: refactor memory_intersects
bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
Revert "memcg: cleanup racy sum avoidance code"
mm/zsmalloc: do not attempt to free IS_ERR handle
binder_alloc: add missing mmap_lock calls when using the VMA
mm: re-allow pinning of zero pfns (again)
vmcoreinfo: add kallsyms_num_syms symbol
mailmap: update Guilherme G. Piccoli's email addresses
writeback: avoid use-after-free after removing device
shmem: update folio if shmem_replace_page() updates the page
mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
The decompressors may be called while in an atomic section. So move the
kmalloc() out of this path, and into the "page actor" init function.
This fixes a regression introduced by commit
f268eedddf ("squashfs: extend "page actor" to handle missing pages")
Link: https://lkml.kernel.org/r/20220822215430.15933-1-phillip@squashfs.org.uk
Fixes: f268eedddf ("squashfs: extend "page actor" to handle missing pages")
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit 0737e01de9 ("ocfs2: ocfs2_mount_volume does cleanup job
before return error"), any procedure after ocfs2_dlm_init() fails will
trigger crash when calling ocfs2_dlm_shutdown().
ie: On local mount mode, no dlm resource is initialized. If
ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling will call
ocfs2_dlm_shutdown(), then does dlm resource cleanup job, which will
trigger kernel crash.
This solution should bypass uninitialized resources in
ocfs2_dlm_shutdown().
Link: https://lkml.kernel.org/r/20220815085754.20417-1-heming.zhao@suse.com
Fixes: 0737e01de9 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When a disk is removed, bdi_unregister gets called to stop further
writeback and wait for associated delayed work to complete. However,
wb_inode_writeback_end() may schedule bandwidth estimation dwork after
this has completed, which can result in the timer attempting to access the
just freed bdi_writeback.
Fix this by checking if the bdi_writeback is alive, similar to when
scheduling writeback work.
Since this requires wb->work_lock, and wb_inode_writeback_end() may get
called from interrupt, switch wb->work_lock to an irqsafe lock.
Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
Fixes: 45a2966fd6 ("writeback: fix bandwidth estimate for spiky workload")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmMLY9oACgkQxWXV+ddt
WDue/w/8C3ZF8nLAI/sMrUpef2vSD62bvkKRRS45wzR2uod6yc0Fle9upzBssJQZ
qO3mQ53+QV+imCq7dY5mmtmwCUJNmbV5gbiMoF1OoV9TYtpZb/NIDklSX8se2eJX
drdAWQr2pYwU2M4duA4IEW08TvQ2TFh0JiqMi0aYM5apyL80uv3WniOu+xpRipA3
CMFAnDqayIgQ5OIsedqNy2MBLBopodUL5PZv/H7/g6KSKIuAZP9zgg1eKPfaz2t3
HO183ubmMbVtxgxeu+EnvCkg/iQ5hQiuGmyi0FLYMs/A6/NglwBnIJU5jCMQhcp6
HO5+FSUn6lHQetVzt2uHb9Lo+gX4FtCaHqVv1bXT62lnmDsZO1D7RVSg1Fra+CY+
jJmi8vvIbfbYlSZPZlJANoWe8ODOMVPk+pM4SFHlxOWGAY6HViX2RfHnIjNj5x9O
iDSTGvH6++nBF1Wu2/Xja/VKZ1avxRyTu2srW8JOF62j/tTU/EoPJcO9rxXOBBmC
Hi4UmJ690p3h5xZeeiyE8CmaSlPtfdCcnc/97FnusEjBao9O7THX0PCDVJX6VBkm
hVk01Z6+az1UNcD18KecvCpKYF/At4WpjaUGgf7q+LBfJXuXA6jfzOVDJMKV3TFd
n1yMFg+duGj90l8gT0aa/VQiBlUlnzQKz6ceqyKkPccwveNis6I=
=p8YV
-----END PGP SIGNATURE-----
Merge tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes:
- check that subvolume is writable when changing xattrs from security
namespace
- fix memory leak in device lookup helper
- update generation of hole file extent item when merging holes
- fix space cache corruption and potential double allocations; this
is a rare bug but can be serious once it happens, stable backports
and analysis tool will be provided
- fix error handling when deleting root references
- fix crash due to assert when attempting to cancel suspended device
replace, add message what to do if mount fails due to missing
replace item
Regressions:
- don't merge pages into bio if their page offset is not contiguous
- don't allow large NOWAIT direct reads, this could lead to short
reads eg. in io_uring"
* tag 'for-6.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add info when mount fails due to stale replace target
btrfs: replace: drop assert for suspended replace
btrfs: fix silent failure when deleting root reference
btrfs: fix space cache corruption and potential double allocations
btrfs: don't allow large NOWAIT direct reads
btrfs: don't merge pages into bio if their page offset is not contiguous
btrfs: update generation of hole file extent item when merging holes
btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
btrfs: check if root is readonly while setting security xattr