powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault race

MADV_DONTNEED holds mmap_sem in read mode and that implies a
parallel page fault is possible and the kernel can end up with a level 1 PTE
entry (THP entry) converted to a level 0 PTE entry without flushing
the THP TLB entry.

Most architectures including POWER have issues with kernel instantiating a level
0 PTE entry while holding level 1 TLB entries.

The code sequence I am looking at is

down_read(mmap_sem)                         down_read(mmap_sem)

zap_pmd_range()
 zap_huge_pmd()
  pmd lock held
  pmd_cleared
  table details added to mmu_gather
  pmd_unlock()
                                         insert a level 0 PTE entry()

tlb_finish_mmu().

Fix this by forcing a tlb flush before releasing pmd lock if this is
not a fullmm invalidate. We can safely skip this invalidate for
task exit case (fullmm invalidate) because in that case we are sure
there can be no parallel fault handlers.

This do change the Qemu guest RAM del/unplug time as below

128 core, 496GB guest:

Without patch:
munmap start: timer = 196449 ms, PID=6681
munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms

With patch:
munmap start: timer = 196345 ms, PID=6879
munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-23-aneesh.kumar@linux.ibm.com
This commit is contained in:
Aneesh Kumar K.V 2020-05-05 12:47:29 +05:30 committed by Michael Ellerman
parent e21dfbf013
commit 75358ea359
2 changed files with 23 additions and 0 deletions
arch/powerpc
include/asm/book3s/64
mm/book3s64

View File

@ -1265,6 +1265,11 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
} }
#define pmdp_collapse_flush pmdp_collapse_flush #define pmdp_collapse_flush pmdp_collapse_flush
#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL
pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmdp, int full);
#define __HAVE_ARCH_PGTABLE_DEPOSIT #define __HAVE_ARCH_PGTABLE_DEPOSIT
static inline void pgtable_trans_huge_deposit(struct mm_struct *mm, static inline void pgtable_trans_huge_deposit(struct mm_struct *mm,
pmd_t *pmdp, pgtable_t pgtable) pmd_t *pmdp, pgtable_t pgtable)

View File

@ -112,6 +112,24 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
return __pmd(old_pmd); return __pmd(old_pmd);
} }
pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp, int full)
{
pmd_t pmd;
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
!pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
/*
* if it not a fullmm flush, then we can possibly end up converting
* this PMD pte entry to a regular level 0 PTE by a parallel page fault.
* Make sure we flush the tlb in this case.
*/
if (!full)
flush_pmd_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
return pmd;
}
static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot) static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
{ {
return __pmd(pmd_val(pmd) | pgprot_val(pgprot)); return __pmd(pmd_val(pmd) | pgprot_val(pgprot));