linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-23 12:42:02 +00:00

A mirror of the official Linux kernel repository just in case


Mike Kravetz c0d0381ade hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues.

To effectively use i_mmap_rwsem to address the above issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do.

To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching all the races. After catching a race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome.

[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

This patch (of 2):

While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd.

Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:

- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today. Therefore, we use this alternative locking order for PageHuge() pages.

    mapping->i_mmap_rwsem
      hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
        page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write lock the i_mmap_rwsem associated with a page.

In most cases it is easy to get address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get address_space in these cases.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2020-04-02 09:35:32 -07:00
arch	mm/sparse: rename pfn_present() to pfn_in_present_section()	2020-04-02 09:35:30 -07:00
block	Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2020-03-30 16:13:08 -07:00
certs	certs: Add wrapper function to check blacklisted binary hash	2019-11-12 12:25:50 +11:00
crypto	Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity	2020-02-20 15:15:16 -08:00
Documentation	mm/compaction: Disable compact_unevictable_allowed on RT	2020-04-02 09:35:31 -07:00
drivers	mm/sparse: rename pfn_present() to pfn_in_present_section()	2020-04-02 09:35:30 -07:00
fs	hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization	2020-04-02 09:35:32 -07:00
include	hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization	2020-04-02 09:35:32 -07:00
init	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-03-31 17:29:33 -07:00
ipc	Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"	2020-02-21 11:22:15 -08:00
kernel	mm/compaction: Disable compact_unevictable_allowed on RT	2020-04-02 09:35:31 -07:00
lib	kasan: add test for invalid size in memmove	2020-04-02 09:35:30 -07:00
LICENSES	LICENSES: Rename other to deprecated	2019-05-03 06:34:32 -06:00
mm	hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization	2020-04-02 09:35:32 -07:00
net	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-03-31 17:29:33 -07:00
samples	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-03-31 17:29:33 -07:00
scripts	scripts/spelling.txt: add more spellings to spelling.txt	2020-04-02 09:35:25 -07:00
security	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-03-31 17:29:33 -07:00
sound	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-03-31 17:29:33 -07:00
tools	selftests: vm: drop dependencies on page flags from mlock2 tests	2020-04-02 09:35:31 -07:00
usr	initramfs: restore default compression behavior	2020-03-17 09:50:37 +09:00
virt	irqchip/gic-v4.1: Move doorbell management to the GICv4 abstraction layer	2020-03-24 12:15:51 +00:00
.clang-format	clang-format: Update with the latest for_each macro list	2020-03-06 21:50:05 +01:00
.cocciconfig
.get_maintainer.ignore	Opt out of scripts/get_maintainer.pl	2019-05-16 10:53:40 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	selftest/lkdtm: Use local .gitignore	2020-03-02 08:39:39 -07:00
.mailmap	media updates for v5.7-rc1	2020-03-30 13:42:05 -07:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: Hand MIPS over to Thomas	2020-02-24 22:43:18 -08:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	docs: kbuild: convert docs to ReST and rename to *.rst	2019-06-14 14:21:21 -06:00
MAINTAINERS	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next	2020-03-31 17:29:33 -07:00
Makefile	Kbuild updates for v5.7	2020-03-31 16:03:39 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.