c55b51a06b
13992 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
55c0ece82a |
mm/hmm: add a helper function that fault pages and map them to a device
This is a all in one helper that fault pages in a range and map them to a device so that every single device driver do not have to re-implement this common pattern. This is taken from ODP RDMA in preparation of ODP RDMA convertion. It will be use by nouveau and other drivers. [jglisse@redhat.com: Was using wrong field and wrong enum] Link: http://lkml.kernel.org/r/20190409175340.26614-1-jglisse@redhat.com Link: http://lkml.kernel.org/r/20190403193318.16478-12-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
992de9a8b7 |
mm/hmm: allow to mirror vma of a file on a DAX backed filesystem
HMM mirror is a device driver helpers to mirror range of virtual address. It means that the process jobs running on the device can access the same virtual address as the CPU threads of that process. This patch adds support for mirroring mapping of file that are on a DAX block device (ie range of virtual address that is an mmap of a file in a filesystem on a DAX block device). There is no reason to not support such case when mirroring virtual address on a device. Note that unlike GUP code we do not take page reference hence when we back-off we have nothing to undo. [jglisse@redhat.com: move THP and hugetlbfs code path behind #if KCONFIG] Link: http://lkml.kernel.org/r/20190422163741.13029-1-jglisse@redhat.com Link: http://lkml.kernel.org/r/20190403193318.16478-10-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
63d5066f6e |
mm/hmm: mirror hugetlbfs (snapshoting, faulting and DMA mapping)
HMM mirror is a device driver helpers to mirror range of virtual address. It means that the process jobs running on the device can access the same virtual address as the CPU threads of that process. This patch adds support for hugetlbfs mapping (ie range of virtual address that are mmap of a hugetlbfs). [rcampbell@nvidia.com: fix initial PFN for hugetlbfs pages] Link: http://lkml.kernel.org/r/20190419233536.8080-1-rcampbell@nvidia.com Link: http://lkml.kernel.org/r/20190403193318.16478-9-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
023a019a9b |
mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays
The HMM mirror API can be use in two fashions. The first one where the HMM user coalesce multiple page faults into one request and set flags per pfns for of those faults. The second one where the HMM user want to pre-fault a range with specific flags. For the latter one it is a waste to have the user pre-fill the pfn arrays with a default flags value. This patch adds a default flags value allowing user to set them for a range without having to pre-fill the pfn array. Link: http://lkml.kernel.org/r/20190403193318.16478-8-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a3e0d41c2b |
mm/hmm: improve driver API to work and wait over a range
A common use case for HMM mirror is user trying to mirror a range and before they could program the hardware it get invalidated by some core mm event. Instead of having user re-try right away to mirror the range provide a completion mechanism for them to wait for any active invalidation affecting the range. This also changes how hmm_range_snapshot() and hmm_range_fault() works by not relying on vma so that we can drop the mmap_sem when waiting and lookup the vma again on retry. Link: http://lkml.kernel.org/r/20190403193318.16478-7-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
73231612dc |
mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()
Minor optimization around hmm_pte_need_fault(). Rename for consistency between code, comments and documentation. Also improves the comments on all the possible returns values. Improve the function by returning the number of populated entries in pfns array. Link: http://lkml.kernel.org/r/20190403193318.16478-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
25f23a0c71 |
mm/hmm: improve and rename hmm_vma_get_pfns() to hmm_range_snapshot()
Rename for consistency between code, comments and documentation. Also improves the comments on all the possible returns values. Improve the function by returning the number of populated entries in pfns array. Link: http://lkml.kernel.org/r/20190403193318.16478-5-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9f454612f6 |
mm/hmm: do not erase snapshot when a range is invalidated
Users of HMM might be using the snapshot information to do preparatory step like dma mapping pages to a device before checking for invalidation through hmm_vma_range_done() so do not erase that information and assume users will do the right thing. Link: http://lkml.kernel.org/r/20190403193318.16478-4-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
704f3f2cf6 |
mm/hmm: use reference counting for HMM struct
Every time I read the code to check that the HMM structure does not vanish before it should thanks to the many lock protecting its removal i get a headache. Switch to reference counting instead it is much easier to follow and harder to break. This also remove some code that is no longer needed with refcounting. Link: http://lkml.kernel.org/r/20190403193318.16478-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
734fb89968 |
mm/hmm: select mmu notifier when selecting HMM
To avoid random config build issue, select mmu notifier when HMM is selected. In any cases when HMM get selected it will be by users that will also wants the mmu notifier. Link: http://lkml.kernel.org/r/20190403193318.16478-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1b426bac66 |
hugetlb: use same fault hash key for shared and private mappings
hugetlb uses a fault mutex hash table to prevent page faults of the same pages concurrently. The key for shared and private mappings is different. Shared keys off address_space and file index. Private keys off mm and virtual address. Consider a private mappings of a populated hugetlbfs file. A fault will map the page from the file and if needed do a COW to map a writable page. Hugetlbfs hole punch uses the fault mutex to prevent mappings of file pages. It uses the address_space file index key. However, private mappings will use a different key and could race with this code to map the file page. This causes problems (BUG) for the page cache remove code as it expects the page to be unmapped. A sample stack is: page dumped because: VM_BUG_ON_PAGE(page_mapped(page)) kernel BUG at mm/filemap.c:169! ... RIP: 0010:unaccount_page_cache_page+0x1b8/0x200 ... Call Trace: __delete_from_page_cache+0x39/0x220 delete_from_page_cache+0x45/0x70 remove_inode_hugepages+0x13c/0x380 ? __add_to_page_cache_locked+0x162/0x380 hugetlbfs_fallocate+0x403/0x540 ? _cond_resched+0x15/0x30 ? __inode_security_revalidate+0x5d/0x70 ? selinux_file_permission+0x100/0x130 vfs_fallocate+0x13f/0x270 ksys_fallocate+0x3c/0x80 __x64_sys_fallocate+0x1a/0x20 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 There seems to be another potential COW issue/race with this approach of different private and shared keys as noted in commit |
||
|
0919e1b69a |
hugetlbfs: on restore reserve error path retain subpool reservation
When a huge page is allocated, PagePrivate() is set if the allocation consumed a reservation. When freeing a huge page, PagePrivate is checked. If set, it indicates the reservation should be restored. PagePrivate being set at free huge page time mostly happens on error paths. When huge page reservations are created, a check is made to determine if the mapping is associated with an explicitly mounted filesystem. If so, pages are also reserved within the filesystem. The default action when freeing a huge page is to decrement the usage count in any associated explicitly mounted filesystem. However, if the reservation is to be restored the reservation/use count within the filesystem should not be decrementd. Otherwise, a subsequent page allocation and free for the same mapping location will cause the file filesystem usage to go 'negative'. Filesystem Size Used Avail Use% Mounted on nodev 4.0G -4.0M 4.1G - /opt/hugepool To fix, when freeing a huge page do not adjust filesystem usage if PagePrivate() is set to indicate the reservation should be restored. I did not cc stable as the problem has been around since reserves were added to hugetlbfs and nobody has noticed. Link: http://lkml.kernel.org/r/20190328234704.27083-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7567cfc5da |
mm/sparse.c: clean up obsolete code comment
The code comment above sparse_add_one_section() is obsolete and incorrect. Clean it up and write a new one. Link: http://lkml.kernel.org/r/20190329144250.14315-1-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
dae966dc8f |
mm/swap.c: __pagevec_lru_add_fn: typo fix
There is no function named munlock_vma_pages(). Correct it to munlock_vma_page(). Link: http://lkml.kernel.org/r/20190402095609.27181-1-peng.fan@nxp.com Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2d0adf7e0d |
mm/hugetlb: get rid of NODEMASK_ALLOC
NODEMASK_ALLOC is used to allocate a nodemask bitmap, and it does it by first determining whether it should be allocated on the stack or dynamically, depending on NODES_SHIFT. Right now, it goes the dynamic path whenever the nodemask_t is above 32 bytes. Although we could bump it to a reasonable value, the largest a nodemask_t can get is 128 bytes, so since __nr_hugepages_store_common is called from a rather short stack we can just get rid of the NODEMASK_ALLOC call here. This reduces some code churn and complexity. Link: http://lkml.kernel.org/r/20190402133415.21983-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Alex Ghiti <alex@ghiti.fr> Cc: David Rientjes <rientjes@google.com> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
fd875dca7c |
hugetlbfs: fix potential over/underflow setting node specific nr_hugepages
The number of node specific huge pages can be set via a file such as: /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages When a node specific value is specified, the global number of huge pages must also be adjusted. This adjustment is calculated as the specified node specific value + (global value - current node value). If the node specific value provided by the user is large enough, this calculation could overflow an unsigned long leading to a smaller than expected number of huge pages. To fix, check the calculation for overflow. If overflow is detected, use ULONG_MAX as the requested value. This is inline with the user request to allocate as many huge pages as possible. It was also noticed that the above calculation was done outside the hugetlb_lock. Therefore, the values could be inconsistent and result in underflow. To fix, the calculation is moved within the routine set_max_huge_pages() where the lock is held. In addition, the code in __nr_hugepages_store_common() which tries to handle the case of not being able to allocate a node mask would likely result in incorrect behavior. Luckily, it is very unlikely we will ever take this path. If we do, simply return ENOMEM. Link: http://lkml.kernel.org/r/20190328220533.19884-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Alex Ghiti <alex@ghiti.fr> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
299c83dce9 |
mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE
|
||
|
3481c37ffa |
mm/vmscan: drop may_writepage and classzone_idx from direct reclaim begin template
There are three tracepoints using this template, which are mm_vmscan_direct_reclaim_begin, mm_vmscan_memcg_reclaim_begin, mm_vmscan_memcg_softlimit_reclaim_begin. Regarding mm_vmscan_direct_reclaim_begin, sc.may_writepage is !laptop_mode, that's a static setting, and reclaim_idx is derived from gfp_mask which is already show in this tracepoint. Regarding mm_vmscan_memcg_reclaim_begin, may_writepage is !laptop_mode too, and reclaim_idx is (MAX_NR_ZONES-1), which are both static value. mm_vmscan_memcg_softlimit_reclaim_begin is the same with mm_vmscan_memcg_reclaim_begin. So we can drop them all. Link: http://lkml.kernel.org/r/1553736322-32235-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
024eee0e83 |
mm: page_mkclean vs MADV_DONTNEED race
MADV_DONTNEED is handled with mmap_sem taken in read mode. We call page_mkclean without holding mmap_sem. MADV_DONTNEED implies that pages in the region are unmapped and subsequent access to the pages in that range is handled as a new page fault. This implies that if we don't have parallel access to the region when MADV_DONTNEED is run we expect those range to be unallocated. w.r.t page_mkclean() we need to make sure that we don't break the MADV_DONTNEED semantics. MADV_DONTNEED check for pmd_none without holding pmd_lock. This implies we skip the pmd if we temporarily mark pmd none. Avoid doing that while marking the page clean. Keep the sequence same for dax too even though we don't support MADV_DONTNEED for dax mapping The bug was noticed by code review and I didn't observe any failures w.r.t test run. This is similar to commit |
||
|
fc1d8e7cca |
mm: introduce put_user_page*(), placeholder versions
A discussion of the overall problem is below. As mentioned in patch 0001, the steps are to fix the problem are: 1) Provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*(). 2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time. 3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting. 4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. Overview ======== Some kernel components (file systems, device drivers) need to access memory that is specified via process virtual address. For a long time, the API to achieve that was get_user_pages ("GUP") and its variations. However, GUP has critical limitations that have been overlooked; in particular, GUP does not interact correctly with filesystems in all situations. That means that file-backed memory + GUP is a recipe for potential problems, some of which have already occurred in the field. GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code to get the struct page behind a virtual address and to let storage hardware perform a direct copy to or from that page. This is a short-lived access pattern, and as such, the window for a concurrent writeback of GUP'd page was small enough that there were not (we think) any reported problems. Also, userspace was expected to understand and accept that Direct IO was not synchronized with memory-mapped access to that data, nor with any process address space changes such as munmap(), mremap(), etc. Over the years, more GUP uses have appeared (virtualization, device drivers, RDMA) that can keep the pages they get via GUP for a long period of time (seconds, minutes, hours, days, ...). This long-term pinning makes an underlying design problem more obvious. In fact, there are a number of key problems inherent to GUP: Interactions with file systems ============================== File systems expect to be able to write back data, both to reclaim pages, and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain write access to the file memory pages means that such hardware can dirty the pages, without the filesystem being aware. This can, in some cases (depending on filesystem, filesystem options, block device, block device options, and other variables), lead to data corruption, and also to kernel bugs of the form: kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899! backtrace: ext4_writepage __writepage write_cache_pages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ...which is due to the file system asserting that there are still buffer heads attached: ({ \ BUG_ON(!PagePrivate(page)); \ ((struct buffer_head *)page_private(page)); \ }) Dave Chinner's description of this is very clear: "The fundamental issue is that ->page_mkwrite must be called on every write access to a clean file backed page, not just the first one. How long the GUP reference lasts is irrelevant, if the page is clean and you need to dirty it, you must call ->page_mkwrite before it is marked writeable and dirtied. Every. Time." This is just one symptom of the larger design problem: real filesystems that actually write to a backing device, do not actually support get_user_pages() being called on their pages, and letting hardware write directly to those pages--even though that pattern has been going on since about 2005 or so. Long term GUP ============= Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a writeable mapping is created), and the pages are file-backed. That can lead to filesystem corruption. What happens is that when a file-backed page is being written back, it is first mapped read-only in all of the CPU page tables; the file system then assumes that nobody can write to the page, and that the page content is therefore stable. Unfortunately, the GUP callers generally do not monitor changes to the CPU pages tables; they instead assume that the following pattern is safe (it's not): get_user_pages() Hardware can keep a reference to those pages for a very long time, and write to it at any time. Because "hardware" here means "devices that are not a CPU", this activity occurs without any interaction with the kernel's file system code. for each page set_page_dirty put_page() In fact, the GUP documentation even recommends that pattern. Anyway, the file system assumes that the page is stable (nothing is writing to the page), and that is a problem: stable page content is necessary for many filesystem actions during writeback, such as checksum, encryption, RAID striping, etc. Furthermore, filesystem features like COW (copy on write) or snapshot also rely on being able to use a new page for as memory for that memory range inside the file. Corruption during write back is clearly possible here. To solve that, one idea is to identify pages that have active GUP, so that we can use a bounce page to write stable data to the filesystem. The filesystem would work on the bounce page, while any of the active GUP might write to the original page. This would avoid the stable page violation problem, but note that it is only part of the overall solution, because other problems remain. Other filesystem features that need to replace the page with a new one can be inhibited for pages that are GUP-pinned. This will, however, alter and limit some of those filesystem features. The only fix for that would be to require GUP users to monitor and respond to CPU page table updates. Subsystems such as ODP and HMM do this, for example. This aspect of the problem is still under discussion. Direct IO ========= Direct IO can cause corruption, if userspace does Direct-IO that writes to a range of virtual addresses that are mmap'd to a file. The pages written to are file-backed pages that can be under write back, while the Direct IO is taking place. Here, Direct IO races with a write back: it calls GUP before page_mkclean() has replaced the CPU pte with a read-only entry. The race window is pretty small, which is probably why years have gone by before we noticed this problem: Direct IO is generally very quick, and tends to finish up before the filesystem gets around to do anything with the page contents. However, it's still a real problem. The solution is to never let GUP return pages that are under write back, but instead, force GUP to take a write fault on those pages. That way, GUP will properly synchronize with the active write back. This does not change the required GUP behavior, it just avoids that race. Details ======= Introduces put_user_page(), which simply calls put_page(). This provides a way to update all get_user_pages*() callers, so that they call put_user_page(), instead of put_page(). Also introduces put_user_pages(), and a few dirty/locked variations, as a replacement for release_pages(), and also as a replacement for open-coded loops that release multiple pages. These may be used for subsequent performance improvements, via batching of pages to be released. This is the first step of fixing a problem (also described in [1] and [2]) with interactions between get_user_pages ("gup") and filesystems. Problem description: let's start with a bug report. Below, is what happens sometimes, under memory pressure, when a driver pins some pages via gup, and then marks those pages dirty, and releases them. Note that the gup documentation actually recommends that pattern. The problem is that the filesystem may do a writeback while the pages were gup-pinned, and then the filesystem believes that the pages are clean. So, when the driver later marks the pages as dirty, that conflicts with the filesystem's page tracking and results in a BUG(), like this one that I experienced: kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899! backtrace: ext4_writepage __writepage write_cache_pages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ...which is due to the file system asserting that there are still buffer heads attached: ({ \ BUG_ON(!PagePrivate(page)); \ ((struct buffer_head *)page_private(page)); \ }) Dave Chinner's description of this is very clear: "The fundamental issue is that ->page_mkwrite must be called on every write access to a clean file backed page, not just the first one. How long the GUP reference lasts is irrelevant, if the page is clean and you need to dirty it, you must call ->page_mkwrite before it is marked writeable and dirtied. Every. Time." This is just one symptom of the larger design problem: real filesystems that actually write to a backing device, do not actually support get_user_pages() being called on their pages, and letting hardware write directly to those pages--even though that pattern has been going on since about 2005 or so. The steps are to fix it are: 1) (This patch): provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*(). 2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time. 3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting. 4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()" [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [docs] Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Christoph Lameter <cl@linux.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4eb0716e86 |
hugetlb: allow to free gigantic pages regardless of the configuration
On systems without CONTIG_ALLOC activated but that support gigantic pages, boottime reserved gigantic pages can not be freed at all. This patch simply enables the possibility to hand back those pages to memory allocator. Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: David S. Miller <davem@davemloft.net> [sparc] Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
8df995f6bd |
mm: simplify MEMORY_ISOLATION && COMPACTION || CMA into CONTIG_ALLOC
This condition allows to define alloc_contig_range, so simplify it into a more accurate naming. Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1df3a33907 |
mm/cma.c: fix crash on CMA allocation if bitmap allocation fails
|
||
|
113b7dfd82 |
mm: memcontrol: quarantine the mem_cgroup_[node_]nr_lru_pages() API
Only memcg_numa_stat_show() uses those wrappers and the lru bitmasks, group them together. Link: http://lkml.kernel.org/r/20190228163020.24100-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
21d89d151b |
mm: memcontrol: push down mem_cgroup_nr_lru_pages()
mem_cgroup_nr_lru_pages() is just a convenience wrapper around memcg_page_state() that takes bitmasks of lru indexes and aggregates the counts for those. Replace callsites where the bitmask is simple enough with direct memcg_page_state() call(s). Link: http://lkml.kernel.org/r/20190228163020.24100-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2b487e59f0 |
mm: memcontrol: push down mem_cgroup_node_nr_lru_pages()
mem_cgroup_node_nr_lru_pages() is just a convenience wrapper around lruvec_page_state() that takes bitmasks of lru indexes and aggregates the counts for those. Replace callsites where the bitmask is simple enough with direct lruvec_page_state() calls. This removes the last extern user of mem_cgroup_node_nr_lru_pages(), so make that function private again, too. Link: http://lkml.kernel.org/r/20190228163020.24100-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
22796c844f |
mm: memcontrol: replace node summing with memcg_page_state()
Instead of adding up the node counters, use memcg_page_state() to get the memcg state directly. This is a bit cheaper and more stream-lined. Link: http://lkml.kernel.org/r/20190228163020.24100-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1a61ab8038 |
mm: memcontrol: replace zone summing with lruvec_page_state()
Instead of adding up the zone counters, use lruvec_page_state() to get the node state directly. This is a bit cheaper and more stream-lined. Link: http://lkml.kernel.org/r/20190228163020.24100-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
132bb8cfc9 |
mm/vmscan: add tracepoints for node reclaim
The page alloc fast path it may perform node reclaim, which may cause a latency spike. We should add tracepoint for this event, and also measure the latency it causes. So bellow two tracepoints are introduced, mm_vmscan_node_reclaim_begin mm_vmscan_node_reclaim_end Link: http://lkml.kernel.org/r/1551421452-5385-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <shaoyafang@didiglobal.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5e65af19e8 |
mm/page_isolation.c: remove redundant pfn_valid_within() in __first_valid_page()
pfn_valid_within() calls pfn_valid() when CONFIG_HOLES_IN_ZONE making it redundant for both definitions (w/wo CONFIG_MEMORY_HOTPLUG) of the helper pfn_to_online_page() which either calls pfn_valid() or pfn_valid_within(). pfn_valid_within() being 1 when !CONFIG_HOLES_IN_ZONE is irrelevant either way. This does not change functionality. Link: http://lkml.kernel.org/r/1553141595-26907-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2b59e01a3a |
mm/cma.c: fix the bitmap status to show failed allocation reason
Currently one bit in cma bitmap represents number of pages rather than one page, cma->count means cma size in pages. So to find available pages via find_next_zero_bit()/find_next_bit() we should use cma size not in pages but in bits although current free pages number is correct due to zero value of order_per_bit. Once order_per_bit is changed the bitmap status will be incorrect. The size input in cma_debug_show_areas() is not correct. It will affect the available pages at some position to debug the failure issue. This is an example with order_per_bit = 1 Before this change: [ 4.120060] cma: number of available pages: 1@93+4@108+7@121+7@137+7@153+7@169+7@185+7@201+3@213+3@221+3@229+3@237+3@245+3@253+3@261+3@269+3@277+3@285+3@293+3@301+3@309+3@317+3@325+19@333+15@369+512@512=> 638 free of 1024 total pages After this change: [ 4.143234] cma: number of available pages: 2@93+8@108+14@121+14@137+14@153+14@169+14@185+14@201+6@213+6@221+6@229+6@237+6@245+6@253+6@261+6@269+6@277+6@285+6@293+6@301+6@309+6@317+6@325+38@333+30@369=> 252 free of 1024 total pages Obviously the bitmap status before is incorrect. Link: http://lkml.kernel.org/r/20190320060829.9144-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
dd7ef7bd14 |
mm/compaction.c: fix an undefined behaviour
In a low-memory situation, cc->fast_search_fail can keep increasing as it
is unable to find an available page to isolate in
fast_isolate_freepages(). As the result, it could trigger an error below,
so just compare with the maximum bits can be shifted first.
UBSAN: Undefined behaviour in mm/compaction.c:1160:30
shift exponent 64 is too large for 64-bit type 'unsigned long'
CPU: 131 PID: 1308 Comm: kcompactd1 Kdump: loaded Tainted: G
W L 5.0.0+ #17
Call trace:
dump_backtrace+0x0/0x450
show_stack+0x20/0x2c
dump_stack+0xc8/0x14c
__ubsan_handle_shift_out_of_bounds+0x7e8/0x8c4
compaction_alloc+0x2344/0x2484
unmap_and_move+0xdc/0x1dbc
migrate_pages+0x274/0x1310
compact_zone+0x26ec/0x43bc
kcompactd+0x15b8/0x1a24
kthread+0x374/0x390
ret_from_fork+0x10/0x18
[akpm@linux-foundation.org: code cleanup]
Link: http://lkml.kernel.org/r/20190320203338.53367-1-cai@lca.pw
Fixes:
|
||
|
d3ba3ae197 |
mm/memory_hotplug.c: fix the wrong usage of N_HIGH_MEMORY
In node_states_check_changes_online(), N_HIGH_MEMORY is used to substitute
ZONE_HIGHMEM directly. This is not right. N_HIGH_MEMORY is to mark the
memory state of node. Here zone index is checked, which should be
compared with 'ZONE_HIGHMEM' accordingly.
Replace it with ZONE_HIGHMEM.
This is a code cleanup - no known runtime effects.
Link: http://lkml.kernel.org/r/20190320080732.14933-1-bhe@redhat.com
Fixes:
|
||
|
39186cbe65 |
mm,memory_hotplug: drop redundant hugepage_migration_supported check
has_unmovable_pages() already checks whether the hugetlb page supports migration, so all non-migratable hugetlb pages should have been caught there. Let us drop the check from scan_movable_pages() as is redundant. Link: http://lkml.kernel.org/r/20190320152658.10855-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
10eeadf304 |
mm,memory_hotplug: unlock 1GB-hugetlb on x86_64
On x86_64, 1GB-hugetlb pages could never be offlined due to the fact that hugepage_migration_supported() returned false for PUD_SHIFT. So whenever we wanted to offline a memblock containing a gigantic hugetlb page, we never got beyond has_unmovable_pages() check. This changed with [1], where now we also return true for PUD_SHIFT. After that patch, the check in has_unmovable_pages() and scan_movable_pages() returned true, but we still had a final barrier in do_migrate_range(): if (compound_order(head) > PFN_SECTION_SHIFT) { ret = -EBUSY; break; } This is not really nice, and we do not really need it. It is perfectly possible to migrate a gigantic page as long as another node has a spare gigantic page for us. In alloc_huge_page_nodemask(), we calculate the __real__ number of free pages, and if any, we try to dequeue one from another node. This all works fine when we do have another node with a spare gigantic page, but if that is not the case, alloc_huge_page_nodemask() ends up calling alloc_migrate_huge_page() which bails out if the wanted page is gigantic. That is mainly because finding a 1GB (or even 16GB on powerpc) contiguous memory is quite unlikely when the system has been running for a while. In that situation, we will keep looping forever because scan_movable_pages() will give us the same page and we will fail again because there is no node where we can dequeue a gigantic page from. This is not nice, and it has been raised that we might want to treat -ENOMEM as a fatal error in do_migrate_range(), but this has to be checked further. Anyway, I would tend say that this is the administrator's job, to make sure that the system can keep up with the memory to be offlined, so that would mean that if we want to use gigantic pages, make sure that the other nodes have at least enough gigantic pages to keep up in case we need to offline memory. Just for the sake of completeness, this is one of the tests done: # echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages # echo 1 > /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages 1 # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages 1 # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/nr_hugepages 1 # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages 1 (hugetlb1gb is a program that maps 1GB region using MAP_HUGE_1GB) # numactl -m 1 ./hugetlb1gb # cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages 0 # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages 1 # offline node1 memory # cat /sys/devices/system/node/node2/hugepages/hugepages-1048576kB/free_hugepages 0 [1] https://lore.kernel.org/patchwork/patch/998796/ Link: http://lkml.kernel.org/r/20190320152658.10855-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7af75561e1 |
mm/gup: add FOLL_LONGTERM capability to GUP fast
DAX pages were previously unprotected from longterm pins when users called get_user_pages_fast(). Use the new FOLL_LONGTERM flag to check for DEVMAP pages and fall back to regular GUP processing if a DEVMAP page is encountered. [ira.weiny@intel.com: v3] Link: http://lkml.kernel.org/r/20190328084422.29911-5-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190328084422.29911-5-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-5-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
73b0140bf0 |
mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the singular write parameter to be gup_flags. This patch does not change any functionality. New functionality will follow in subsequent patches. Some of the get_user_pages_fast() call sites were unchanged because they already passed FOLL_WRITE or 0 for the write parameter. NOTE: It was suggested to change the ordering of the get_user_pages_fast() arguments to ensure that callers were converted. This breaks the current GUP call site convention of having the returned pages be the final parameter. So the suggestion was rejected. Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marshall <hubcap@omnibond.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b798bec474 |
mm/gup: change write parameter to flags in fast walk
In order to support more options in the GUP fast walk, change the write parameter to flags throughout the call stack. This patch does not change functionality and passes FOLL_WRITE where write was previously used. Link: http://lkml.kernel.org/r/20190328084422.29911-3-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-3-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
932f4a630a |
mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Pach series "Add FOLL_LONGTERM to GUP fast and use it". HFI1, qib, and mthca, use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages. Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1] In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call. [1] https://lkml.org/lkml/2019/3/19/939 "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an aside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. This patch (of 7): This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers who would like to do a longterm (user controlled pin) of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it. This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for Filesystems and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected". FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use. NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration. As a side affect some of the interfaces are cleaned up but this is not the primary purpose of the series. In review[1] it was asked: <quote> > This I don't get - if you do lock down long term mappings performance > of the actual get_user_pages call shouldn't matter to start with. > > What do I miss? A couple of points. First "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Second, It depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an asside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. </quote> [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965 [ira.weiny@intel.com: v3] Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a222f34158 |
mm: generalize putback scan functions
This combines two similar functions move_active_pages_to_lru() and putback_inactive_pages() into single move_pages_to_lru(). This remove duplicate code and makes object file size smaller. Before: text data bss dec hex filename 57082 4732 128 61942 f1f6 mm/vmscan.o After: text data bss dec hex filename 55112 4600 128 59840 e9c0 mm/vmscan.o Note, that now we are checking for !page_evictable() coming from shrink_active_list(), which shouldn't change any behavior since that path works with evictable pages only. Link: http://lkml.kernel.org/r/155290129627.31489.8321971028677203248.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f372d89e5d |
mm: remove pages_to_free argument of move_active_pages_to_lru()
We may use input argument list as output argument too. This makes the function more similar to putback_inactive_pages(). Link: http://lkml.kernel.org/r/155290129079.31489.16180612694090502942.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9851ac1359 |
mm: move nr_deactivate accounting to shrink_active_list()
We know which LRU is not active. [chris@chrisdown.name: fix build on !CONFIG_MEMCG] Link: http://lkml.kernel.org/r/20190322150513.GA22021@chrisdown.name Link: http://lkml.kernel.org/r/155290128498.31489.18250485448913338607.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Chris Down <chris@chrisdown.name> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
886cf1901d |
mm: move recent_rotated pages calculation to shrink_inactive_list()
Patch series "mm: Generalize putback functions"] putback_inactive_pages() and move_active_pages_to_lru() are almost similar, so this patchset merges them ina single function. This patch (of 4): The patch moves the calculation from putback_inactive_pages() to shrink_inactive_list(). This makes putback_inactive_pages() looking more similar to move_active_pages_to_lru(). To do that, we account activated pages in reclaim_stat::nr_activate. Since a page may change its LRU type from anon to file cache inside shrink_page_list() (see ClearPageSwapBacked()), we have to account pages for the both types. So, nr_activate becomes an array. Previously we used nr_activate to account PGACTIVATE events, but now we account them into pgactivate variable (since they are about number of pages in general, not about sum of hpage_nr_pages). Link: http://lkml.kernel.org/r/155290127956.31489.3393586616054413298.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
63931eb975 |
mm, page_alloc: disallow __GFP_COMP in alloc_pages_exact()
alloc_pages_exact*() allocates a page of sufficient order and then splits it to return only the number of pages requested. That makes it incompatible with __GFP_COMP, because compound pages cannot be split. As shown by [1] things may silently work until the requested size (possibly depending on user) stops being power of two. Then for CONFIG_DEBUG_VM, BUG_ON() triggers in split_page(). Without CONFIG_DEBUG_VM, consequences are unclear. There are several options here, none of them great: 1) Don't do the splitting when __GFP_COMP is passed, and return the whole compound page. However if caller then returns it via free_pages_exact(), that will be unexpected and the freeing actions there will be wrong. 2) Warn and remove __GFP_COMP from the flags. But the caller may have really wanted it, so things may break later somewhere. 3) Warn and return NULL. However NULL may be unexpected, especially for small sizes. This patch picks option 2, because as Michal Hocko put it: "callers wanted it" is much less probable than "caller is simply confused and more gfp flags is surely better than fewer". [1] https://lore.kernel.org/lkml/20181126002805.GI18977@shao2-debian/T/#u Link: http://lkml.kernel.org/r/0c6393eb-b28d-4607-c386-862a71f09de6@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Takashi Iwai <tiwai@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5fd4ca2d84 |
mm: page cache: store only head pages in i_pages
Transparent Huge Pages are currently stored in i_pages as pointers to consecutive subpages. This patch changes that to storing consecutive pointers to the head page in preparation for storing huge pages more efficiently in i_pages. Large parts of this are "inspired" by Kirill's patch https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/ [willy@infradead.org: fix swapcache pages] Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org [kirill@shutemov.name: hugetlb stores pages in page cache differently] Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1 Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Kirill Shutemov <kirill@shutemov.name> Reviewed-and-tested-by: Song Liu <songliubraving@fb.com> Tested-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Tested-by: Qian Cai <cai@lca.pw> Cc: Hugh Dickins <hughd@google.com> Cc: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f0fd50504a |
mm/cma_debug.c: fix the break condition in cma_maxchunk_get()
If not find zero bit in find_next_zero_bit(), it will return the size parameter passed in, so the start bit should be compared with bitmap_maxno rather than cma->count. Although getting maxchunk is working fine due to zero value of order_per_bit currently, the operation will be stuck if order_per_bit is set as non-zero. Link: http://lkml.kernel.org/r/20190319092734.276-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Joe Perches <joe@perches.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Safonov <d.safonov@partner.samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
745e10146c |
mm/slab.c: fix an infinite loop in leaks_show()
"cat /proc/slab_allocators" could hang forever on SMP machines with
kmemleak or object debugging enabled due to other CPUs running do_drain()
will keep making kmemleak_object or debug_objects_cache dirty and unable
to escape the first loop in leaks_show(),
do {
set_store_user_clean(cachep);
drain_cpu_caches(cachep);
...
} while (!is_store_user_clean(cachep));
For example,
do_drain
slabs_destroy
slab_destroy
kmem_cache_free
__cache_free
___cache_free
kmemleak_free_recursive
delete_object_full
__delete_object
put_object
free_object_rcu
kmem_cache_free
cache_free_debugcheck --> dirty kmemleak_object
One approach is to check cachep->name and skip both kmemleak_object and
debug_objects_cache in leaks_show(). The other is to set store_user_clean
after drain_cpu_caches() which leaves a small window between
drain_cpu_caches() and set_store_user_clean() where per-CPU caches could
be dirty again lead to slightly wrong information has been stored but
could also speed up things significantly which sounds like a good
compromise. For example,
# cat /proc/slab_allocators
0m42.778s # 1st approach
0m0.737s # 2nd approach
[akpm@linux-foundation.org: tweak comment]
Link: http://lkml.kernel.org/r/20190411032635.10325-1-cai@lca.pw
Fixes:
|
||
|
632b2ef0c7 |
mm/slub.c: update the comment about slab frozen
Now frozen slab can only be on the per cpu partial list. Link: http://lkml.kernel.org/r/1554022325-11305-1-git-send-email-liu.xiang6@zte.com.cn Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
517f9f1ee5 |
mm/slab.c: remove unneed check in cpuup_canceled
nc is a member of percpu allocation memory, and cannot be NULL. Link: http://lkml.kernel.org/r/1553159353-5056-1-git-send-email-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a4d3f8916c |
slub: remove useless kmem_cache_debug() before remove_full()
When CONFIG_SLUB_DEBUG is not enabled, remove_full() is empty. While CONFIG_SLUB_DEBUG is enabled, remove_full() can check s->flags by itself. So kmem_cache_debug() is useless and can be removed. Link: http://lkml.kernel.org/r/1552577313-2830-1-git-send-email-liu.xiang6@zte.com.cn Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
16cb0ec75b |
slab: use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We have a list in the page structure (slab_list) that can be used for this purpose. Doing so makes the code cleaner since we are not overloading the lru list. Use the slab_list instead of the lru list for maintaining lists of slabs. Link: http://lkml.kernel.org/r/20190402230545.2929-7-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
916ac05278 |
slub: use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We have a list in the page structure (slab_list) that can be used for this purpose. Doing so makes the code cleaner since we are not overloading the lru list. Use the slab_list instead of the lru list for maintaining lists of slabs. Link: http://lkml.kernel.org/r/20190402230545.2929-6-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6dfd1b653c |
slub: add comments to endif pre-processor macros
SLUB allocator makes heavy use of ifdef/endif pre-processor macros. The pairing of these statements is at times hard to follow e.g. if the pair are further than a screen apart or if there are nested pairs. We can reduce cognitive load by adding a comment to the endif statement of form #ifdef CONFIG_FOO ... #endif /* CONFIG_FOO */ Add comments to endif pre-processor macros if ifdef/endif pair is not immediately apparent. Link: http://lkml.kernel.org/r/20190402230545.2929-5-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
adab7b6818 |
slob: use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list_head in the page structure (slab_list) that can be used for
this purpose. Doing so makes the code cleaner since we are not
overloading the lru list.
The slab_list is part of a union within the page struct (included here
stripped down):
union {
struct { /* Page cache and anonymous pages */
struct list_head lru;
...
};
struct {
dma_addr_t dma_addr;
};
struct { /* slab, slob and slub */
union {
struct list_head slab_list;
struct { /* Partial pages */
struct page *next;
int pages; /* Nr of pages left */
int pobjects; /* Approximate count */
};
};
...
Here we see that slab_list and lru are the same bits. We can verify that
this change is safe to do by examining the object file produced from
slob.c before and after this patch is applied.
Steps taken to verify:
1. checkout current tip of Linus' tree
commit
|
||
|
130e8e09e2 |
slob: respect list_head abstraction layer
Currently we reach inside the list_head. This is a violation of the layer of abstraction provided by the list_head. It makes the code fragile. More importantly it makes the code wicked hard to understand. The code reaches into the list_head structure to counteract the fact that the list _may_ have been changed during slob_page_alloc(). Instead of this we can add a return parameter to slob_page_alloc() to signal that the list was modified (list_del() called with page->lru to remove page from the freelist). This code is concerned with an optimisation that counters the tendency for first fit allocation algorithm to fragment memory into many small chunks at the front of the memory pool. Since the page is only removed from the list when an allocation uses _all_ the remaining memory in the page then in this special case fragmentation does not occur and we therefore do not need the optimisation. Add a return parameter to slob_page_alloc() to signal that the allocation used up the whole page and that the page was removed from the free list. After calling slob_page_alloc() check the return value just added and only attempt optimisation if the page is still on the list. Use list_head API instead of reaching into the list_head structure to check if sp is at the front of the list. Link: http://lkml.kernel.org/r/20190402230545.2929-3-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2bf753e64b |
mm/hugetlb.c: don't put_page in lock of hugetlb_lock
spinlock recursion happened when do LTP test:
#!/bin/bash
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
./runltp -p -f hugetlb &
The dtor returned by get_compound_page_dtor in __put_compound_page may be
the function of free_huge_page which will lock the hugetlb_lock, so don't
put_page in lock of hugetlb_lock.
BUG: spinlock recursion on CPU#0, hugemmap05/1079
lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, .owner_cpu: 0
Call trace:
dump_backtrace+0x0/0x198
show_stack+0x24/0x30
dump_stack+0xa4/0xcc
spin_dump+0x84/0xa8
do_raw_spin_lock+0xd0/0x108
_raw_spin_lock+0x20/0x30
free_huge_page+0x9c/0x260
__put_compound_page+0x44/0x50
__put_page+0x2c/0x60
alloc_surplus_huge_page.constprop.19+0xf0/0x140
hugetlb_acct_memory+0x104/0x378
hugetlb_reserve_pages+0xe0/0x250
hugetlbfs_file_mmap+0xc0/0x140
mmap_region+0x3e8/0x5b0
do_mmap+0x280/0x460
vm_mmap_pgoff+0xf4/0x128
ksys_mmap_pgoff+0xb4/0x258
__arm64_sys_mmap+0x34/0x48
el0_svc_common+0x78/0x130
el0_svc_handler+0x38/0x78
el0_svc+0x8/0xc
Link: http://lkml.kernel.org/r/b8ade452-2d6b-0372-32c2-703644032b47@huawei.com
Fixes:
|
||
|
fce86ff580 |
mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses
Starting with |
||
|
3aff5fac54 |
Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou: - scan hint update which helps address performance issues with heavily fragmented blocks - lockdep fix when freeing an allocation causes balance work to be scheduled * 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: remove spurious lock dependency between percpu and sched percpu: use chunk scan_hint to skip some scanning percpu: convert chunk hints to be based on pcpu_block_md percpu: make pcpu_block_md generic percpu: use block scan_hint to only scan forward percpu: remember largest area skipped during allocation percpu: add block level scan_hint percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE percpu: relegate chunks unusable when failing small allocations percpu: manage chunks based on contig_bits instead of free_bytes percpu: introduce helper to determine if two regions overlap percpu: do not search past bitmap when allocating an area percpu: update free path with correct new free region |
||
|
5a28fc94c9 |
x86/mpx, mm/core: Fix recursive munmap() corruption
This is a bit of a mess, to put it mildly. But, it's a bug that only seems to have showed up in 4.20 but wasn't noticed until now, because nobody uses MPX. MPX has the arch_unmap() hook inside of munmap() because MPX uses bounds tables that protect other areas of memory. When memory is unmapped, there is also a need to unmap the MPX bounds tables. Barring this, unused bounds tables can eat 80% of the address space. But, the recursive do_munmap() that gets called vi arch_unmap() wreaks havoc with __do_munmap()'s state. It can result in freeing populated page tables, accessing bogus VMA state, double-freed VMAs and more. See the "long story" further below for the gory details. To fix this, call arch_unmap() before __do_unmap() has a chance to do anything meaningful. Also, remove the 'vma' argument and force the MPX code to do its own, independent VMA lookup. == UML / unicore32 impact == Remove unused 'vma' argument to arch_unmap(). No functional change. I compile tested this on UML but not unicore32. == powerpc impact == powerpc uses arch_unmap() well to watch for munmap() on the VDSO and zeroes out 'current->mm->context.vdso_base'. Moving arch_unmap() makes this happen earlier in __do_munmap(). But, 'vdso_base' seems to only be used in perf and in the signal delivery that happens near the return to userspace. I can not find any likely impact to powerpc, other than the zeroing happening a little earlier. powerpc does not use the 'vma' argument and is unaffected by its removal. I compile-tested a 64-bit powerpc defconfig. == x86 impact == For the common success case this is functionally identical to what was there before. For the munmap() failure case, it's possible that some MPX tables will be zapped for memory that continues to be in use. But, this is an extraordinarily unlikely scenario and the harm would be that MPX provides no protection since the bounds table got reset (zeroed). I can't imagine anyone doing this: ptr = mmap(); // use ptr ret = munmap(ptr); if (ret) // oh, there was an error, I'll // keep using ptr. Because if you're doing munmap(), you are *done* with the memory. There's probably no good data in there _anyway_. This passes the original reproducer from Richard Biener as well as the existing mpx selftests/. The long story: munmap() has a couple of pieces: 1. Find the affected VMA(s) 2. Split the start/end one(s) if neceesary 3. Pull the VMAs out of the rbtree 4. Actually zap the memory via unmap_region(), including freeing page tables (or queueing them to be freed). 5. Fix up some of the accounting (like fput()) and actually free the VMA itself. This specific ordering was actually introduced by: |
||
|
198790d9a3 |
percpu: remove spurious lock dependency between percpu and sched
In free_percpu() we sometimes call pcpu_schedule_balance_work() to queue a work item (which does a wakeup) while holding pcpu_lock. This creates an unnecessary lock dependency between pcpu_lock and the scheduler's pi_lock. There are other places where we call pcpu_schedule_balance_work() without hold pcpu_lock, and this case doesn't need to be different. Moving the call outside the lock prevents the following lockdep splat when running tools/testing/selftests/bpf/{test_maps,test_progs} in sequence with lockdep enabled: ====================================================== WARNING: possible circular locking dependency detected 5.1.0-dbg-DEV #1 Not tainted ------------------------------------------------------ kworker/23:255/18872 is trying to acquire lock: 000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520 but task is already holding lock: 00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (pcpu_lock){..-.}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 pcpu_alloc+0xfa/0x780 __alloc_percpu_gfp+0x12/0x20 alloc_htab_elem+0x184/0x2b0 __htab_percpu_map_update_elem+0x252/0x290 bpf_percpu_hash_update+0x7c/0x130 __do_sys_bpf+0x1912/0x1be0 __x64_sys_bpf+0x1a/0x20 do_syscall_64+0x59/0x400 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #3 (&htab->buckets[i].lock){....}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 htab_map_update_elem+0x1af/0x3a0 -> #2 (&rq->lock){-.-.}: lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 task_fork_fair+0x37/0x160 sched_fork+0x211/0x310 copy_process.part.43+0x7b1/0x2160 _do_fork+0xda/0x6b0 kernel_thread+0x29/0x30 rest_init+0x22/0x260 arch_call_rest_init+0xe/0x10 start_kernel+0x4fd/0x520 x86_64_start_reservations+0x24/0x26 x86_64_start_kernel+0x6f/0x72 secondary_startup_64+0xa4/0xb0 -> #1 (&p->pi_lock){-.-.}: lock_acquire+0x9e/0x180 _raw_spin_lock_irqsave+0x3a/0x50 try_to_wake_up+0x41/0x600 wake_up_process+0x15/0x20 create_worker+0x16b/0x1e0 workqueue_init+0x279/0x2ee kernel_init_freeable+0xf7/0x288 kernel_init+0xf/0x180 ret_from_fork+0x24/0x30 -> #0 (&(&pool->lock)->rlock){-.-.}: __lock_acquire+0x101f/0x12a0 lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 __queue_work+0xb2/0x520 queue_work_on+0x38/0x80 free_percpu+0x221/0x260 pcpu_freelist_destroy+0x11/0x20 stack_map_free+0x2a/0x40 bpf_map_free_deferred+0x3c/0x50 process_one_work+0x1f7/0x580 worker_thread+0x54/0x410 kthread+0x10f/0x150 ret_from_fork+0x24/0x30 other info that might help us debug this: Chain exists of: &(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pcpu_lock); lock(&htab->buckets[i].lock); lock(pcpu_lock); lock(&(&pool->lock)->rlock); *** DEADLOCK *** 3 locks held by kworker/23:255/18872: #0: 00000000b36a6e16 ((wq_completion)events){+.+.}, at: process_one_work+0x17a/0x580 #1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x17a/0x580 #2: 00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260 stack backtrace: CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1 Hardware name: ... Workqueue: events bpf_map_free_deferred Call Trace: dump_stack+0x67/0x95 print_circular_bug.isra.38+0x1c6/0x220 check_prev_add.constprop.50+0x9f6/0xd20 __lock_acquire+0x101f/0x12a0 lock_acquire+0x9e/0x180 _raw_spin_lock+0x2f/0x40 __queue_work+0xb2/0x520 queue_work_on+0x38/0x80 free_percpu+0x221/0x260 pcpu_freelist_destroy+0x11/0x20 stack_map_free+0x2a/0x40 bpf_map_free_deferred+0x3c/0x50 process_one_work+0x1f7/0x580 worker_thread+0x54/0x410 kthread+0x10f/0x150 ret_from_fork+0x24/0x30 Signed-off-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> |
||
|
168e153d5e |
Merge branch 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs inode freeing updates from Al Viro: "Introduction of separate method for RCU-delayed part of ->destroy_inode() (if any). Pretty much as posted, except that destroy_inode() stashes ->free_inode into the victim (anon-unioned with ->i_fops) before scheduling i_callback() and the last two patches (sockfs conversion and folding struct socket_wq into struct socket) are excluded - that pair should go through netdev once davem reopens his tree" * 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits) orangefs: make use of ->free_inode() shmem: make use of ->free_inode() hugetlb: make use of ->free_inode() overlayfs: make use of ->free_inode() jfs: switch to ->free_inode() fuse: switch to ->free_inode() ext4: make use of ->free_inode() ecryptfs: make use of ->free_inode() ceph: use ->free_inode() btrfs: use ->free_inode() afs: switch to use of ->free_inode() dax: make use of ->free_inode() ntfs: switch to ->free_inode() securityfs: switch to ->free_inode() apparmor: switch to ->free_inode() rpcpipe: switch to ->free_inode() bpf: switch to ->free_inode() mqueue: switch to ->free_inode() ufs: switch to ->free_inode() coda: switch to ->free_inode() ... |
||
|
0968621917 |
Printk changes for 5.2
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY= =WZfC -----END PGP SIGNATURE----- Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Allow state reset of printk_once() calls. - Prevent crashes when dereferencing invalid pointers in vsprintf(). Only the first byte is checked for simplicity. - Make vsprintf warnings consistent and inlined. - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf modifiers. - Some clean up of vsprintf and test_printf code. * tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: lib/vsprintf: Make function pointer_string static vsprintf: Limit the length of inlined error messages vsprintf: Avoid confusion between invalid address and value vsprintf: Prevent crash when dereferencing invalid pointers vsprintf: Consolidate handling of unknown pointer specifiers vsprintf: Factor out %pO handler as kobject_string() vsprintf: Factor out %pV handler as va_format() vsprintf: Factor out %p[iI] handler as ip_addr_string() vsprintf: Do not check address of well-known strings vsprintf: Consistent %pK handling for kptr_restrict == 0 vsprintf: Shuffle restricted_pointer() printk: Tie printk_once / printk_deferred_once into .data.once for reset treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively lib/test_printf: Switch to bitmap_zalloc() |
||
|
c620f7bd0b |
arm64 updates for 5.2
Mostly just incremental improvements here: - Introduce AT_HWCAP2 for advertising CPU features to userspace - Expose SVE2 availability to userspace - Support for "data cache clean to point of deep persistence" (DC PODP) - Honour "mitigations=off" on the cmdline and advertise status via sysfs - CPU timer erratum workaround (Neoverse-N1 #1188873) - Introduce perf PMU driver for the SMMUv3 performance counters - Add config option to disable the kuser helpers page for AArch32 tasks - Futex modifications to ensure liveness under contention - Rework debug exception handling to seperate kernel and user handlers - Non-critical fixes and cleanup -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAlzMFGgACgkQt6xw3ITB YzTicAf/TX1h1+ecbx4WJAa4qeiOCPoNpG9efldQumqJhKL44MR5bkhuShna5mwE ptm5qUXkZCxLTjzssZKnbdbgwa3t+emW8Of3D91IfI9akiZbMoDx5FGgcNbqjazb RLrhOFHwgontA38yppZN+DrL+sXbvif/CVELdHahkEx6KepSGaS2lmPXRmz/W56v 4yIRy/zxc3Dhjgfm3wKh72nBwoZdLiIc4mchd5pthNlR9E2idrYkQegG1C+gA00r o8uZRVOWgoh7H+QJE+xLUc8PaNCg8xqRRXOuZYg9GOz6hh7zSWhm+f1nRz9S2tIR gIgsCHNqoO2I3E1uJpAQXDGtt2kFhA== =ulpJ -----END PGP SIGNATURE----- Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "Mostly just incremental improvements here: - Introduce AT_HWCAP2 for advertising CPU features to userspace - Expose SVE2 availability to userspace - Support for "data cache clean to point of deep persistence" (DC PODP) - Honour "mitigations=off" on the cmdline and advertise status via sysfs - CPU timer erratum workaround (Neoverse-N1 #1188873) - Introduce perf PMU driver for the SMMUv3 performance counters - Add config option to disable the kuser helpers page for AArch32 tasks - Futex modifications to ensure liveness under contention - Rework debug exception handling to seperate kernel and user handlers - Non-critical fixes and cleanup" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (92 commits) Documentation: Add ARM64 to kernel-parameters.rst arm64/speculation: Support 'mitigations=' cmdline option arm64: ssbs: Don't treat CPUs with SSBS as unaffected by SSB arm64: enable generic CPU vulnerabilites support arm64: add sysfs vulnerability show for speculative store bypass arm64: Fix size of __early_cpu_boot_status clocksource/arm_arch_timer: Use arch_timer_read_counter to access stable counters clocksource/arm_arch_timer: Remove use of workaround static key clocksource/arm_arch_timer: Drop use of static key in arch_timer_reg_read_stable clocksource/arm_arch_timer: Direcly assign set_next_event workaround arm64: Use arch_timer_read_counter instead of arch_counter_get_cntvct watchdog/sbsa: Use arch_timer_read_counter instead of arch_counter_get_cntvct ARM: vdso: Remove dependency with the arch_timer driver internals arm64: Apply ARM64_ERRATUM_1188873 to Neoverse-N1 arm64: Add part number for Neoverse N1 arm64: Make ARM64_ERRATUM_1188873 depend on COMPAT arm64: Restrict ARM64_ERRATUM_1188873 mitigation to AArch32 arm64: mm: Remove pte_unmap_nested() arm64: Fix compiler warning from pte_unmap() with -Wunused-but-set-variable arm64: compat: Reduce address limit for 64K pages ... |
||
|
0bc40e549a |
Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar: "The changes in here are: - text_poke() fixes and an extensive set of executability lockdowns, to (hopefully) eliminate the last residual circumstances under which we are using W|X mappings even temporarily on x86 kernels. This required a broad range of surgery in text patching facilities, module loading, trampoline handling and other bits. - tweak page fault messages to be more informative and more structured. - remove DISCONTIGMEM support on x86-32 and make SPARSEMEM the default. - reduce KASLR granularity on 5-level paging kernels from 512 GB to 1 GB. - misc other changes and updates" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/mm: Initialize PGD cache during mm initialization x86/alternatives: Add comment about module removal races x86/kprobes: Use vmalloc special flag x86/ftrace: Use vmalloc special flag bpf: Use vmalloc special flag modules: Use vmalloc special flag mm/vmalloc: Add flag for freeing of special permsissions mm/hibernation: Make hibernation handle unmapped pages x86/mm/cpa: Add set_direct_map_*() functions x86/alternatives: Remove the return value of text_poke_*() x86/jump-label: Remove support for custom text poker x86/modules: Avoid breaking W^X while loading modules x86/kprobes: Set instruction page as executable x86/ftrace: Set trampoline pages as executable x86/kgdb: Avoid redundant comparison of patched code x86/alternatives: Use temporary mm for text poking x86/alternatives: Initialize temporary mm for patching fork: Provide a function for copying init_mm uprobes: Initialize uprobes earlier x86/mm: Save debug registers when loading a temporary mm ... |
||
|
8f14772703 |
Merge branch 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 irq updates from Ingo Molnar: "Here are the main changes in this tree: - Introduce x86-64 IRQ/exception/debug stack guard pages to detect stack overflows immediately and deterministically. - Clean up over a decade worth of cruft accumulated. The outcome of this should be more clear-cut faults/crashes when any of the low level x86 CPU stacks overflow, instead of silent memory corruption and sporadic failures much later on" * 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) x86/irq: Fix outdated comments x86/irq/64: Remove stack overflow debug code x86/irq/64: Remap the IRQ stack with guard pages x86/irq/64: Split the IRQ stack into its own pages x86/irq/64: Init hardirq_stack_ptr during CPU hotplug x86/irq/32: Handle irq stack allocation failure proper x86/irq/32: Invoke irq_ctx_init() from init_IRQ() x86/irq/64: Rename irq_stack_ptr to hardirq_stack_ptr x86/irq/32: Rename hard/softirq_stack to hard/softirq_stack_ptr x86/irq/32: Make irq stack a character array x86/irq/32: Define IRQ_STACK_SIZE x86/dumpstack/64: Speedup in_exception_stack() x86/exceptions: Split debug IST stack x86/exceptions: Enable IST guard pages x86/exceptions: Disconnect IST index and stack order x86/cpu: Remove orig_ist array x86/cpu: Prepare TSS.IST setup for guard pages x86/dumpstack/64: Use cpu_entry_area instead of orig_ist x86/irq/64: Use cpu entry area instead of orig_ist x86/traps: Use cpu_entry_area instead of orig_ist ... |
||
|
2c6a392cdd |
Merge branch 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stack trace updates from Ingo Molnar: "So Thomas looked at the stacktrace code recently and noticed a few weirdnesses, and we all know how such stories of crummy kernel code meeting German engineering perfection end: a 45-patch series to clean it all up! :-) Here's the changes in Thomas's words: 'Struct stack_trace is a sinkhole for input and output parameters which is largely pointless for most usage sites. In fact if embedded into other data structures it creates indirections and extra storage overhead for no benefit. Looking at all usage sites makes it clear that they just require an interface which is based on a storage array. That array is either on stack, global or embedded into some other data structure. Some of the stack depot usage sites are outright wrong, but fortunately the wrongness just causes more stack being used for nothing and does not have functional impact. Another oddity is the inconsistent termination of the stack trace with ULONG_MAX. It's pointless as the number of entries is what determines the length of the stored trace. In fact quite some call sites remove the ULONG_MAX marker afterwards with or without nasty comments about it. Not all architectures do that and those which do, do it inconsistenly either conditional on nr_entries == 0 or unconditionally. The following series cleans that up by: 1) Removing the ULONG_MAX termination in the architecture code 2) Removing the ULONG_MAX fixups at the call sites 3) Providing plain storage array based interfaces for stacktrace and stackdepot. 4) Cleaning up the mess at the callsites including some related cleanups. 5) Removing the struct stack_trace based interfaces This is not changing the struct stack_trace interfaces at the architecture level, but it removes the exposure to the generic code'" * 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) x86/stacktrace: Use common infrastructure stacktrace: Provide common infrastructure lib/stackdepot: Remove obsolete functions stacktrace: Remove obsolete functions livepatch: Simplify stack trace retrieval tracing: Remove the last struct stack_trace usage tracing: Simplify stack trace retrieval tracing: Make ftrace_trace_userstack() static and conditional tracing: Use percpu stack trace buffer more intelligently tracing: Simplify stacktrace retrieval in histograms lockdep: Simplify stack trace handling lockdep: Remove save argument from check_prev_add() lockdep: Remove unused trace argument from print_circular_bug() drm: Simplify stacktrace handling dm persistent data: Simplify stack trace handling dm bufio: Simplify stack trace retrieval btrfs: ref-verify: Simplify stack trace retrieval dma/debug: Simplify stracktrace retrieval fault-inject: Simplify stacktrace retrieval mm/page_owner: Simplify stack trace handling ... |
||
|
6ec62961e6 |
Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar: "This is a series from Peter Zijlstra that adds x86 build-time uaccess validation of SMAP to objtool, which will detect and warn about the following uaccess API usage bugs and weirdnesses: - call to %s() with UACCESS enabled - return with UACCESS enabled - return with UACCESS disabled from a UACCESS-safe function - recursive UACCESS enable - redundant UACCESS disable - UACCESS-safe disables UACCESS As it turns out not leaking uaccess permissions outside the intended uaccess functionality is hard when the interfaces are complex and when such bugs are mostly dormant. As a bonus we now also check the DF flag. We had at least one high-profile bug in that area in the early days of Linux, and the checking is fairly simple. The checks performed and warnings emitted are: - call to %s() with DF set - return with DF set - return with modified stack frame - recursive STD - redundant CLD It's all x86-only for now, but later on this can also be used for PAN on ARM and objtool is fairly cross-platform in principle. While all warnings emitted by this new checking facility that got reported to us were fixed, there might be GCC version dependent warnings that were not reported yet - which we'll address, should they trigger. The warnings are non-fatal build warnings" * 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) mm/uaccess: Use 'unsigned long' to placate UBSAN warnings on older GCC versions x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation sched/x86_64: Don't save flags on context switch objtool: Add Direction Flag validation objtool: Add UACCESS validation objtool: Fix sibling call detection objtool: Rewrite alt->skip_orig objtool: Add --backtrace support objtool: Rewrite add_ignores() objtool: Handle function aliases objtool: Set insn->func for alternatives x86/uaccess, kcov: Disable stack protector x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP x86/uaccess, ubsan: Fix UBSAN vs. SMAP x86/uaccess, kasan: Fix KASAN vs SMAP x86/smap: Ditch __stringify() x86/uaccess: Introduce user_access_{save,restore}() x86/uaccess, signal: Fix AC=1 bloat x86/uaccess: Always inline user_access_begin() x86/uaccess, xen: Suppress SMAP warnings ... |
||
|
171c2bcbcb |
Merge branch 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull unified TLB flushing from Ingo Molnar: "This contains the generic mmu_gather feature from Peter Zijlstra, which is an all-arch unification of TLB flushing APIs, via the following (broad) steps: - enhance the <asm-generic/tlb.h> APIs to cover more arch details - convert most TLB flushing arch implementations to the generic <asm-generic/tlb.h> APIs. - remove leftovers of per arch implementations After this series every single architecture makes use of the unified TLB flushing APIs" * 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm/resource: Use resource_overlaps() to simplify region_intersects() ia64/tlb: Eradicate tlb_migrate_finish() callback asm-generic/tlb: Remove tlb_table_flush() asm-generic/tlb: Remove tlb_flush_mmu_free() asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER asm-generic/tlb: Remove arch_tlb*_mmu() s390/tlb: Convert to generic mmu_gather asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y arch/tlb: Clean up simple architectures um/tlb: Convert to generic mmu_gather sh/tlb: Convert SH to generic mmu_gather ia64/tlb: Convert to generic mmu_gather arm/tlb: Convert to generic mmu_gather asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE asm-generic/tlb, ia64: Conditionally provide tlb_migrate_finish() asm-generic/tlb: Provide generic tlb_flush() based on flush_tlb_mm() asm-generic/tlb, arch: Provide generic tlb_flush() based on flush_tlb_range() asm-generic/tlb, arch: Provide generic VIPT cache flush asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE asm-generic/tlb: Provide a comment |
||
|
74b1da5645 |
shmem: make use of ->free_inode()
same situation as for hugetlbfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
868b104d73 |
mm/vmalloc: Add flag for freeing of special permsissions
Add a new flag VM_FLUSH_RESET_PERMS, for enabling vfree operations to immediately clear executable TLB entries before freeing pages, and handle resetting permissions on the directmap. This flag is useful for any kind of memory with elevated permissions, or where there can be related permissions changes on the directmap. Today this is RO+X and RO memory. Although this enables directly vfreeing non-writeable memory now, non-writable memory cannot be freed in an interrupt because the allocation itself is used as a node on deferred free list. So when RO memory needs to be freed in an interrupt the code doing the vfree needs to have its own work queue, as was the case before the deferred vfree list was added to vmalloc. For architectures with set_direct_map_ implementations this whole operation can be done with one TLB flush when centralized like this. For others with directmap permissions, currently only arm64, a backup method using set_memory functions is used to reset the directmap. When arm64 adds set_direct_map_ functions, this backup can be removed. When the TLB is flushed to both remove TLB entries for the vmalloc range mapping and the direct map permissions, the lazy purge operation could be done to try to save a TLB flush later. However today vm_unmap_aliases could flush a TLB range that does not include the directmap. So a helper is added with extra parameters that can allow both the vmalloc address and the direct mapping to be flushed during this operation. The behavior of the normal vm_unmap_aliases function is unchanged. Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-17-namit@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
d633269286 |
mm/hibernation: Make hibernation handle unmapped pages
Make hibernate handle unmapped pages on the direct map when CONFIG_ARCH_HAS_SET_ALIAS=y is set. These functions allow for setting pages to invalid configurations, so now hibernate should check if the pages have valid mappings and handle if they are unmapped when doing a hibernate save operation. Previously this checking was already done when CONFIG_DEBUG_PAGEALLOC=y was configured. It does not appear to have a big hibernating performance impact. The speed of the saving operation before this change was measured as 819.02 MB/s, and after was measured at 813.32 MB/s. Before: [ 4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s) After: [ 4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s) Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: <will.deacon@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-16-namit@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
af52bf6b92 |
mm/page_owner: Simplify stack trace handling
Replace the indirection through struct stack_trace by using the storage array based interfaces. The original code in all printing functions is really wrong. It allocates a storage array on stack which is unused because depot_fetch_stack() does not store anything in it. It overwrites the entries pointer in the stack_trace struct so it points to the depot storage. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: linux-mm@kvack.org Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: iommu@lists.linux-foundation.org Cc: Robin Murphy <robin.murphy@arm.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: linux-btrfs@vger.kernel.org Cc: dm-devel@redhat.com Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: linux-arch@vger.kernel.org Link: https://lkml.kernel.org/r/20190425094802.067210525@linutronix.de |
||
|
880e049c9c |
mm/kasan: Simplify stacktrace handling
Replace the indirection through struct stack_trace by using the storage array based interfaces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: iommu@lists.linux-foundation.org Cc: Robin Murphy <robin.murphy@arm.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: linux-btrfs@vger.kernel.org Cc: dm-devel@redhat.com Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: linux-arch@vger.kernel.org Link: https://lkml.kernel.org/r/20190425094801.963261479@linutronix.de |
||
|
07984aad1c |
mm/kmemleak: Simplify stacktrace handling
Replace the indirection through struct stack_trace by using the storage array based interfaces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: linux-mm@kvack.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: iommu@lists.linux-foundation.org Cc: Robin Murphy <robin.murphy@arm.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: linux-btrfs@vger.kernel.org Cc: dm-devel@redhat.com Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: linux-arch@vger.kernel.org Link: https://lkml.kernel.org/r/20190425094801.863716911@linutronix.de |
||
|
7971679994 |
mm/slub: Simplify stack trace retrieval
Replace the indirection through struct stack_trace with an invocation of the storage array based interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: linux-mm@kvack.org Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: iommu@lists.linux-foundation.org Cc: Robin Murphy <robin.murphy@arm.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: linux-btrfs@vger.kernel.org Cc: dm-devel@redhat.com Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: linux-arch@vger.kernel.org Link: https://lkml.kernel.org/r/20190425094801.771410441@linutronix.de |
||
|
8118b82eb7 |
mm/page_alloc.c: fix never set ALLOC_NOFRAGMENT flag
Commit |
||
|
8139ad043d |
mm/page_alloc.c: avoid potential NULL pointer dereference
ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL.
'zone' pointer unconditionally derefernced in alloc_flags_nofragment().
Bail out on NULL zone to avoid potential crash. Currently we don't see
any crashes only because alloc_flags_nofragment() has another bug which
allows compiler to optimize away all accesses to 'zone'.
Link: http://lkml.kernel.org/r/20190423120806.3503-1-aryabinin@virtuozzo.com
Fixes:
|
||
|
ee8ab0eeb4 |
mm, page_alloc: always use a captured page regardless of compaction result
During the development of commit |
||
|
24512228b7 |
mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory model
Mikulas Patocka reported that commit |
||
|
89c02e69fc |
mm/memory_hotplug.c: drop memory device reference after find_memory_block()
Right now we are using find_memory_block() to get the node id for the
pfn range to online. We are missing to drop a reference to the memory
block device. While the device still gets unregistered via
device_unregister(), resulting in no user visible problem, the device is
never released via device_release(), resulting in a memory leak. Fix
that by properly using a put_device().
Link: http://lkml.kernel.org/r/20190411110955.1430-1-david@redhat.com
Fixes:
|
||
|
4c3f49ae13 |
Merge branch 'for-5.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu fixlet from Dennis Zhou: "This stops printing the base address of percpu memory on initialization" * 'for-5.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: stop printing kernel addresses |
||
|
04f5866e41 |
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it. Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier. For example in Hugh's post from Jul 2017:
https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
"Not strictly relevant here, but a related note: I was very surprised
to discover, only quite recently, how handle_mm_fault() may be called
without down_read(mmap_sem) - when core dumping. That seems a
misguided optimization to me, which would also be nice to correct"
In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.
Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.
Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.
For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs. Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.
Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.
In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.
Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm(). The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.
Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes:
|
||
|
dce5b0bdee |
mm/kmemleak.c: fix unused-function warning
The only references outside of the #ifdef have been removed, so now we
get a warning in non-SMP configurations:
mm/kmemleak.c:1404:13: error: unused function 'scan_large_block' [-Werror,-Wunused-function]
Add a new #ifdef around it.
Link: http://lkml.kernel.org/r/20190416123148.3502045-1-arnd@arndb.de
Fixes:
|
||
|
3b991208b8 |
mm: fix inactive list balancing between NUMA nodes and cgroups
During !CONFIG_CGROUP reclaim, we expand the inactive list size if it's
thrashing on the node that is about to be reclaimed. But when cgroups
are enabled, we suddenly ignore the node scope and use the cgroup scope
only. The result is that pressure bleeds between NUMA nodes depending
on whether cgroups are merely compiled into Linux. This behavioral
difference is unexpected and undesirable.
When the refault adaptivity of the inactive list was first introduced,
there were no statistics at the lruvec level - the intersection of node
and memcg - so it was better than nothing.
But now that we have that infrastructure, use lruvec_page_state() to
make the list balancing decision always NUMA aware.
[hannes@cmpxchg.org: fix bisection hole]
Link: http://lkml.kernel.org/r/20190417155241.GB23013@cmpxchg.org
Link: http://lkml.kernel.org/r/20190412144438.2645-1-hannes@cmpxchg.org
Fixes:
|
||
|
1a9f219157 |
mm/hotplug: treat CMA pages as unmovable
has_unmovable_pages() is used by allocating CMA and gigantic pages as well as the memory hotplug. The later doesn't know how to offline CMA pool properly now, but if an unused (free) CMA page is encountered, then has_unmovable_pages() happily considers it as a free memory and propagates this up the call chain. Memory offlining code then frees the page without a proper CMA tear down which leads to an accounting issues. Moreover if the same memory range is onlined again then the memory never gets back to the CMA pool. State after memory offline: # grep cma /proc/vmstat nr_free_cma 205824 # cat /sys/kernel/debug/cma/cma-kvm_cma/count 209920 Also, kmemleak still think those memory address are reserved below but have already been used by the buddy allocator after onlining. This patch fixes the situation by treating CMA pageblocks as unmovable except when has_unmovable_pages() is called as part of CMA allocation. Offlined Pages 4096 kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing) Call Trace: dump_stack+0xb0/0xf4 (unreliable) create_object+0x344/0x380 __kmalloc_node+0x3ec/0x860 kvmalloc_node+0x58/0x110 seq_read+0x41c/0x620 __vfs_read+0x3c/0x70 vfs_read+0xbc/0x1a0 ksys_read+0x7c/0x140 system_call+0x5c/0x70 kmemleak: Kernel memory leak detector disabled kmemleak: Object 0xc000201cc8000000 (size 13757317120): kmemleak: comm "swapper/0", pid 0, jiffies 4294937297 kmemleak: min_count = -1 kmemleak: count = 0 kmemleak: flags = 0x5 kmemleak: checksum = 0 kmemleak: backtrace: cma_declare_contiguous+0x2a4/0x3b0 kvm_cma_reserve+0x11c/0x134 setup_arch+0x300/0x3f8 start_kernel+0x9c/0x6e8 start_here_common+0x1c/0x4b0 kmemleak: Automatic memory scanning thread ended [cai@lca.pw: use is_migrate_cma_page() and update commit log] Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
e8277b3b52 |
mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
Commit |
||
|
af53d3e9e0 |
mm: swapoff: shmem_unuse() stop eviction without igrab()
The igrab() in shmem_unuse() looks good, but we forgot that it gives no protection against concurrent unmounting: a point made by Konstantin Khlebnikov eight years ago, and then fixed in 2.6.39 by |
||
|
64165b1aff |
mm: swapoff: take notice of completion sooner
The old try_to_unuse() implementation was driven by find_next_to_unuse(),
which terminated as soon as all the swap had been freed.
Add inuse_pages checks now (alongside signal_pending()) to stop scanning
mms and swap_map once finished.
The same ought to be done in shmem_unuse() too, but never was before,
and needs a different interface: so leave it as is for now.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081258200.1523@eggly.anvils
Fixes:
|
||
|
dd862deb15 |
mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
further testing has proved it to be a source of unnecessary swapoff
EBUSY failures (which can then be followed by unmount EBUSY failures).
When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting
or inode being evicted, freeing up swap independent of try_to_unuse().
Those typically completed much sooner than the old quadratic swapoff,
but now it's more common that swapoff may need to wait for them.
It's possible to move those cases from init_mm.mmlist and shmem_swaplist
to separate "exiting" swaplists, and try_to_unuse() then wait for those
lists to be emptied; but we've not bothered with that in the past, and
don't want to risk missing some other forgotten case. So just revert to
cycling around until the swap is gone, without any retries limit.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081256170.1523@eggly.anvils
Fixes:
|
||
|
8703954654 |
mm: swapoff: shmem_find_swap_entries() filter out other types
Swapfile "type" was passed all the way down to shmem_unuse_inode(), but
then forgotten from shmem_find_swap_entries(): with the result that
removing one swapfile would try to free up all the swap from shmem - no
problem when only one swapfile anyway, but counter-productive when more,
causing swapoff to be unnecessarily OOM-killed when it should succeed.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081254470.1523@eggly.anvils
Fixes:
|
||
|
1a62b18d51 |
slab: store tagged freelist for off-slab slabmgmt
Commit |
||
|
80552f0f7a |
mm/slab: Remove store_stackinfo()
store_stackinfo() does not seem used in actual SLAB debugging. Potentially, it could be added to check_poison_obj() to provide more information but this seems like an overkill due to the declining popularity of SLAB, so just remove it instead. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: linux-mm <linux-mm@kvack.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: rientjes@google.com Cc: sean.j.christopherson@intel.com Link: https://lkml.kernel.org/r/20190416142258.18694-1-cai@lca.pw |
||
|
6b3a707736 |
Merge branch 'page-refs' (page ref overflow)
Merge page ref overflow branch. Jann Horn reported that he can overflow the page ref count with sufficient memory (and a filesystem that is intentionally extremely slow). Admittedly it's not exactly easy. To have more than four billion references to a page requires a minimum of 32GB of kernel memory just for the pointers to the pages, much less any metadata to keep track of those pointers. Jann needed a total of 140GB of memory and a specially crafted filesystem that leaves all reads pending (in order to not ever free the page references and just keep adding more). Still, we have a fairly straightforward way to limit the two obvious user-controllable sources of page references: direct-IO like page references gotten through get_user_pages(), and the splice pipe page duplication. So let's just do that. * branch page-refs: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit |
||
|
ead97a49ec |
mm/kasan: Remove the ULONG_MAX stack trace hackery
No architecture terminates the stack trace with ULONG_MAX anymore. Remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Link: https://lkml.kernel.org/r/20190410103644.750219625@linutronix.de |
||
|
4621c9858f |
mm/page_owner: Remove the ULONG_MAX stack trace hackery
No architecture terminates the stack trace with ULONG_MAX anymore. Remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: linux-mm@kvack.org Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20190410103644.661974663@linutronix.de |
||
|
b8ca7ff773 |
mm/slub: Remove the ULONG_MAX stack trace hackery
No architecture terminates the stack trace with ULONG_MAX anymore. Remove the cruft. While at it remove the pointless loop of clearing the stack array completely. It's sufficient to clear the last entry as the consumers break out on the first zeroed entry anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: linux-mm@kvack.org Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Link: https://lkml.kernel.org/r/20190410103644.574058244@linutronix.de |
||
|
8fde12ca79 |
mm: prevent get_user_pages() from overflowing page refcount
If the page refcount wraps around past zero, it will be freed while there are still four billion references to it. One of the possible avenues for an attacker to try to make this happen is by doing direct IO on a page multiple times. This patch makes get_user_pages() refuse to take a new page reference if there are already more than two billion references to the page. Reported-by: Jann Horn <jannh@google.com> Acked-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d75f773c86 |
treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
%pF and %pf are functionally equivalent to %pS and %ps conversion specifiers. The former are deprecated, therefore switch the current users to use the preferred variant. The changes have been produced by the following command: git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \ while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done And verifying the result. Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: sparclinux@vger.kernel.org Cc: linux-um@lists.infradead.org Cc: xen-devel@lists.xenproject.org Cc: linux-acpi@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: drbd-dev@lists.linbit.com Cc: linux-block@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-nvdimm@lists.01.org Cc: linux-pci@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: linux-btrfs@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: linux-mm@kvack.org Cc: ceph-devel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Acked-by: David Sterba <dsterba@suse.com> (for btrfs) Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c) Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci) Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Petr Mladek <pmladek@suse.com> |
||
|
e2092740b7 |
kasan: Makefile: Replace -pg with CC_FLAGS_FTRACE
In preparation for arm64 supporting ftrace built on other compiler options, let's have Makefiles remove the $(CC_FLAGS_FTRACE) flags, whatever these may be, rather than assuming '-pg'. There should be no functional change as a result of this patch. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Torsten Duwe <duwe@suse.de> Signed-off-by: Will Deacon <will.deacon@arm.com> |
||
|
fcf88917dd |
slab: fix a crash by reading /proc/slab_allocators
The commit |
||
|
e91455217d |
mm/util.c: fix strndup_user() comment
The kerneldoc misdescribes strndup_user()'s return value. Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Timur Tabi <timur@freescale.com> Cc: Mihai Caraman <mihai.caraman@freescale.com> Cc: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0b3d6e6f2d |
mm: writeback: use exact memcg dirty counts
Since commit
|
||
|
c6f3c5ee40 |
mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
With some architectures like ppc64, set_pmd_at() cannot cope with a
situation where there is already some (different) valid entry present.
Use pmdp_set_access_flags() instead to modify the pfn which is built to
deal with modifying existing PMD entries.
This is similar to commit
|
||
|
298a32b132 |
kmemleak: powerpc: skip scanning holes in the .bss section
Commit
|
||
|
5b56d996dd |
mm/compaction.c: abort search if isolation fails
Running LTP oom01 in a tight loop or memory stress testing put the system
in a low-memory situation could triggers random memory corruption like
page flag corruption below due to in fast_isolate_freepages(), if
isolation fails, next_search_order() does not abort the search immediately
could lead to improper accesses.
UBSAN: Undefined behaviour in ./include/linux/mm.h:1195:50
index 7 is out of range for type 'zone [5]'
Call Trace:
dump_stack+0x62/0x9a
ubsan_epilogue+0xd/0x7f
__ubsan_handle_out_of_bounds+0x14d/0x192
__isolate_free_page+0x52c/0x600
compaction_alloc+0x886/0x25f0
unmap_and_move+0x37/0x1e70
migrate_pages+0x2ca/0xb20
compact_zone+0x19cb/0x3620
kcompactd_do_work+0x2df/0x680
kcompactd+0x1d8/0x6c0
kthread+0x32c/0x3f0
ret_from_fork+0x35/0x40
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:3124!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
RIP: 0010:__isolate_free_page+0x464/0x600
RSP: 0000:ffff888b9e1af848 EFLAGS: 00010007
RAX: 0000000030000000 RBX: ffff888c39fcf0f8 RCX: 0000000000000000
RDX: 1ffff111873f9e25 RSI: 0000000000000004 RDI: ffffed1173c35ef6
RBP: ffff888b9e1af898 R08: fffffbfff4fc2461 R09: fffffbfff4fc2460
R10: fffffbfff4fc2460 R11: ffffffffa7e12303 R12: 0000000000000008
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000007
FS: 0000000000000000(0000) GS:ffff888ba8e80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc7abc00000 CR3: 0000000752416004 CR4: 00000000001606a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
compaction_alloc+0x886/0x25f0
unmap_and_move+0x37/0x1e70
migrate_pages+0x2ca/0xb20
compact_zone+0x19cb/0x3620
kcompactd_do_work+0x2df/0x680
kcompactd+0x1d8/0x6c0
kthread+0x32c/0x3f0
ret_from_fork+0x35/0x40
Link: http://lkml.kernel.org/r/20190320192648.52499-1-cai@lca.pw
Fixes:
|
||
|
6b0868c820 |
mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints
Mikhail Gavrilo reported the following bug being triggered in a Fedora kernel based on 5.1-rc1 but it is relevant to a vanilla kernel. kernel: page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) kernel: ------------[ cut here ]------------ kernel: kernel BUG at include/linux/mm.h:1021! kernel: invalid opcode: 0000 [#1] SMP NOPTI kernel: CPU: 6 PID: 116 Comm: kswapd0 Tainted: G C 5.1.0-0.rc1.git1.3.fc31.x86_64 #1 kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1201 12/07/2018 kernel: RIP: 0010:__reset_isolation_pfn+0x244/0x2b0 kernel: Code: fe 06 e8 0f 8e fc ff 44 0f b6 4c 24 04 48 85 c0 0f 85 dc fe ff ff e9 68 fe ff ff 48 c7 c6 58 b7 2e 8c 4c 89 ff e8 0c 75 00 00 <0f> 0b 48 c7 c6 58 b7 2e 8c e8 fe 74 00 00 0f 0b 48 89 fa 41 b8 01 kernel: RSP: 0018:ffff9e2d03f0fde8 EFLAGS: 00010246 kernel: RAX: 0000000000000034 RBX: 000000000081f380 RCX: ffff8cffbddd6c20 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8cffbddd6c20 kernel: RBP: 0000000000000001 R08: 0000009898b94613 R09: 0000000000000000 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000100000 kernel: R13: 0000000000100000 R14: 0000000000000001 R15: ffffca7de07ce000 kernel: FS: 0000000000000000(0000) GS:ffff8cffbdc00000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007fc1670e9000 CR3: 00000007f5276000 CR4: 00000000003406e0 kernel: Call Trace: kernel: __reset_isolation_suitable+0x62/0x120 kernel: reset_isolation_suitable+0x3b/0x40 kernel: kswapd+0x147/0x540 kernel: ? finish_wait+0x90/0x90 kernel: kthread+0x108/0x140 kernel: ? balance_pgdat+0x560/0x560 kernel: ? kthread_park+0x90/0x90 kernel: ret_from_fork+0x27/0x50 He bisected it down to |
||
|
57b78a62e7 |
x86/uaccess, kasan: Fix KASAN vs SMAP
KASAN inserts extra code for every LOAD/STORE emitted by te compiler. Much of this code is simple and safe to run with AC=1, however the kasan_report() function, called on error, is most certainly not safe to call with AC=1. Therefore wrap kasan_report() in user_access_{save,restore}; which for x86 SMAP, saves/restores EFLAGS and clears AC before calling the real function. Also ensure all the functions are without __fentry__ hook. The function tracer is also not safe. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
0a8caf211b |
asm-generic/tlb: Remove tlb_table_flush()
There are no external users of this API (nor should there be); remove it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
fa0aafb8ac |
asm-generic/tlb: Remove tlb_flush_mmu_free()
As the comment notes; it is a potentially dangerous operation. Just use tlb_flush_mmu(), that will skip the (double) TLB invalidate if it really isn't needed anyway. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
b3fa8ed4e4 |
asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER
Since all architectures are now using it, it is redundant. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
1808d65b55 |
asm-generic/tlb: Remove arch_tlb*_mmu()
Now that all architectures are converted to the generic code, remove the arch hooks. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
952a31c9e6 |
asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y
Add the Kconfig option HAVE_MMU_GATHER_NO_GATHER to the generic mmu_gather code. If the option is set the mmu_gather will not track individual pages for delayed page free anymore. A platform that enables the option needs to provide its own implementation of the __tlb_remove_page_size() function to free pages. No change in behavior intended. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: heiko.carstens@de.ibm.com Cc: linux@armlinux.org.uk Cc: npiggin@gmail.com Link: http://lkml.kernel.org/r/20180918125151.31744-2-schwidefsky@de.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
96bc9567cb |
asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
Make issuing a TLB invalidate for page-table pages the normal case. The reason is twofold: - too many invalidates is safer than too few, - most architectures use the linux page-tables natively and would thus require this. Make it an opt-out, instead of an opt-in. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
ed6a79352c |
asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
Move the mmu_gather::page_size things into the generic code instead of PowerPC specific bits. No change in behavior intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
d2b2c6dd22 |
mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate
Our MIPS 1004Kc SoCs were seeing random userspace crashes with SIGILL and SIGSEGV that could not be traced back to a userspace code bug. They had all the magic signs of an I/D cache coherency issue. Now recently we noticed that the /proc/sys/vm/compact_memory interface was quite efficient at provoking this class of userspace crashes. Studying the code in mm/migrate.c there is a distinction made between migrating a page that is mapped at the instant of migration and one that is not mapped. Our problem turned out to be the non-mapped pages. For the non-mapped page the code performs a copy of the page content and all relevant meta-data of the page without doing the required D-cache maintenance. This leaves dirty data in the D-cache of the CPU and on the 1004K cores this data is not visible to the I-cache. A subsequent page-fault that triggers a mapping of the page will happily serve the process with potentially stale code. What about ARM then, this bug should have seen greater exposure? Well ARM became immune to this flaw back in 2010, see commit |
||
|
f5777bc2d9 |
mm/page_isolation.c: fix a wrong flag in set_migratetype_isolate()
Due to has_unmovable_pages() taking an incorrect irqsave flag instead of
the isolation flag in set_migratetype_isolate(), there are issues with
HWPOSION and error reporting where dump_page() is not called when there
is an unmovable page.
Link: http://lkml.kernel.org/r/20190320204941.53731-1-cai@lca.pw
Fixes:
|
||
|
c4efe484b5 |
mm/memory_hotplug.c: fix notification in offline error path
When start_isolate_page_range() returned -EBUSY in __offline_pages(), it
calls memory_notify(MEM_CANCEL_OFFLINE, &arg) with an uninitialized
"arg". As the result, it triggers warnings below. Also, it is only
necessary to notify MEM_CANCEL_OFFLINE after MEM_GOING_OFFLINE.
page:ffffea0001200000 count:1 mapcount:0 mapping:0000000000000000
index:0x0
flags: 0x3fffe000001000(reserved)
raw: 003fffe000001000 ffffea0001200008 ffffea0001200008 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: unmovable page
WARNING: CPU: 25 PID: 1665 at mm/kasan/common.c:665
kasan_mem_notifier+0x34/0x23b
CPU: 25 PID: 1665 Comm: bash Tainted: G W 5.0.0+ #94
Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20
10/25/2017
RIP: 0010:kasan_mem_notifier+0x34/0x23b
RSP: 0018:ffff8883ec737890 EFLAGS: 00010206
RAX: 0000000000000246 RBX: ff10f0f4435f1000 RCX: f887a7a21af88000
RDX: dffffc0000000000 RSI: 0000000000000020 RDI: ffff8881f221af88
RBP: ffff8883ec737898 R08: ffff888000000000 R09: ffffffffb0bddcd0
R10: ffffed103e857088 R11: ffff8881f42b8443 R12: dffffc0000000000
R13: 00000000fffffff9 R14: dffffc0000000000 R15: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560fbd31d730 CR3: 00000004049c6003 CR4: 00000000001606a0
Call Trace:
notifier_call_chain+0xbf/0x130
__blocking_notifier_call_chain+0x76/0xc0
blocking_notifier_call_chain+0x16/0x20
memory_notify+0x1b/0x20
__offline_pages+0x3e2/0x1210
offline_pages+0x11/0x20
memory_block_action+0x144/0x300
memory_subsys_offline+0xe5/0x170
device_offline+0x13f/0x1e0
state_store+0xeb/0x110
dev_attr_store+0x3f/0x70
sysfs_kf_write+0x104/0x150
kernfs_fop_write+0x25c/0x410
__vfs_write+0x66/0x120
vfs_write+0x15a/0x4f0
ksys_write+0xd2/0x1b0
__x64_sys_write+0x73/0xb0
do_syscall_64+0xeb/0xb78
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f14f75cc3b8
RSP: 002b:00007ffe84d01d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f14f75cc3b8
RDX: 0000000000000008 RSI: 0000563f8e433d70 RDI: 0000000000000001
RBP: 0000563f8e433d70 R08: 000000000000000a R09: 00007ffe84d018f0
R10: 000000000000000a R11: 0000000000000246 R12: 00007f14f789e780
R13: 0000000000000008 R14: 00007f14f7899740 R15: 0000000000000008
Link: http://lkml.kernel.org/r/20190320204255.53571-1-cai@lca.pw
Fixes:
|
||
|
5ae2efb1de |
mm/debug.c: fix __dump_page when mapping->host is not set
While debugging something, I added a dump_page() into do_swap_page(),
and I got the splat from below. The issue happens when dereferencing
mapping->host in __dump_page():
...
else if (mapping) {
pr_warn("%ps ", mapping->a_ops);
if (mapping->host->i_dentry.first) {
struct dentry *dentry;
dentry = container_of(mapping->host->i_dentry.first, struct dentry, d_u.d_alias);
pr_warn("name:\"%pd\" ", dentry);
}
}
...
Swap address space does not contain an inode information, and so
mapping->host equals NULL.
Although the dump_page() call was added artificially into
do_swap_page(), I am not sure if we can hit this from any other path, so
it looks worth fixing it. We can easily do that by checking
mapping->host first.
Link: http://lkml.kernel.org/r/20190318072931.29094-1-osalvador@suse.de
Fixes:
|
||
|
a7f40cfe3b |
mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
When MPOL_MF_STRICT was specified and an existing page was already on a node that does not follow the policy, mbind() should return -EIO. But commit |
||
|
6d6ea1e967 |
mm: add support for kmem caches in DMA32 zone
Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
v6.
This is a followup to the discussion in [1], [2].
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For L1 tables that are bigger than a page, we can just use
__get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
use GFP_DMA).
For L2 tables that only take 1KB, it would be a waste to allocate a full
page, so we considered 3 approaches:
1. This series, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2 page
tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable to reuse
freed fragments until the whole page is freed. [3]
This series is the most memory-efficient approach.
stable@ note:
We confirmed that this is a regression, and IOMMU errors happen on 4.19
and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
most likely starts from commit
|
||
|
9b7ea46a82 |
mm/hotplug: fix offline undo_isolate_page_range()
Commit |
||
|
44dc1b1fab |
mm/debug.c: add a cast to u64 for atomic64_read()
atomic64_read() on ppc64le returns "long int", so fix the same way as commit |
||
|
cae85cb8ad |
mm/memory.c: fix modifying of page protection by insert_pfn()
Aneesh has reported that PPC triggers the following warning when
excercising DAX code:
IP set_pte_at+0x3c/0x190
LR insert_pfn+0x208/0x280
Call Trace:
insert_pfn+0x68/0x280
dax_iomap_pte_fault.isra.7+0x734/0xa40
__xfs_filemap_fault+0x280/0x2d0
do_wp_page+0x48c/0xa40
__handle_mm_fault+0x8d0/0x1fd0
handle_mm_fault+0x140/0x250
__do_page_fault+0x300/0xd60
handle_page_fault+0x18
Now that is WARN_ON in set_pte_at which is
VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
The problem is that on some architectures set_pte_at() cannot cope with
a situation where there is already some (different) valid entry present.
Use ptep_set_access_flags() instead to modify the pfn which is built to
deal with modifying existing PTE.
Link: http://lkml.kernel.org/r/20190311084537.16029-1-jack@suse.cz
Fixes:
|
||
|
c412a769d2 |
kasan: fix variable 'tag' set but not used warning
set_tag() compiles away when CONFIG_KASAN_SW_TAGS=n, so make arch_kasan_set_tag() a static inline function to fix warnings below. mm/kasan/common.c: In function '__kasan_kmalloc': mm/kasan/common.c:475:5: warning: variable 'tag' set but not used [-Wunused-but-set-variable] u8 tag; ^~~ Link: http://lkml.kernel.org/r/20190307185244.54648-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
00206a69ee |
percpu: stop printing kernel addresses
Since commit |
||
|
f67e3fb489 |
device-dax for 5.1
* Replace the /sys/class/dax device model with /sys/bus/dax, and include a compat driver so distributions can opt-in to the new ABI. * Allow for an alternative driver for the device-dax address-range * Introduce the 'kmem' driver to hotplug / assign a device-dax address-range to the core-mm. * Arrange for the device-dax target-node to be onlined so that the newly added memory range can be uniquely referenced by numa apis. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJchWpGAAoJEB7SkWpmfYgCJk8P/0Q1DINszUDO/vKjJ09cDs9P Jw3it6GBIL50rDOu9QdcprSpwYDD0h1mLAV/m6oa3bVO+p4uWGvnxaxRx2HN2c/v vhZFtUDpHlqR63vzWMNVKRprYixCRJDUr6xQhhCcE3ak/ELN6w7LWfikKVWv15UL MfR96IQU38f+xRda/zSXnL9606Dvkvu/inEHj84lRcHIwj3sQAUalrE8bR3O32gZ bDg/l5kzT49o8ZXUo/TegvRSSSZpJmOl2DD0RW+ax5q3NI2bOXFrVDUKBKxf/hcQ E/V9i57TrqQx0GqRhnU7rN/v53cFZGGs31TEEIB/xs3bzCnADxwXcjL5b5K005J6 vJjBA2ODBewHFK3uVx46Hy1iV4eCtZWj4QrMnrjdSrjXOfbF5GTbWOhPFgoq7TWf S7VqFEf3I2gDPaMq4o8Ej1kLH4HMYeor2NSOZjyvGn87rSZ3ZIQguwbaNIVl+itz gdDt0ZOU0BgOBkV+rZIeZDaGdloWCHcDPL15CkZaOZyzdWhfEZ7dod6ad+9udilU EUPH62RgzXZtfm5zpebYyjNVLbb9pLZ0nT+UypyGR6zqWx1SqU3mXi63NFXPco+x XA9j//edPeI6NHg2CXLEh8DLuCg3dG1zWRJANkiF+niBwyCR8CHtGWAoY6soXbKe 2UrXGcIfXxyJ8V9v8v4q =hfa3 -----END PGP SIGNATURE----- Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull device-dax updates from Dan Williams: "New device-dax infrastructure to allow persistent memory and other "reserved" / performance differentiated memories, to be assigned to the core-mm as "System RAM". Some users want to use persistent memory as additional volatile memory. They are willing to cope with potential performance differences, for example between DRAM and 3D Xpoint, and want to use typical Linux memory management apis rather than a userspace memory allocator layered over an mmap() of a dax file. The administration model is to decide how much Persistent Memory (pmem) to use as System RAM, create a device-dax-mode namespace of that size, and then assign it to the core-mm. The rationale for device-dax is that it is a generic memory-mapping driver that can be layered over any "special purpose" memory, not just pmem. On subsequent boots udev rules can be used to restore the memory assignment. One implication of using pmem as RAM is that mlock() no longer keeps data off persistent media. For this reason it is recommended to enable NVDIMM Security (previously merged for 5.0) to encrypt pmem contents at rest. We considered making this recommendation an actively enforced requirement, but in the end decided to leave it as a distribution / administrator policy to allow for emulation and test environments that lack security capable NVDIMMs. Summary: - Replace the /sys/class/dax device model with /sys/bus/dax, and include a compat driver so distributions can opt-in to the new ABI. - Allow for an alternative driver for the device-dax address-range - Introduce the 'kmem' driver to hotplug / assign a device-dax address-range to the core-mm. - Arrange for the device-dax target-node to be onlined so that the newly added memory range can be uniquely referenced by numa apis" NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because we currently have special - and very annoying rules in the kernel about accessing PMEM only with the "MC safe" accessors, because machine checks inside the regular repeat string copy functions can be fatal in some (not described) circumstances. And apparently the PMEM modules can cause that a lot more than regular RAM. The argument is that this happens because PMEM doesn't necessarily get scrubbed at boot like RAM does, but that is planned to be added for the user space tooling. Quoting Dan from another email: "The exposure can be reduced in the volatile-RAM case by scanning for and clearing errors before it is onlined as RAM. The userspace tooling for that can be in place before v5.1-final. There's also runtime notifications of errors via acpi_nfit_uc_error_notify() from background scrubbers on the DIMM devices. With that mechanism the kernel could proactively clear newly discovered poison in the volatile case, but that would be additional development more suitable for v5.2. I understand the concern, and the need to highlight this issue by tapping the brakes on feature development, but I don't see PMEM as RAM making the situation worse when the exposure is also there via DAX in the PMEM case. Volatile-RAM is arguably a safer use case since it's possible to repair pages where the persistent case needs active application coordination" * tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: device-dax: "Hotplug" persistent memory for use like normal RAM mm/resource: Let walk_system_ram_range() search child resources mm/memory-hotplug: Allow memory resources to be children mm/resource: Move HMM pr_debug() deeper into resource code mm/resource: Return real error codes from walk failures device-dax: Add a 'modalias' attribute to DAX 'bus' devices device-dax: Add a 'target_node' attribute device-dax: Auto-bind device after successful new_id acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node device-dax: Add /sys/class/dax backwards compatibility device-dax: Add support for a dax override driver device-dax: Move resource pinning+mapping into the common driver device-dax: Introduce bus + driver model device-dax: Start defining a dax bus model device-dax: Remove multi-resource infrastructure device-dax: Kill dax_region base device-dax: Kill dax_region ida |
||
|
8b0f9fa2e0 |
filemap: add a comment about FAULT_FLAG_RETRY_NOWAIT behavior
I thought Josef Bacik's patch to drop the mmap_sem was buggy, because when looking at the error cases, there was one case where we returned VM_FAULT_RETRY without actually dropping the mmap_sem. Josef had to explain to me (using small words) that yes, that's actually what we're supposed to do, and his patch was correct. Which not only convinced me he knew what he was doing and I should stop arguing with him, but also that I should add a comment to the case I was confused about. Patiently-pointed-out-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6b4c9f4469 |
filemap: drop the mmap_sem for all blocking operations
Currently we only drop the mmap_sem if there is contention on the page lock. The idea is that we issue readahead and then go to lock the page while it is under IO and we want to not hold the mmap_sem during the IO. The problem with this is the assumption that the readahead does anything. In the case that the box is under extreme memory or IO pressure we may end up not reading anything at all for readahead, which means we will end up reading in the page under the mmap_sem. Even if the readahead does something, it could get throttled because of io pressure on the system and the process is in a lower priority cgroup. Holding the mmap_sem while doing IO is problematic because it can cause system-wide priority inversions. Consider some large company that does a lot of web traffic. This large company has load balancing logic in it's core web server, cause some engineer thought this was a brilliant plan. This load balancing logic gets statistics from /proc about the system, which trip over processes mmap_sem for various reasons. Now the web server application is in a protected cgroup, but these other processes may not be, and if they are being throttled while their mmap_sem is held we'll stall, and cause this nice death spiral. Instead rework filemap fault path to drop the mmap sem at any point that we may do IO or block for an extended period of time. This includes while issuing readahead, locking the page, or needing to call ->readpage because readahead did not occur. Then once we have a fully uptodate page we can return with VM_FAULT_RETRY and come back again to find our nicely in-cache page that was gotten outside of the mmap_sem. This patch also adds a new helper for locking the page with the mmap_sem dropped. This doesn't make sense currently as generally speaking if the page is already locked it'll have been read in (unless there was an error) before it was unlocked. However a forthcoming patchset will change this with the ability to abort read-ahead bio's if necessary, making it more likely that we could contend for a page lock and still have a not uptodate page. This allows us to deal with this case by grabbing the lock and issuing the IO without the mmap_sem held, and then returning VM_FAULT_RETRY to come back around. [josef@toxicpanda.com: v6] Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com [kirill@shutemov.name: fix race in filemap_fault()] Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1 [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com Signed-off-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com Cc: Dave Chinner <david@fromorbit.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a75d4c3337 |
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6. Now that we have proper isolation in place with cgroups2 we have started going through and fixing the various priority inversions. Most are all gone now, but this one is sort of weird since it's not necessarily a priority inversion that happens within the kernel, but rather because of something userspace does. We have giant applications that we want to protect, and parts of these giant applications do things like watch the system state to determine how healthy the box is for load balancing and such. This involves running 'ps' or other such utilities. These utilities will often walk /proc/<pid>/whatever, and these files can sometimes need to down_read(&task->mmap_sem). Not usually a big deal, but we noticed when we are stress testing that sometimes our protected application has latency spikes trying to get the mmap_sem for tasks that are in lower priority cgroups. This is because any down_write() on a semaphore essentially turns it into a mutex, so even if we currently have it held for reading, any new readers will not be allowed on to keep from starving the writer. This is fine, except a lower priority task could be stuck doing IO because it has been throttled to the point that its IO is taking much longer than normal. But because a higher priority group depends on this completing it is now stuck behind lower priority work. In order to avoid this particular priority inversion we want to use the existing retry mechanism to stop from holding the mmap_sem at all if we are going to do IO. This already exists in the read case sort of, but needed to be extended for more than just grabbing the page lock. With io.latency we throttle at submit_bio() time, so the readahead stuff can block and even page_cache_read can block, so all these paths need to have the mmap_sem dropped. The other big thing is ->page_mkwrite. btrfs is particularly shitty here because we have to reserve space for the dirty page, which can be a very expensive operation. We use the same retry method as the read path, and simply cache the page and verify the page is still setup properly the next pass through ->page_mkwrite(). I've tested these patches with xfstests and there are no regressions. This patch (of 3): If we do not have a page at filemap_fault time we'll do this weird forced page_cache_read thing to populate the page, and then drop it again and loop around and find it. This makes for 2 ways we can read a page in filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP flag so that pagecache_get_page() will return a unlocked page that's in pagecache. Then use the normal page locking and readpage logic already in filemap_fault. This simplifies the no page in page cache case significantly. [akpm@linux-foundation.org: fix comment text] [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case] Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com Signed-off-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2a1180f1bd |
filemap: pass vm_fault to the mmap ra helpers
All of the arguments to these functions come from the vmf. Cut down on the amount of arguments passed by simply passing in the vmf to these two helpers. Link: http://lkml.kernel.org/r/20181211173801.29535-3-josef@toxicpanda.com Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d33d9f3dd9 |
percpu: use chunk scan_hint to skip some scanning
Just like blocks, chunks now maintain a scan_hint. This can be used to skip some scanning by promoting the scan_hint to be the contig_hint. The chunk's scan_hint is primarily updated on the backside and relies on full scanning when a block becomes free or the free region spans across blocks. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
92c14cab43 |
percpu: convert chunk hints to be based on pcpu_block_md
As mentioned in the last patch, a chunk's hints are no different than a block just responsible for more bits. This converts chunk level hints to use a pcpu_block_md to maintain them. This lets us reuse the same hint helper functions as a block. The left_free and right_free are unused by the chunk's pcpu_block_md. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
047924c968 |
percpu: make pcpu_block_md generic
In reality, a chunk is just a block covering a larger number of bits. The hints themselves are one in the same. Rather than maintaining the hints separately, first introduce nr_bits to genericize pcpu_block_update() to correctly maintain block->right_free. The next patch will convert chunk hints to be managed as a pcpu_block_md. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
da3afdd5bb |
percpu: use block scan_hint to only scan forward
Blocks now remember the latest scan_hint. This can be used on the allocation path as when a contig_hint is broken, we can promote the scan_hint to the contig_hint and scan forward from there. This works because pcpu_block_refresh_hint() is only called on the allocation path while block free regions are updated manually in pcpu_block_update_hint_free(). Signed-off-by: Dennis Zhou <dennis@kernel.org> |
||
|
b89462a9c5 |
percpu: remember largest area skipped during allocation
Percpu allocations attempt to do first fit by scanning forward from the first_free of a block. However, fragmentation from allocation requests can cause holes not seen by block hint update functions. To address this, create a local version of bitmap_find_next_zero_area_off() that remembers the largest area skipped over. The caveat is that it only sees regions skipped over due to not fitting, not regions skipped due to alignment. Prior to updating the scan_hint, a scan backwards is done to try and recover free bits skipped due to alignment. While this can cause scanning to miss earlier possible free areas, smaller allocations will eventually fill those holes due to first fit. Signed-off-by: Dennis Zhou <dennis@kernel.org> |
||
|
382b88e961 |
percpu: add block level scan_hint
Fragmentation can cause both blocks and chunks to have an early first_firee bit available, but only able to satisfy allocations much later on. This patch introduces a scan_hint to help mitigate some unnecessary scanning. The scan_hint remembers the largest area prior to the contig_hint. If the contig_hint == scan_hint, then scan_hint_start > contig_hint_start. This is necessary for scan_hint discovery when refreshing a block. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
b239f7daf5 |
percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
Previously, block size was flexible based on the constraint that the GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the overhead that keeping a floating number of populated free pages required scanning over the free regions of a chunk. Setting the block size to be fixed at PAGE_SIZE lets us know when an empty page becomes used as we will break a full contig_hint of a block. This means we no longer have to scan the whole chunk upon breaking a contig_hint which empty page management piggybacked off. A later patch takes advantage of this to optimize the allocation path by only scanning forward using the scan_hint introduced later too. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
8744d85942 |
percpu: relegate chunks unusable when failing small allocations
In certain cases, requestors of percpu memory may want specific alignments. However, it is possible to end up in situations where the contig_hint matches, but the alignment does not. This causes excess scanning of chunks that will fail. To prevent this, if a small allocation fails (< 32B), the chunk is moved to the empty list. Once an allocation is freed from that chunk, it is placed back into rotation. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
3e54097beb |
percpu: manage chunks based on contig_bits instead of free_bytes
When a chunk becomes fragmented, it can end up having a large number of small allocation areas free. The free_bytes sorting of chunks leads to unnecessary checking of chunks that cannot satisfy the allocation. Switch to contig_bits sorting to prevent scanning chunks that may not be able to service the allocation request. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
d9f3a01eeb |
percpu: introduce helper to determine if two regions overlap
While block hints were always accurate, it's possible when spanning across blocks that we miss updating the chunk's contig_hint. Rather than rely on correctness of the boundaries of hints, do a full overlap comparison. A future patch introduces the scan_hint which makes the contig_hint slightly fuzzy as they can at times be smaller than the actual hint. Signed-off-by: Dennis Zhou <dennis@kernel.org> |
||
|
8c43004af0 |
percpu: do not search past bitmap when allocating an area
pcpu_find_block_fit() guarantees that a fit is found within PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as it compares against the block's contig_hint. This can lead to incorrectly scanning past the end of the bitmap. The behavior was okay given the check after for bit_off >= end and the correctness of the hints from pcpu_find_block_fit(). This patch fixes this by bounding the end offset by the number of bits in a chunk. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
8e5a2b9893 |
percpu: update free path with correct new free region
When updating the chunk's contig_hint on the free path of a hint that does not touch the page boundaries, it was incorrectly using the starting offset of the free region and the block's contig_hint. This could lead to incorrect assumptions about fit given a size and better alignment of the start. Fix this by using (end - start) as this is only called when updating a hint within a block. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> |
||
|
a667cb7a94 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: - a few misc things - the rest of MM - remove flex_arrays, replace with new simple radix-tree implementation * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits) Drop flex_arrays sctp: convert to genradix proc: commit to genradix generic radix trees selinux: convert to kvmalloc md: convert to kvmalloc openvswitch: convert to kvmalloc of: fix kmemleak crash caused by imbalance in early memory reservation mm: memblock: update comments and kernel-doc memblock: split checks whether a region should be skipped to a helper function memblock: remove memblock_{set,clear}_region_flags memblock: drop memblock_alloc_*_nopanic() variants memblock: memblock_alloc_try_nid: don't panic treewide: add checks for the return value of memblock_alloc*() swiotlb: add checks for the return value of memblock_alloc*() init/main: add checks for the return value of memblock_alloc*() mm/percpu: add checks for the return value of memblock_alloc*() sparc: add checks for the return value of memblock_alloc*() ia64: add checks for the return value of memblock_alloc*() arch: don't memset(0) memory returned by memblock_alloc() ... |
||
|
a2974133b7 |
mm: memblock: update comments and kernel-doc
* Remove comments mentioning bootmem * Extend "DOC: memblock overview" * Add kernel-doc comments for several more functions [akpm@linux-foundation.org: fix copy-n-paste error] Link: http://lkml.kernel.org/r/1549626347-25461-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c9a688a3e9 |
memblock: split checks whether a region should be skipped to a helper function
__next_mem_range() and __next_mem_range_rev() duplicate the code that checks whether a region should be skipped because of node or flags incompatibility. Split this code into a helper function. Link: http://lkml.kernel.org/r/1549455025-17706-3-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
fe145124db |
memblock: remove memblock_{set,clear}_region_flags
The memblock API provides dedicated helpers to set or clear a flag on a memory region, e.g. memblock_{mark,clear}_hotplug(). The memblock_{set,clear}_region_flags() functions are used only by the memblock internal function that adjusts the region flags. Drop these functions and use open-coded implementation instead. Link: http://lkml.kernel.org/r/1549455025-17706-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
26fb3dae0a |
memblock: drop memblock_alloc_*_nopanic() variants
As all the memblock allocation functions return NULL in case of error rather than panic(), the duplicates with _nopanic suffix can be removed. Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Petr Mladek <pmladek@suse.com> [printk] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c0dbe825a9 |
memblock: memblock_alloc_try_nid: don't panic
As all the memblock_alloc*() users are now checking the return value and panic() in case of error, the panic() call can be removed from the core memblock allocator, namely memblock_alloc_try_nid(). Link: http://lkml.kernel.org/r/1548057848-15136-21-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
8a7f97b902 |
treewide: add checks for the return value of memblock_alloc*()
Add check for the return value of memblock_alloc*() functions and call panic() in case of error. The panic message repeats the one used by panicing memblock allocators with adjustment of parameters to include only relevant ones. The replacement was mostly automated with semantic patches like the one below with manual massaging of format strings. @@ expression ptr, size, align; @@ ptr = memblock_alloc(size, align); + if (!ptr) + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align); [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type] Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc] Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails] Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx [akpm@linux-foundation.org: fix xtensa printk warning] Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Reviewed-by: Guo Ren <ren_guo@c-sky.com> [c-sky] Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390] Reviewed-by: Juergen Gross <jgross@suse.com> [Xen] Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Max Filippov <jcmvbkbc@gmail.com> [xtensa] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f655f40537 |
mm/percpu: add checks for the return value of memblock_alloc*()
Add panic() calls if memblock_alloc() returns NULL. The panic() format duplicates the one used by memblock itself and in order to avoid explosion with long parameters list replace open coded allocation size calculations with a local variable. Link: http://lkml.kernel.org/r/1548057848-15136-17-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c366ea89fa |
memblock: make memblock_find_in_range_node() and choose_memblock_flags() static
These functions are not used outside memblock. Make them static. Link: http://lkml.kernel.org/r/1548057848-15136-12-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
92d12f9544 |
memblock: refactor internal allocation functions
Currently, memblock has several internal functions with overlapping functionality. They all call memblock_find_in_range_node() to find free memory and then reserve the allocated range and mark it with kmemleak. However, there is difference in the allocation constraints and in fallback strategies. The allocations returning physical address first attempt to find free memory on the specified node within mirrored memory regions, then retry on the same node without the requirement for memory mirroring and finally fall back to all available memory. The allocations returning virtual address start with clamping the allowed range to memblock.current_limit, attempt to allocate from the specified node from regions with mirroring and with user defined minimal address. If such allocation fails, next attempt is done with node restriction lifted. Next, the allocation is retried with minimal address reset to zero and at last without the requirement for mirrored regions. Let's consolidate various fallbacks handling and make them more consistent for physical and virtual variants. Most of the fallback handling is moved to memblock_alloc_range_nid() and it now handles node and mirror fallbacks. The memblock_alloc_internal() uses memblock_alloc_range_nid() to get a physical address of the allocated range and converts it to virtual address. The fallback for allocation below the specified minimal address remains in memblock_alloc_internal() because memblock_alloc_range_nid() is used by CMA with exact requirement for lower bounds. The memblock_phys_alloc_nid() function is completely dropped as it is not used anywhere outside memblock and its only usage can be replaced by a call to memblock_alloc_range_nid(). [rppt@linux.ibm.com: fix parameter order in memblock_phys_alloc_try_nid()] Link: http://lkml.kernel.org/r/20190203113915.GC8620@rapoport-lnx Link: http://lkml.kernel.org/r/1548057848-15136-11-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0ba9e6edd4 |
memblock: drop memblock_alloc_base()
The memblock_alloc_base() function tries to allocate a memory up to the limit specified by its max_addr parameter and panics if the allocation fails. Replace its usage with memblock_phys_alloc_range() and make the callers check the return value and panic in case of error. Link: http://lkml.kernel.org/r/1548057848-15136-10-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
42b46aeff2 |
memblock: drop __memblock_alloc_base()
The __memblock_alloc_base() function tries to allocate a memory up to the limit specified by its max_addr parameter. Depending on the value of this parameter, the __memblock_alloc_base() can is replaced with the appropriate memblock_phys_alloc*() variant. Link: http://lkml.kernel.org/r/1548057848-15136-9-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Rob Herring <robh@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ecc3e771f4 |
memblock: memblock_phys_alloc(): don't panic
Make the memblock_phys_alloc() function an inline wrapper for memblock_phys_alloc_range() and update the memblock_phys_alloc() callers to check the returned value and panic in case of error. Link: http://lkml.kernel.org/r/1548057848-15136-8-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
337555744e |
memblock: memblock_phys_alloc_try_nid(): don't panic
The memblock_phys_alloc_try_nid() function tries to allocate memory from the requested node and then falls back to allocation from any node in the system. The memblock_alloc_base() fallback used by this function panics if the allocation fails. Replace the memblock_alloc_base() fallback with the direct call to memblock_alloc_range_nid() and update the memblock_phys_alloc_try_nid() callers to check the returned value and panic in case of error. Link: http://lkml.kernel.org/r/1548057848-15136-7-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
8a770c2a83 |
memblock: emphasize that memblock_alloc_range() returns a physical address
Rename memblock_alloc_range() to memblock_phys_alloc_range() to emphasize that it returns a physical address. While on it, remove the 'enum memblock_flags' parameter from this function as its only user anyway sets it to MEMBLOCK_NONE, which is the default for the most of memblock allocations. Link: http://lkml.kernel.org/r/1548057848-15136-6-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
53d818d274 |
memblock: drop memblock_alloc_base_nid()
memblock_alloc_base_nid() is a oneliner wrapper for memblock_alloc_range_nid() without any side effect. Replace it's usage by the direct calls to memblock_alloc_range_nid(). Link: http://lkml.kernel.org/r/1548057848-15136-5-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Christoph Hellwig <hch@lst.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Guo Ren <ren_guo@c-sky.com> [c-sky] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Juergen Gross <jgross@suse.com> [Xen] Cc: Mark Salter <msalter@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Paul Burton <paul.burton@mips.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b57e622e6d |
mm/hmm: convert to use vm_fault_t
Convert to use vm_fault_t type as return type for fault handler. kbuild reported warning during testing of *mm-create-the-new-vm_fault_t-type.patch* available in below link - https://patchwork.kernel.org/patch/10752741/ kernel/memremap.c:46:34: warning: incorrect type in return expression (different base types) kernel/memremap.c:46:34: expected restricted vm_fault_t kernel/memremap.c:46:34: got int This patch has fixed the warnings and also hmm_devmem_fault() is converted to return vm_fault_t to avoid further warnings. [sfr@canb.auug.org.au: drm/nouveau/dmem: update for struct hmm_devmem_ops member change] Link: http://lkml.kernel.org/r/20190220174407.753d94e5@canb.auug.org.au Link: http://lkml.kernel.org/r/20190110145900.GA1317@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d14d7f14f1 |
xen: fixes and features for 5.1-rc1
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCXIYrgwAKCRCAXGG7T9hj viyuAP4/bKpQ8QUp2V6ddkyEG4NTkA7H87pqQQsxJe9sdoyRRwD5AReS7oitoRS/ cm6SBpwdaPRX/hfVvT2/h1GWxkvDFgA= =8Zfa -----END PGP SIGNATURE----- Merge tag 'for-linus-5.1a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: "xen fixes and features: - remove fallback code for very old Xen hypervisors - three patches for fixing Xen dom0 boot regressions - an old patch for Xen PCI passthrough which was never applied for unknown reasons - some more minor fixes and cleanup patches" * tag 'for-linus-5.1a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: fix dom0 boot on huge systems xen, cpu_hotplug: Prevent an out of bounds access xen: remove pre-xen3 fallback handlers xen/ACPI: Switch to bitmap_zalloc() x86/xen: dont add memory above max allowed allocation x86: respect memory size limiting via mem= parameter xen/gntdev: Check and release imported dma-bufs on close xen/gntdev: Do not destroy context while dma-bufs are in use xen/pciback: Don't disable PCI_COMMAND on PCI device reset. xen-scsiback: mark expected switch fall-through xen: mark expected switch fall-through |
||
|
a50243b1dd |
5.1 Merge Window Pull Request
This has been a slightly more active cycle than normal with ongoing core changes and quite a lot of collected driver updates. - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe - A new data transfer mode for HFI1 giving higher performance - Significant functional and bug fix update to the mlx5 On-Demand-Paging MR feature - A chip hang reset recovery system for hns - Change mm->pinned_vm to an atomic64 - Update bnxt_re to support a new 57500 chip - A sane netlink 'rdma link add' method for creating rxe devices and fixing the various unregistration race conditions in rxe's unregister flow - Allow lookup up objects by an ID over netlink - Various reworking of the core to driver interface: * Drivers should not assume umem SGLs are in PAGE_SIZE chunks * ucontext is accessed via udata not other means * Start to make the core code responsible for object memory allocation * Drivers should convert struct device to struct ib_device via a helper * Drivers have more tools to avoid use after unregister problems -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlyAJYYACgkQOG33FX4g mxrWwQ/+OyAx4Moru7Aix0C6GWxTJp/wKgw21CS3reZxgLai6x81xNYG/s2wCNjo IccObVd7mvzyqPdxOeyHBsJBbQDqWvoD6O2duH8cqGMgBRgh3CSdUep2zLvPpSAx 2W1SvWYCLDnCuarboFrCA8c4AN3eCZiqD7z9lHyFQGjy3nTUWzk1uBaOP46uaiMv w89N8EMdXJ/iY6ONzihvE05NEYbMA8fuvosKLLNdghRiHIjbMQU8SneY23pvyPDd ZziPu9NcO3Hw9OVbkwtJp47U3KCBgvKHmnixyZKkikjiD+HVoABw2IMwcYwyBZwP Bic/ddONJUvAxMHpKRnQaW7znAiHARk21nDG28UAI7FWXH/wMXgicMp6LRcNKqKF vqXdxHTKJb0QUR4xrYI+eA8ihstss7UUpgSgByuANJ0X729xHiJtlEvPb1DPo1Dz 9CB4OHOVRl5O8sA5Jc6PSusZiKEpvWoyWbdmw0IiwDF5pe922VLl5Nv88ta+sJ38 v2Ll5AgYcluk7F3599Uh9D7gwp5hxW2Ph3bNYyg2j3HP4/dKsL9XvIJPXqEthgCr 3KQS9rOZfI/7URieT+H+Mlf+OWZhXsZilJG7No0fYgIVjgJ00h3SF1/299YIq6Qp 9W7ZXBfVSwLYA2AEVSvGFeZPUxgBwHrSZ62wya4uFeB1jyoodPk= =p12E -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "This has been a slightly more active cycle than normal with ongoing core changes and quite a lot of collected driver updates. - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe - A new data transfer mode for HFI1 giving higher performance - Significant functional and bug fix update to the mlx5 On-Demand-Paging MR feature - A chip hang reset recovery system for hns - Change mm->pinned_vm to an atomic64 - Update bnxt_re to support a new 57500 chip - A sane netlink 'rdma link add' method for creating rxe devices and fixing the various unregistration race conditions in rxe's unregister flow - Allow lookup up objects by an ID over netlink - Various reworking of the core to driver interface: - drivers should not assume umem SGLs are in PAGE_SIZE chunks - ucontext is accessed via udata not other means - start to make the core code responsible for object memory allocation - drivers should convert struct device to struct ib_device via a helper - drivers have more tools to avoid use after unregister problems" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits) net/mlx5: ODP support for XRC transport is not enabled by default in FW IB/hfi1: Close race condition on user context disable and close RDMA/umem: Revert broken 'off by one' fix RDMA/umem: minor bug fix in error handling path RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp cxgb4: kfree mhp after the debug print IB/rdmavt: Fix concurrency panics in QP post_send and modify to error IB/rdmavt: Fix loopback send with invalidate ordering IB/iser: Fix dma_nents type definition IB/mlx5: Set correct write permissions for implicit ODP MR bnxt_re: Clean cq for kernel consumers only RDMA/uverbs: Don't do double free of allocated PD RDMA: Handle ucontext allocations by IB/core RDMA/core: Fix a WARN() message bnxt_re: fix the regression due to changes in alloc_pbl IB/mlx4: Increase the timeout for CM cache IB/core: Abort page fault handler silently during owning process exit IB/mlx5: Validate correct PD before prefetch MR IB/mlx5: Protect against prefetch of invalid MR RDMA/uverbs: Store PR pointer before it is overwritten ... |
||
|
f86727f8bd |
Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm cleanup from Ingo Molnar: "A single GUP cleanup" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm/gup: Remove the 'write' parameter from gup_fast_permitted() |
||
|
8d521d94da |
Merge branch 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou: "There are 2 minor changes to the percpu allocator this merge window: - for loop condition that could be out of bounds on multi-socket UP - cosmetic removal of pcpu_group_offsets[0] in UP code as it is 0 There has been an interest in having better alignment with percpu allocations. This has caused a performance regression in at least one reported workload. I have a series out which adds scan hints to the allocator as well as some other performance oriented changes. I hope to have this queued for v5.2 soon" * 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: km: no need to consider pcpu_group_offsets[0] percpu: use nr_groups as check condition |
||
|
ea2c3f6f55 |
mm,mremap: bail out earlier in mremap_to under map pressure
When using mremap() syscall in addition to MREMAP_FIXED flag, mremap() calls mremap_to() which does the following: 1) unmaps the destination region where we are going to move the map 2) If the new region is going to be smaller, we unmap the last part of the old region Then, we will eventually call move_vma() to do the actual move. move_vma() checks whether we are at least 4 maps below max_map_count before going further, otherwise it bails out with -ENOMEM. The problem is that we might have already unmapped the vma's in steps 1) and 2), so it is not possible for userspace to figure out the state of the vmas after it gets -ENOMEM, and it gets tricky for userspace to clean up properly on error path. While it is true that we can return -ENOMEM for more reasons (e.g: see may_expand_vm() or move_page_tables()), I think that we can avoid this scenario if we check early in mremap_to() if the operation has high chances to succeed map-wise. Should that not be the case, we can bail out before we even try to unmap anything, so we make sure the vma's are left untouched in case we are likely to be short of maps. The thumb-rule now is to rely on the worst-scenario case we can have. That is when both vma's (old region and new region) are going to be split in 3, so we get two more maps to the ones we already hold (one per each). If current map count + 2 maps still leads us to 4 maps below the threshold, we are going to pass the check in move_vma(). Of course, this is not free, as it might generate false positives when it is true that we are tight map-wise, but the unmap operation can release several vma's leading us to a good state. Another approach was also investigated [1], but it may be too much hassle for what it brings. [1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/ Link: http://lkml.kernel.org/r/20190226091314.18446-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d778015ac9 |
mm/sparse: fix a bad comparison
next_present_section_nr() could only return an unsigned number -1, so
just check it specifically where compilers will convert -1 to unsigned
if needed.
mm/sparse.c: In function 'sparse_init_nid':
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
((section_nr >= 0) && \
^~
mm/sparse.c:478:2: note: in expansion of macro
'for_each_present_section_nr'
for_each_present_section_nr(pnum_begin, pnum) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
((section_nr >= 0) && \
^~
mm/sparse.c:497:2: note: in expansion of macro
'for_each_present_section_nr'
for_each_present_section_nr(pnum_begin, pnum) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/sparse.c: In function 'sparse_init':
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
((section_nr >= 0) && \
^~
mm/sparse.c:520:2: note: in expansion of macro
'for_each_present_section_nr'
for_each_present_section_nr(pnum_begin + 1, pnum_end) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~
Link: http://lkml.kernel.org/r/20190228181839.86504-1-cai@lca.pw
Fixes:
|
||
|
fc8efd2ddf |
mm/memory.c: do_fault: avoid usage of stale vm_area_struct
LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8. This is a stress test, where one thread mmaps/writes/munmaps memory area and other thread is trying to read from it: CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51 Hardware name: IBM 2964 N63 400 (z/VM 6.4.0) Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8) Call Trace: ([<0000000000000000>] (null)) [<00000000001adae4>] lock_acquire+0xec/0x258 [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98 [<000000000012a780>] page_table_free+0x48/0x1a8 [<00000000002f6e54>] do_fault+0xdc/0x670 [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0 [<00000000002fb138>] handle_mm_fault+0x1b0/0x320 [<00000000001248cc>] do_dat_exception+0x19c/0x2c8 [<000000000080e5ee>] pgm_check_handler+0x19e/0x200 page_table_free() is called with NULL mm parameter, but because "0" is a valid address on s390 (see S390_lowcore), it keeps going until it eventually crashes in lockdep's lock_acquire. This crash is reproducible at least since 4.14. Problem is that "vmf->vma" used in do_fault() can become stale. Because mmap_sem may be released, other threads can come in, call munmap() and cause "vma" be returned to kmem cache, and get zeroed/re-initialized and re-used: handle_mm_fault | __handle_mm_fault | do_fault | vma = vmf->vma | do_read_fault | __do_fault | vma->vm_ops->fault(vmf); | mmap_sem is released | | | do_munmap() | remove_vma_list() | remove_vma() | vm_area_free() | # vma is released | ... | # same vma is allocated | # from kmem cache | do_mmap() | vm_area_alloc() | memset(vma, 0, ...) | pte_free(vma->vm_mm, ...); | page_table_free | spin_lock_bh(&mm->context.lock);| <crash> | Cache mm_struct to avoid using potentially stale "vma". [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com Signed-off-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
70516b936b |
mm/huge_memory.c: fix "orig_pud" set but not used
Commit
|
||
|
cd02cf1ace |
mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
When onlining a memory block with DEBUG_PAGEALLOC, it unmaps the pages in the block from kernel, However, it does not map those pages while offlining at the beginning. As the result, it triggers a panic below while onlining on ppc64le as it checks if the pages are mapped before unmapping. However, the imbalance exists for all arches where double-unmappings could happen. Therefore, let kernel map those pages in generic_online_page() before they have being freed into the page allocator for the first time where it will set the page count to one. On the other hand, it works fine during the boot, because at least for IBM POWER8, it does, early_setup early_init_mmu harsh__early_init_mmu htab_initialize [1] htab_bolt_mapping [2] where it effectively map all memblock regions just like kernel_map_linear_page(), so later mem_init() -> memblock_free_all() will unmap them just fine without any imbalance. On other arches without this imbalance checking, it still unmap them once at the most. [1] for_each_memblock(memory, reg) { base = (unsigned long)__va(reg->base); size = reg->size; DBG("creating mapping for region: %lx..%lx (prot: %lx)\n", base, size, prot); BUG_ON(htab_bolt_mapping(base, base + size, __pa(base), prot, mmu_linear_psize, mmu_kernel_ssize)); } [2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80; kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815! Oops: Exception in kernel mode, sig: 5 [#1] LE SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA pSeries CPU: 2 PID: 4298 Comm: bash Not tainted 5.0.0-rc7+ #15 NIP: c000000000062670 LR: c00000000006265c CTR: 0000000000000000 REGS: c0000005bf8a75b0 TRAP: 0700 Not tainted (5.0.0-rc7+) MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28422842 XER: 00000000 CFAR: c000000000804f44 IRQMASK: 1 NIP [c000000000062670] __kernel_map_pages+0x2e0/0x4f0 LR [c00000000006265c] __kernel_map_pages+0x2cc/0x4f0 Call Trace: __kernel_map_pages+0x2cc/0x4f0 free_unref_page_prepare+0x2f0/0x4d0 free_unref_page+0x44/0x90 __online_page_free+0x84/0x110 online_pages_range+0xc0/0x150 walk_system_ram_range+0xc8/0x120 online_pages+0x280/0x5a0 memory_subsys_online+0x1b4/0x270 device_online+0xc0/0xf0 state_store+0xc0/0x180 dev_attr_store+0x3c/0x60 sysfs_kf_write+0x70/0xb0 kernfs_fop_write+0x10c/0x250 __vfs_write+0x48/0x240 vfs_write+0xd8/0x210 ksys_write+0x70/0x120 system_call+0x5c/0x70 Link: http://lkml.kernel.org/r/20190301220814.97339-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
82ede7ee38 |
mm/memcontrol.c: fix bad line in comment
Commit
|
||
|
0d3bd18a5e |
mm/cma.c: cma_declare_contiguous: correct err handling
In case cma_init_reserved_mem failed, need to free the memblock allocated by memblock_reserve or memblock_alloc_range. Quote Catalin's comments: https://lkml.org/lkml/2019/2/26/482 Kmemleak is supposed to work with the memblock_{alloc,free} pair and it ignores the memblock_reserve() as a memblock_alloc() implementation detail. It is, however, tolerant to memblock_free() being called on a sub-range or just a different range from a previous memblock_alloc(). So the original patch looks fine to me. FWIW: Link: http://lkml.kernel.org/r/20190227144631.16708-1-peng.fan@nxp.com Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0c81585499 |
mm/page_ext.c: fix an imbalance with kmemleak
After offlining a memory block, kmemleak scan will trigger a crash, as it encounters a page ext address that has already been freed during memory offlining. At the beginning in alloc_page_ext(), it calls kmemleak_alloc(), but it does not call kmemleak_free() in free_page_ext(). BUG: unable to handle kernel paging request at ffff888453d00000 PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800ffffbac2ff060 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15 Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017 RIP: 0010:scan_block+0xb5/0x290 Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c RSP: 0018:ffff8881ec57f8e0 EFLAGS: 00010082 RAX: 0000000000000000 RBX: ffff888453d00000 RCX: ffffffffa61e5a54 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888453d00000 RBP: ffff8881ec57f920 R08: fffffbfff4ed588d R09: fffffbfff4ed588c R10: fffffbfff4ed588c R11: ffffffffa76ac463 R12: dffffc0000000000 R13: ffff888453d00ff9 R14: ffff8881f80cef48 R15: ffff8881f80cef48 FS: 00007f6c0e3f8740(0000) GS:ffff8881f7680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff888453d00000 CR3: 00000001c4244003 CR4: 00000000001606a0 Call Trace: scan_gray_list+0x269/0x430 kmemleak_scan+0x5a8/0x10f0 kmemleak_write+0x541/0x6ca full_proxy_write+0xf8/0x190 __vfs_write+0xeb/0x980 vfs_write+0x15a/0x4f0 ksys_write+0xd2/0x1b0 __x64_sys_write+0x73/0xb0 do_syscall_64+0xeb/0xaaa entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f6c0dad73b8 Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55 RSP: 002b:00007ffd5b863cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f6c0dad73b8 RDX: 0000000000000005 RSI: 000055a9216e1710 RDI: 0000000000000001 RBP: 000055a9216e1710 R08: 000000000000000a R09: 00007ffd5b863840 R10: 000000000000000a R11: 0000000000000246 R12: 00007f6c0dda9780 R13: 0000000000000005 R14: 00007f6c0dda4740 R15: 0000000000000005 Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod efivarfs CR2: ffff888453d00000 ---[ end trace ccf646c7456717c5 ]--- Kernel panic - not syncing: Fatal exception Shutting down cpus with NMI Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- Link: http://lkml.kernel.org/r/20190227173147.75650-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5f438eee8f |
mm/compaction: pass pgdat to too_many_isolated() instead of zone
too_many_isolated() in mm/compaction.c looks only at node state, so it makes more sense to change argument to pgdat instead of zone. Link: http://lkml.kernel.org/r/20190228083329.31892-3-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f4b7e272b5 |
mm: remove zone_lru_lock() function, access ->lru_lock directly
We have common pattern to access lru_lock from a page pointer: zone_lru_lock(page_zone(page)) Which is silly, because it unfolds to this: &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock while we can simply do &NODE_DATA(page_to_nid(page))->lru_lock Remove zone_lru_lock() function, since it's only complicate things. Use 'page_pgdat(page)->lru_lock' pattern instead. [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()] Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a7ca12f9d9 |
mm/workingset: remove unused @mapping argument in workingset_eviction()
workingset_eviction() doesn't use and never did use the @mapping argument. Remove it. Link: http://lkml.kernel.org/r/20190228083329.31892-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
960087445c |
mm/swapfile.c: use struct_size() in kvzalloc()
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; size = sizeof(struct foo) + count * sizeof(struct boo); instance = kvzalloc(size, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL); Notice that, in this case, variable size is not necessary, hence it is removed. This code was detected with the help of Coccinelle. Link: http://lkml.kernel.org/r/20190221154622.GA19599@embeddedor Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5a7f1b2f2f |
mm/cma_debug.c: remove static scoped cma_debugfs_root
Currently cma_debugfs_root is static storage. That is unnecessary since it will be only used by next cma_debugfs_add_one(). We can just pass it to following calling to save thisspace. Also remove useless idx parameter. Link: http://lkml.kernel.org/r/20190221040130.8940-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5d3ee42f8f |
mm/shmem: make find_get_pages_range() work for huge page
find_get_pages_range() and find_get_pages_range_tag() already correctly increment reference count on head when seeing compound page, but they may still use page index from tail. Page index from tail is always zero, so these functions don't work on huge shmem. This hasn't been a problem because, AFAIK, nobody calls these functions on (huge) shmem. Fix them anyway just in case. Link: http://lkml.kernel.org/r/20190110030838.84446-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: "Darrick J . Wong" <darrick.wong@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
afa0011289 |
mm: unexport free_reserved_area
This function is only used by built-in code, which makes perfect sense given the purpose of it. Link: http://lkml.kernel.org/r/20190213174621.29297-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
daf3538ad5 |
mm,memory_hotplug: explicitly pass the head to isolate_huge_page
isolate_huge_page() expects we pass the head of hugetlb page to it: bool isolate_huge_page(...) { ... VM_BUG_ON_PAGE(!PageHead(page), page); ... } While I really cannot think of any situation where we end up with a non-head page between hands in do_migrate_range(), let us make sure the code is as sane as possible by explicitly passing the Head. Since we already got the pointer, it does not take us extra effort. Link: http://lkml.kernel.org/r/20190208090604.975-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f900482da5 |
mm/migrate.c: cleanup expected_page_refs()
Andrea has noted that page migration code propagates page_mapping(page) through the whole migration stack down to migrate_page() function so it seems stupid to then use page_mapping(page) in expected_page_refs() instead of passed down 'mapping' argument. I agree so let's make expected_page_refs() more in line with the rest of the migration stack. Link: http://lkml.kernel.org/r/20190207112314.24872-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a862f68a8b |
docs/core-api/mm: fix return value descriptions in mm/
Many kernel-doc comments in mm/ have the return value descriptions either misformatted or omitted at all which makes kernel-doc script unhappy: $ make V=1 htmldocs ... ./mm/util.c:36: info: Scanning doc for kstrdup ./mm/util.c:41: warning: No description found for return value of 'kstrdup' ./mm/util.c:57: info: Scanning doc for kstrdup_const ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const' ./mm/util.c:75: info: Scanning doc for kstrndup ./mm/util.c:83: warning: No description found for return value of 'kstrndup' ... Fixing the formatting and adding the missing return value descriptions eliminates ~100 such warnings. Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
92eac16819 |
docs/mm: vmalloc: re-indent kernel-doc comemnts
Some kernel-doc comments in mm/vmalloc.c have leading tab in indentation. This leads to excessive indentation in the generated HTML and to the inconsistency of its layout ([1] vs [2]). Besides, multi-line Note: sections are not handled properly with extra indentation. [1] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vm_map_ram [2] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vfree Link: http://lkml.kernel.org/r/1549549644-4903-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ce0725f78a |
numa: make "nr_online_nodes" unsigned int
Number of online NUMA nodes can't be negative as well. This doesn't save space as the variable is used only in 32-bit context, but do it anyway for consistency. Link: http://lkml.kernel.org/r/20190201223151.GB15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b9726c26dc |
numa: make "nr_node_ids" unsigned int
Number of NUMA nodes can't be negative. This saves a few bytes on x86_64: add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238) Function old new delta hv_synic_alloc.cold 88 110 +22 prealloc_shrinker 260 262 +2 bootstrap 249 251 +2 sched_init_numa 1566 1567 +1 show_slab_objects 778 777 -1 s_show 1201 1200 -1 kmem_cache_init 346 345 -1 __alloc_workqueue_key 1146 1145 -1 mem_cgroup_css_alloc 1614 1612 -2 __do_sys_swapon 4702 4699 -3 __list_lru_init 655 651 -4 nic_probe 2379 2374 -5 store_user_store 118 111 -7 red_zone_store 106 99 -7 poison_store 106 99 -7 wq_numa_init 348 338 -10 __kmem_cache_empty 75 65 -10 task_numa_free 186 173 -13 merge_across_nodes_store 351 336 -15 irq_create_affinity_masks 1261 1246 -15 do_numa_crng_init 343 321 -22 task_numa_fault 4760 4737 -23 swapfile_init 179 156 -23 hv_synic_alloc 536 492 -44 apply_wqattrs_prepare 746 695 -51 Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d342a0b386 |
mm,oom: don't kill global init via memory.oom.group
Since setting global init process to some memory cgroup is technically
possible, oom_kill_memcg_member() must check it.
Tasks in /test1 are going to be killed due to memory.oom.group set
Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main(int argc, char *argv[])
{
static char buffer[10485760];
static int pipe_fd[2] = { EOF, EOF };
unsigned int i;
int fd;
char buf[64] = { };
if (pipe(pipe_fd))
return 1;
if (chdir("/sys/fs/cgroup/"))
return 1;
fd = open("cgroup.subtree_control", O_WRONLY);
write(fd, "+memory", 7);
close(fd);
mkdir("test1", 0755);
fd = open("test1/memory.oom.group", O_WRONLY);
write(fd, "1", 1);
close(fd);
fd = open("test1/cgroup.procs", O_WRONLY);
write(fd, "1", 1);
snprintf(buf, sizeof(buf) - 1, "%d", getpid());
write(fd, buf, strlen(buf));
close(fd);
snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
fd = open("test1/memory.max", O_WRONLY);
write(fd, buf, strlen(buf));
close(fd);
for (i = 0; i < 10; i++)
if (fork() == 0) {
char c;
close(pipe_fd[1]);
read(pipe_fd[0], &c, 1);
memset(buffer, 0, sizeof(buffer));
sleep(3);
_exit(0);
}
close(pipe_fd[0]);
close(pipe_fd[1]);
sleep(3);
return 0;
}
[ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
[ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 37.062954][ T9185] Call Trace:
[ 37.063976][ T9185] dump_stack+0x67/0x95
[ 37.065263][ T9185] dump_header+0x51/0x570
[ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110
[ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70
[ 37.069967][ T9185] oom_kill_process+0x18d/0x210
[ 37.071515][ T9185] out_of_memory+0x11b/0x380
[ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0
[ 37.074601][ T9185] try_charge+0x790/0x820
[ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0
[ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30
[ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0
[ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070
[ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0
[ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0
[ 37.085181][ T9185] __do_page_fault+0x255/0x4c0
[ 37.086529][ T9185] do_page_fault+0x28/0x260
[ 37.087788][ T9185] ? page_fault+0x8/0x30
[ 37.088978][ T9185] page_fault+0x1e/0x30
[ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
[ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
[ 37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
[ 37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
[ 37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
[ 37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
[ 37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
[ 37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
[ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
[ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
[ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
[ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
[ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
[ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
[ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
[ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
[ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
[ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
[ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
[ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
[ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
[ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
[ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
[ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 37.361685][ T1] Call Trace:
[ 37.363239][ T1] dump_stack+0x67/0x95
[ 37.365010][ T1] panic+0xfc/0x2b0
[ 37.366853][ T1] do_exit+0xd55/0xd60
[ 37.368595][ T1] do_group_exit+0x47/0xc0
[ 37.370415][ T1] get_signal+0x32a/0x920
[ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70
[ 37.374596][ T1] do_signal+0x32/0x6e0
[ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b
[ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0
[ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b
[ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0
[ 37.384594][ T1] ? page_fault+0x8/0x30
[ 37.386453][ T1] retint_user+0x8/0x18
[ 37.388160][ T1] RIP: 0033:0x7f42c06974a8
[ 37.389922][ T1] Code: Bad RIP value.
[ 37.391788][ T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
[ 37.394075][ T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
[ 37.396963][ T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
[ 37.399550][ T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
[ 37.402334][ T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
[ 37.404890][ T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0
Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
Fixes:
|
||
|
c10d38cc8d |
mm, swap: bounds check swap_info array accesses to avoid NULL derefs
Dan Carpenter reports a potential NULL dereference in
get_swap_page_of_type:
Smatch complains that the NULL checks on "si" aren't consistent. This
seems like a real bug because we have not ensured that the type is
valid and so "si" can be NULL.
Add the missing check for NULL, taking care to use a read barrier to
ensure CPU1 observes CPU0's updates in the correct order:
CPU0 CPU1
alloc_swap_info() if (type >= nr_swapfiles)
swap_info[type] = p /* handle invalid entry */
smp_wmb() smp_rmb()
++nr_swapfiles p = swap_info[type]
Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
CPU0's write to swap_info[type] and read NULL from swap_info[type].
Ying Huang noticed other places in swapfile.c don't order these reads
properly. Introduce swap_type_to_swap_info to encourage correct usage.
Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
(see tools/memory-model/Documentation/explanation.txt).
This ordering need not be enforced in places where swap_lock is held
(e.g. si_swapinfo) because swap_lock serializes updates to nr_swapfiles
and the swap_info array.
Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
Fixes:
|
||
|
060f005f07 |
mm/vmscan.c: do not allocate duplicate stack variables in shrink_page_list()
On path shrink_inactive_list() ---> shrink_page_list() we allocate stack variables for the statistics twice. This is completely useless, and this just consumes stack much more, then we really need. The patch kills duplicate stack variables from shrink_page_list(), and this reduce stack usage and object file size significantly: Stack usage: Before: vmscan.c:1122:22:shrink_page_list 648 static After: vmscan.c:1122:22:shrink_page_list 616 static Size of vmscan.o: text data bss dec hex filename Before: 56866 4720 128 61714 f112 mm/vmscan.o After: 56770 4720 128 61618 f0b2 mm/vmscan.o Link: http://lkml.kernel.org/r/154894900030.5211.12104993874109647641.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2cee57d1b0 |
mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for the suitable KSM page,
but the KSM page might be locked for a while due to i.e. KSM page rmap
walk. Basically it is not a big deal since commit
|
||
|
1ff9e6e179 |
mm: memcontrol: expose THP events on a per-memcg basis
Currently THP allocation events data is fairly opaque, since you can only get it system-wide. This patch makes it easier to reason about transparent hugepage behaviour on a per-memcg basis. For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1, which is used for v1's rss_huge [sic]. This is reused here as it's fairly involved to untangle NR_ANON_THPS right now to make it per-memcg, since right now some of this is delegated to rmap before we have any memcg actually assigned to the page. It's a good idea to rework that, but let's leave untangling THP allocation for a future patch. [akpm@linux-foundation.org: fix build] [chris@chrisdown.name: fix memcontrol build when THP is disabled] Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2bb0f34fe3 |
mm: vmscan: do not iterate all mem cgroups for global direct reclaim
In current implementation, both kswapd and direct reclaim has to iterate all mem cgroups. It is not a problem before offline mem cgroups could be iterated. But, currently with iterating offline mem cgroups, it could be very time consuming. In our workloads, we saw over 400K mem cgroups accumulated in some cases, only a few hundred are online memcgs. Although kswapd could help out to reduce the number of memcgs, direct reclaim still get hit with iterating a number of offline memcgs in some cases. We experienced the responsiveness problems due to this occassionally. A simple test with pref shows it may take around 220ms to iterate 8K memcgs in direct reclaim: dd 13873 [011] 578.542919: vmscan:mm_vmscan_direct_reclaim_begin dd 13873 [011] 578.758689: vmscan:mm_vmscan_direct_reclaim_end So for 400K, it may take around 11 seconds to iterate all memcgs. Here just break the iteration once it reclaims enough pages as what memcg direct reclaim does. This may hurt the fairness among memcgs. But the cached iterator cookie could help to achieve the fairness more or less. Link: http://lkml.kernel.org/r/1548799877-10949-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ab3948f58f |
mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then add protection against making any "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow documentation in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week where we don't need to modify core VFS structures to get the same behavior of the seal. This solves several side-effects pointed by Andy. self-tests are provided in later patch to verify the expected semantics. [1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/ Thanks a lot to Andy for suggestions to improve code. Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9a4e9f3b2d |
mm: update get_user_pages_longterm to migrate pages allocated from CMA region
This patch updates get_user_pages_longterm to migrate pages allocated out of CMA region. This makes sure that we don't keep non-movable pages (due to page reference count) in the CMA area. This will be used by ppc64 in a later patch to avoid pinning pages in the CMA region. ppc64 uses CMA region for allocation of the hardware page table (hash page table) and not able to migrate pages out of CMA region results in page table allocation failures. One case where we hit this easy is when a guest using a VFIO passthrough device. VFIO locks all the guest's memory and if the guest memory is backed by CMA region, it becomes unmovable resulting in fragmenting the CMA and possibly preventing other guests from allocation a large enough hash page table. NOTE: We allocate the new page without using __GFP_THISNODE Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b56a2d8af9 |
mm: rid swapoff of quadratic complexity
This patch was initially posted by Kelley Nielsen. Reposting the patch with all review comments addressed and with minor modifications and optimizations. Also, folding in the fixes offered by Hugh Dickins and Huang Ying. Tests were rerun and commit message updated with new results. try_to_unuse() is of quadratic complexity, with a lot of wasted effort. It unuses swap entries one by one, potentially iterating over all the page tables for all the processes in the system for each one. This new proposed implementation of try_to_unuse simplifies its complexity to linear. It iterates over the system's mms once, unusing all the affected entries as it walks each set of page tables. It also makes similar changes to shmem_unuse. Improvement swapoff was called on a swap partition containing about 6G of data, in a VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted. Present implementation....about 1200M calls(8min, avg 80% cpu util). Prototype.................about 9.0K calls(3min, avg 5% cpu util). Details In shmem_unuse(), iterate over the shmem_swaplist and, for each shmem_inode_info that contains a swap entry, pass it to shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(), iterate over its associated xarray, and store the index and value of each swap entry in an array for passing to shmem_swapin_page() outside of the RCU critical section. In try_to_unuse(), instead of iterating over the entries in the type and unusing them one by one, perhaps walking all the page tables for all the processes for each one, iterate over the mmlist, making one pass. Pass each mm to unuse_mm() to begin its page table walk, and during the walk, unuse all the ptes that have backing store in the swap type received by try_to_unuse(). After the walk, check the type for orphaned swap entries with find_next_to_unuse(), and remove them from the swap cache. If find_next_to_unuse() starts over at the beginning of the type, repeat the check of the shmem_swaplist and the walk a maximum of three times. Change unuse_mm() and the intervening walk functions down to unuse_pte_range() to take the type as a parameter, and to iterate over their entire range, calling the next function down on every iteration. In unuse_pte_range(), make a swap entry from each pte in the range using the passed in type. If it has backing store in the type, call swapin_readahead() to retrieve the page and pass it to unuse_pte(). Pass the count of pages_to_unuse down the page table walks in try_to_unuse(), and return from the walk when the desired number of pages has been swapped back in. Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c5bf121e43 |
mm: refactor swap-in logic out of shmem_getpage_gfp
swapin logic can be reused independently without rest of the logic in shmem_getpage_gfp. So lets refactor it out as an independent function. Link: http://lkml.kernel.org/r/20190114153129.4852-1-vpillai@digitalocean.com Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a9e7c39fa9 |
mm/vmscan.c: remove 7th argument of isolate_lru_pages()
We may simply check for sc->may_unmap in isolate_lru_pages() instead of doing that in both of its callers. Link: http://lkml.kernel.org/r/154748280735.29962.15867846875217618569.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2e25644e8d |
mm, mempolicy: fix uninit memory access
Syzbot with KMSAN reports (excerpt): ================================================================== BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline] BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384 CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x173/0x1d0 lib/dump_stack.c:113 kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613 __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295 mpol_rebind_policy mm/mempolicy.c:353 [inline] mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384 update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120 update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline] update_nodemask kernel/cgroup/cpuset.c:1253 [inline] cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728 ... Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline] kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158 kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176 kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777 mpol_new mm/mempolicy.c:276 [inline] do_mbind mm/mempolicy.c:1180 [inline] kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347 __do_sys_mbind mm/mempolicy.c:1354 [inline] As it's difficult to report where exactly the uninit value resides in the mempolicy object, we have to guess a bit. mm/mempolicy.c:353 contains this part of mpol_rebind_policy(): if (!mpol_store_user_nodemask(pol) && nodes_equal(pol->w.cpuset_mems_allowed, *newmask)) "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't ever see being uninitialized after leaving mpol_new(). So I'll guess it's actually about accessing pol->w.cpuset_mems_allowed on line 354, but still part of statement starting on line 353. For w.cpuset_mems_allowed to be not initialized, and the nodes_equal() reachable for a mempolicy where mpol_set_nodemask() is called in do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy with empty set of nodes, i.e. MPOL_LOCAL equivalent, with MPOL_F_LOCAL flag. Let's exclude such policies from the nodes_equal() check. Note the uninit access should be benign anyway, as rebinding this kind of policy is always a no-op. Therefore no actual need for stable inclusion. Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7775face20 |
memcg: killed threads should not invoke memcg OOM killer
If a memory cgroup contains a single process with many threads (including different process group sharing the mm) then it is possible to trigger a race when the oom killer complains that there are no oom elible tasks and complain into the log which is both annoying and confusing because there is no actual problem. The race looks as follows: P1 oom_reaper P2 try_charge try_charge mem_cgroup_out_of_memory mutex_lock(oom_lock) out_of_memory oom_kill_process(P1,P2) wake_oom_reaper mutex_unlock(oom_lock) oom_reap_task mutex_lock(oom_lock) select_bad_process # no victim The problem is more visible with many threads. Fix this by checking for fatal_signal_pending from mem_cgroup_out_of_memory when the oom_lock is already held. The oom bypass is safe because we do the same early in the try_charge path already. The situation migh have changed in the mean time. It should be safe to check for fatal_signal_pending and tsk_is_oom_victim but for a better code readability abstract the current charge bypass condition into should_force_charge and reuse it from that path. " Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
23a7052a5d |
mm/page_alloc.c: check return value of memblock_alloc_node_nopanic()
There are two early memory allocations that use memblock_alloc_node_nopanic() and do not check its return value. While this happens very early during boot and chances that the allocation will fail are diminishing, it is still worth to have proper checks for the allocation errors. Link: http://lkml.kernel.org/r/1547734941-944-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
023bdd0023 |
mm/hugetlb: add prot_modify_start/commit sequence for hugetlb update
Architectures like ppc64 require to do a conditional tlb flush based on the old and new value of pte. Follow the regular pte change protection sequence for hugetlb too. This allows the architectures to override the update sequence. Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
04a8645304 |
mm: update ptep_modify_prot_commit to take old pte value as arg
Architectures like ppc64 require to do a conditional tlb flush based on the old and new value of pte. Enable that by passing old pte value as the arg. Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0cbe3e26ab |
mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg
Patch series "NestMMU pte upgrade workaround for mprotect", v5.
We can upgrade pte access (R -> RW transition) via mprotect. We need to
make sure we follow the recommended pte update sequence as outlined in
commit
|
||
|
8bb4e7a2ee |
mm: fix some typos in mm directory
No functional change. Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
8aa49762db |
mm/page_owner: move config option to mm/Kconfig.debug
Move the PAGE_OWNER option from submenu "Compile-time checks and compiler options" to dedicated submenu "Memory Debugging". Link: http://lkml.kernel.org/r/20190120024254.6270-1-changbin.du@gmail.com Signed-off-by: Changbin Du <changbin.du@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
43cca0b1c5 |
mm/mmap.c: remove some redundancy in arch_get_unmapped_area_topdown()
The variable 'addr' is redundant in arch_get_unmapped_area_topdown(), just use parameter 'addr0' directly. Then remove the const qualifier of the parameter, and change its name to 'addr'. And in according with other functions, remove the const qualifier of all other no-pointer parameters in function arch_get_unmapped_area_topdown(). Link: http://lkml.kernel.org/r/20190127041112.25599-1-nullptr.cpp@gmail.com Signed-off-by: Yang Fan <nullptr.cpp@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
bbbe480297 |
mm, oom: remove 'prefer children over parent' heuristic
Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d9f7979c92 |
mm: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0ee930e6ca |
mm/memory.c: prevent mapping typed pages to userspace
Pages which use page_type must never be mapped to userspace as it would destroy their page type. Add an explicit check for this instead of assuming that kernel drivers always get this right. Link: http://lkml.kernel.org/r/20190129053830.3749-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2d432cb709 |
mm: prevent mapping slab pages to userspace
It's never appropriate to map a page allocated by SLAB into userspace. A buggy device driver might try this, or an attacker might be able to find a way to make it happen. Christoph said: : Let's just fail the code. Currently this may work with SLUB. But SLAB : and SLOB overlay fields with mapcount. So you would have a corrupted page : struct if you mapped a slab page to user space. Link: http://lkml.kernel.org/r/20190125173827.2658-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
afd07389d3 |
mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
One of the vmalloc stress test case triggers the kernel BUG(): <snip> [60.562151] ------------[ cut here ]------------ [60.562154] kernel BUG at mm/vmalloc.c:512! [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161 [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390 <snip> it can happen due to big align request resulting in overflowing of calculated address, i.e. it becomes 0 after ALIGN()'s fixup. Fix it by checking if calculated address is within vstart/vend range. Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joel Fernandes <joelaf@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
677dc9731b |
mm, memcg: extract memcg maxable seq_file logic to seq_show_memcg_tunable
memcg has a significant number of files exposed to kernfs where their value is either exposed directly or is "max" in the case of PAGE_COUNTER_MAX. This patch makes this generic by providing a single function to do this work. In combination with the previous patch adding mem_cgroup_from_seq, this makes all of the seq_show feeder functions significantly more simple. Link: http://lkml.kernel.org/r/20190124194100.GA31425@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
aa9694bb78 |
mm, memcg: create mem_cgroup_from_seq
This is the start of a series of patches similar to my earlier DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm). There are a bunch of places we go from seq_file to mem_cgroup, which currently requires manually getting the css, then getting the mem_cgroup from the css. It's in enough places now that having mem_cgroup_from_seq makes sense (and also makes the next patch a bit nicer). Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5e1f0f098b |
mm, compaction: capture a page under direct compaction
Compaction is inherently race-prone as a suitable page freed during compaction can be allocated by any parallel task. This patch uses a capture_control structure to isolate a page immediately when it is freed by a direct compactor in the slow path of the page allocator. The intent is to avoid redundant scanning. 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%) Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%) Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%) Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%) Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%) Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%* Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%) Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%) Latency is only moderately affected but the devil is in the details. A closer examination indicates that base page fault latency is reduced but latency of huge pages is increased as it takes creater care to succeed. Part of the "problem" is that allocation success rates are close to 100% even when under pressure and compaction gets harder 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%) Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%) Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%) Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%) Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%) Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%) Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%) Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%) And scan rates are reduced as expected by 6% for the migration scanner and 29% for the free scanner indicating that there is less redundant work. Compaction migrate scanned 20815362 19573286 Compaction free scanned 16352612 11510663 [mgorman@techsingularity.net: remove redundant check] Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
e332f741a8 |
mm, compaction: be selective about what pageblocks to clear skip hints
Pageblock hints are cleared when compaction restarts or kswapd makes enough progress that it can sleep but it's over-eager in that the bit is cleared for migration sources with no LRU pages and migration targets with no free pages. As pageblock skip hint flushes are relatively rare and out-of-band with respect to kswapd, this patch makes a few more expensive checks to see if it's appropriate to even clear the bit. Every pageblock that is not cleared will avoid 512 pages being scanned unnecessarily on x86-64. The impact is variable with different workloads showing small differences in latency, success rates and scan rates. This is expected as clearing the hints is not that common but doing a small amount of work out-of-band to avoid a large amount of work in-band later is generally a good thing. Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> [cai@lca.pw: no stuck in __reset_isolation_pfn()] Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4fca9730c5 |
mm, compaction: sample pageblocks for free pages
Once fast searching finishes, there is a possibility that the linear scanner is scanning full blocks found by the fast scanner earlier. This patch uses an adaptive stride to sample pageblocks for free pages. The more consecutive full pageblocks encountered, the larger the stride until a pageblock with free pages is found. The scanners might meet slightly sooner but it is an acceptable risk given that the search of the free lists may still encounter the pages and adjust the cached PFN of the free scanner accordingly. 5.0.0-rc1 5.0.0-rc1 roundrobin-v3r17 samplefree-v3r17 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2752.37 ( 0.00%) 2729.95 ( 0.81%) Amean fault-both-5 4341.69 ( 0.00%) 4397.80 ( -1.29%) Amean fault-both-7 6308.75 ( 0.00%) 6097.61 ( 3.35%) Amean fault-both-12 10241.81 ( 0.00%) 9407.15 ( 8.15%) Amean fault-both-18 13736.09 ( 0.00%) 10857.63 * 20.96%* Amean fault-both-24 16853.95 ( 0.00%) 13323.24 * 20.95%* Amean fault-both-30 15862.61 ( 0.00%) 17345.44 ( -9.35%) Amean fault-both-32 18450.85 ( 0.00%) 16892.00 ( 8.45%) The latency is mildly improved offseting some overhead from earlier patches that are prerequisites for the rest of the series. However, a major impact is on the free scan rate with an 82% reduction. 5.0.0-rc1 5.0.0-rc1 roundrobin-v3r17 samplefree-v3r17 Compaction migrate scanned 21607271 20116887 Compaction free scanned 95336406 16668703 It's also the first time in the series where the number of pages scanned by the migration scanner is greater than the free scanner due to the increased search efficiency. Link: http://lkml.kernel.org/r/20190118175136.31341-21-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
dbe2d4e4f1 |
mm, compaction: round-robin the order while searching the free lists for a target
As compaction proceeds and creates high-order blocks, the free list search gets less efficient as the larger blocks are used as compaction targets. Eventually, the larger blocks will be behind the migration scanner for partially migrated pageblocks and the search fails. This patch round-robins what orders are searched so that larger blocks can be ignored and find smaller blocks that can be used as migration targets. The overall impact was small on 1-socket but it avoids corner cases where the migration/free scanners meet prematurely or situations where many of the pageblocks encountered by the free scanner are almost full instead of being properly packed. Previous testing had indicated that without this patch there were occasional large spikes in the free scanner without this patch. [dan.carpenter@oracle.com: fix static checker warning] Link: http://lkml.kernel.org/r/20190118175136.31341-20-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d097a6f635 |
mm, compaction: reduce premature advancement of the migration target scanner
The fast isolation of free pages allows the cached PFN of the free scanner to advance faster than necessary depending on the contents of the free list. The key is that fast_isolate_freepages() can update zone->compact_cached_free_pfn via isolate_freepages_block(). When the fast search fails, the linear scan can start from a point that has skipped valid migration targets, particularly pageblocks with just low-order free pages. This can cause the migration source/target scanners to meet prematurely causing a reset. This patch starts by avoiding an update of the pageblock skip information and cached PFN from isolate_freepages_block() and puts the responsibility of updating that information in the callers. The fast scanner will update the cached PFN if and only if it finds a block that is higher than the existing cached PFN and sets the skip if the pageblock is full or nearly full. The linear scanner will update skipped information and the cached PFN only when a block is completely scanned. The total impact is that the free scanner advances more slowly as it is primarily driven by the linear scanner instead of the fast search. 5.0.0-rc1 5.0.0-rc1 noresched-v3r17 slowfree-v3r17 Amean fault-both-3 2965.68 ( 0.00%) 3036.75 ( -2.40%) Amean fault-both-5 3995.90 ( 0.00%) 4522.24 * -13.17%* Amean fault-both-7 5842.12 ( 0.00%) 6365.35 ( -8.96%) Amean fault-both-12 9550.87 ( 0.00%) 10340.93 ( -8.27%) Amean fault-both-18 13304.72 ( 0.00%) 14732.46 ( -10.73%) Amean fault-both-24 14618.59 ( 0.00%) 16288.96 ( -11.43%) Amean fault-both-30 16650.96 ( 0.00%) 16346.21 ( 1.83%) Amean fault-both-32 17145.15 ( 0.00%) 19317.49 ( -12.67%) The impact to latency is higher than the last version but it appears to be due to a slight increase in the free scan rates which is a potential side-effect of the patch. However, this is necessary for later patches that are more careful about how pageblocks are treated as earlier iterations of those patches hit corner cases where the restarts were punishing and very visible. Link: http://lkml.kernel.org/r/20190118175136.31341-19-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
cf66f0700c |
mm, compaction: do not consider a need to reschedule as contention
Scanning on large machines can take a considerable length of time and eventually need to be rescheduled. This is treated as an abort event but that's not appropriate as the attempt is likely to be retried after making numerous checks and taking another cycle through the page allocator. This patch will check the need to reschedule if necessary but continue the scanning. The main benefit is reduced scanning when compaction is taking a long time or the machine is over-saturated. It also avoids an unnecessary exit of compaction that ends up being retried by the page allocator in the outer loop. 5.0.0-rc1 5.0.0-rc1 synccached-v3r16 noresched-v3r17 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2958.27 ( 0.00%) 2965.68 ( -0.25%) Amean fault-both-5 4091.90 ( 0.00%) 3995.90 ( 2.35%) Amean fault-both-7 5803.05 ( 0.00%) 5842.12 ( -0.67%) Amean fault-both-12 9481.06 ( 0.00%) 9550.87 ( -0.74%) Amean fault-both-18 14141.51 ( 0.00%) 13304.72 ( 5.92%) Amean fault-both-24 16438.00 ( 0.00%) 14618.59 ( 11.07%) Amean fault-both-30 17531.72 ( 0.00%) 16650.96 ( 5.02%) Amean fault-both-32 17101.96 ( 0.00%) 17145.15 ( -0.25%) Link: http://lkml.kernel.org/r/20190118175136.31341-18-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
cb810ad294 |
mm, compaction: rework compact_should_abort as compact_check_resched
With incremental changes, compact_should_abort no longer makes any documented sense. Rename to compact_check_resched and update the associated comments. There is no benefit other than reducing redundant code and making the intent slightly clearer. It could potentially be merged with earlier patches but it just makes the review slightly harder. Link: http://lkml.kernel.org/r/20190118175136.31341-17-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
8854c55f54 |
mm, compaction: keep cached migration PFNs synced for unusable pageblocks
Migrate has separate cached PFNs for ASYNC and SYNC* migration on the basis that some migrations will fail in ASYNC mode. However, if the cached PFNs match at the start of scanning and pageblocks are skipped due to having no isolation candidates, then the sync state does not matter. This patch keeps matching cached PFNs in sync until a pageblock with isolation candidates is found. The actual benefit is marginal given that the sync scanner following the async scanner will often skip a number of pageblocks but it's useless work. Any benefit depends heavily on whether the scanners restarted recently. Link: http://lkml.kernel.org/r/20190118175136.31341-16-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9bebefd590 |
mm, compaction: check early for huge pages encountered by the migration scanner
When scanning for sources or targets, PageCompound is checked for huge pages as they can be skipped quickly but it happens relatively late after a lot of setup and checking. This patch short-cuts the check to make it earlier. It might still change when the lock is acquired but this has less overhead overall. The free scanner advances but the migration scanner does not. Typically the free scanner encounters more movable blocks that change state over the lifetime of the system and also tends to scan more aggressively as it's actively filling its portion of the physical address space with data. This could change in the future but for the moment, this worked better in practice and incurred fewer scan restarts. The impact on latency and allocation success rates is marginal but the free scan rates are reduced by 15% and system CPU usage is reduced by 3.3%. The 2-socket results are not materially different. Link: http://lkml.kernel.org/r/20190118175136.31341-15-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
cb2dcaf023 |
mm, compaction: finish pageblock scanning on contention
Async migration aborts on spinlock contention but contention can be high when there are multiple compaction attempts and kswapd is active. The consequence is that the migration scanners move forward uselessly while still contending on locks for longer while leaving suitable migration sources behind. This patch will acquire the lock but track when contention occurs. When it does, the current pageblock will finish as compaction may succeed for that block and then abort. This will have a variable impact on latency as in some cases useless scanning is avoided (reduces latency) but a lock will be contended (increase latency) or a single contended pageblock is scanned that would otherwise have been skipped (increase latency). 5.0.0-rc1 5.0.0-rc1 norescan-v3r16 finishcontend-v3r16 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 3002.07 ( 0.00%) 3153.17 ( -5.03%) Amean fault-both-5 4684.47 ( 0.00%) 4280.52 ( 8.62%) Amean fault-both-7 6815.54 ( 0.00%) 5811.50 * 14.73%* Amean fault-both-12 10864.02 ( 0.00%) 9276.85 ( 14.61%) Amean fault-both-18 12247.52 ( 0.00%) 11032.67 ( 9.92%) Amean fault-both-24 15683.99 ( 0.00%) 14285.70 ( 8.92%) Amean fault-both-30 18620.02 ( 0.00%) 16293.76 * 12.49%* Amean fault-both-32 19250.28 ( 0.00%) 16721.02 * 13.14%* 5.0.0-rc1 5.0.0-rc1 norescan-v3r16 finishcontend-v3r16 Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%) Percentage huge-3 95.00 ( 0.00%) 96.82 ( 1.92%) Percentage huge-5 94.22 ( 0.00%) 95.40 ( 1.26%) Percentage huge-7 92.35 ( 0.00%) 95.92 ( 3.86%) Percentage huge-12 91.90 ( 0.00%) 96.73 ( 5.25%) Percentage huge-18 89.58 ( 0.00%) 96.77 ( 8.03%) Percentage huge-24 90.03 ( 0.00%) 96.05 ( 6.69%) Percentage huge-30 89.14 ( 0.00%) 96.81 ( 8.60%) Percentage huge-32 90.58 ( 0.00%) 97.41 ( 7.54%) There is a variable impact that is mostly good on latency while allocation success rates are slightly higher. System CPU usage is reduced by about 10% but scan rate impact is mixed Compaction migrate scanned 27997659.00 20148867 Compaction free scanned 120782791.00 118324914 Migration scan rates are reduced 28% which is expected as a pageblock is used by the async scanner instead of skipped. The impact on the free scanner is known to be variable. Overall the primary justification for this patch is that completing scanning of a pageblock is very important for later patches. [yuehaibing@huawei.com: fix unused variable warning] Link: http://lkml.kernel.org/r/20190118175136.31341-14-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: YueHaibing <yuehaibing@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
804d3121ba |
mm, compaction: avoid rescanning the same pageblock multiple times
Pageblocks are marked for skip when no pages are isolated after a scan. However, it's possible to hit corner cases where the migration scanner gets stuck near the boundary between the source and target scanner. Due to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are migrated can be reallocated before the pageblock is complete. The pageblock is not necessarily skipped so it can be rescanned multiple times. Similarly, a pageblock with some dirty/writeback pages may fail to migrate and be rescanned until writeback completes which is wasteful. This patch tracks if a pageblock is being rescanned. If so, then the entire pageblock will be migrated as one operation. This narrows the race window during which pages can be reallocated during migration. Secondly, if there are pages that cannot be isolated then the pageblock will still be fully scanned and marked for skipping. On the second rescan, the pageblock skip is set and the migration scanner makes progress. 5.0.0-rc1 5.0.0-rc1 findfree-v3r16 norescan-v3r16 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 3200.68 ( 0.00%) 3002.07 ( 6.21%) Amean fault-both-5 4847.75 ( 0.00%) 4684.47 ( 3.37%) Amean fault-both-7 6658.92 ( 0.00%) 6815.54 ( -2.35%) Amean fault-both-12 11077.62 ( 0.00%) 10864.02 ( 1.93%) Amean fault-both-18 12403.97 ( 0.00%) 12247.52 ( 1.26%) Amean fault-both-24 15607.10 ( 0.00%) 15683.99 ( -0.49%) Amean fault-both-30 18752.27 ( 0.00%) 18620.02 ( 0.71%) Amean fault-both-32 21207.54 ( 0.00%) 19250.28 * 9.23%* 5.0.0-rc1 5.0.0-rc1 findfree-v3r16 norescan-v3r16 Percentage huge-3 96.86 ( 0.00%) 95.00 ( -1.91%) Percentage huge-5 93.72 ( 0.00%) 94.22 ( 0.53%) Percentage huge-7 94.31 ( 0.00%) 92.35 ( -2.08%) Percentage huge-12 92.66 ( 0.00%) 91.90 ( -0.82%) Percentage huge-18 91.51 ( 0.00%) 89.58 ( -2.11%) Percentage huge-24 90.50 ( 0.00%) 90.03 ( -0.52%) Percentage huge-30 91.57 ( 0.00%) 89.14 ( -2.65%) Percentage huge-32 91.00 ( 0.00%) 90.58 ( -0.46%) Negligible difference but this was likely a case when the specific corner case was not hit. A previous run of the same patch based on an earlier iteration of the series showed large differences where migration rates could be halved when the corner case was hit. The specific corner case where migration scan rates go through the roof was due to a dirty/writeback pageblock located at the boundary of the migration/free scanner did not happen in this case. When it does happen, the scan rates multipled by massive margins. Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5a811889de |
mm, compaction: use free lists to quickly locate a migration target
Similar to the migration scanner, this patch uses the free lists to quickly locate a migration target. The search is different in that lower orders will be searched for a suitable high PFN if necessary but the search is still bound. This is justified on the grounds that the free scanner typically scans linearly much more than the migration scanner. If a free page is found, it is isolated and compaction continues if enough pages were isolated. For SYNC* scanning, the full pageblock is scanned for any remaining free pages so that is can be marked for skipping in the near future. 1-socket thpfioscale 5.0.0-rc1 5.0.0-rc1 isolmig-v3r15 findfree-v3r16 Amean fault-both-3 3024.41 ( 0.00%) 3200.68 ( -5.83%) Amean fault-both-5 4749.30 ( 0.00%) 4847.75 ( -2.07%) Amean fault-both-7 6454.95 ( 0.00%) 6658.92 ( -3.16%) Amean fault-both-12 10324.83 ( 0.00%) 11077.62 ( -7.29%) Amean fault-both-18 12896.82 ( 0.00%) 12403.97 ( 3.82%) Amean fault-both-24 13470.60 ( 0.00%) 15607.10 * -15.86%* Amean fault-both-30 17143.99 ( 0.00%) 18752.27 ( -9.38%) Amean fault-both-32 17743.91 ( 0.00%) 21207.54 * -19.52%* The impact on latency is variable but the search is optimistic and sensitive to the exact system state. Success rates are similar but the major impact is to the rate of scanning 5.0.0-rc1 5.0.0-rc1 isolmig-v3r15 findfree-v3r16 Compaction migrate scanned 25646769 29507205 Compaction free scanned 201558184 100359571 The free scan rates are reduced by 50%. The 2-socket reductions for the free scanner are more dramatic which is a likely reflection that the machine has more memory. [dan.carpenter@oracle.com: fix static checker warning] [vbabka@suse.cz: correct number of pages scanned for lower orders] Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
e380bebe47 |
mm, compaction: keep migration source private to a single compaction instance
Due to either a fast search of the free list or a linear scan, it is possible for multiple compaction instances to pick the same pageblock for migration. This is lucky for one scanner and increased scanning for all the others. It also allows a race between requests on which first allocates the resulting free block. This patch tests and updates the pageblock skip for the migration scanner carefully. When isolating a block, it will check and skip if the block is already in use. Once the zone lock is acquired, it will be rechecked so that only one scanner can set the pageblock skip for exclusive use. Any scanner contending will continue with a linear scan. The skip bit is still set if no pages can be isolated in a range. While this may result in redundant scanning, it avoids unnecessarily acquiring the zone lock when there are no suitable migration sources. 1-socket thpscale Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 3390.40 ( 0.00%) 3024.41 ( 10.80%) Amean fault-both-5 5082.28 ( 0.00%) 4749.30 ( 6.55%) Amean fault-both-7 7012.51 ( 0.00%) 6454.95 ( 7.95%) Amean fault-both-12 11346.63 ( 0.00%) 10324.83 ( 9.01%) Amean fault-both-18 15324.19 ( 0.00%) 12896.82 * 15.84%* Amean fault-both-24 16088.50 ( 0.00%) 13470.60 * 16.27%* Amean fault-both-30 18723.42 ( 0.00%) 17143.99 ( 8.44%) Amean fault-both-32 18612.01 ( 0.00%) 17743.91 ( 4.66%) 5.0.0-rc1 5.0.0-rc1 findmig-v3r15 isolmig-v3r15 Percentage huge-3 89.83 ( 0.00%) 92.96 ( 3.48%) Percentage huge-5 91.96 ( 0.00%) 93.26 ( 1.41%) Percentage huge-7 92.85 ( 0.00%) 93.63 ( 0.84%) Percentage huge-12 92.74 ( 0.00%) 92.80 ( 0.07%) Percentage huge-18 91.71 ( 0.00%) 91.62 ( -0.10%) Percentage huge-24 92.13 ( 0.00%) 91.50 ( -0.69%) Percentage huge-30 93.79 ( 0.00%) 92.73 ( -1.13%) Percentage huge-32 91.27 ( 0.00%) 91.94 ( 0.74%) This shows a reasonable reduction in latency as multiple compaction scanners do not operate on the same blocks with a similar allocation success rate. Compaction migrate scanned 41093126 25646769 Migration scan rates are reduced by 38%. Link: http://lkml.kernel.org/r/20190118175136.31341-11-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
70b44595ea |
mm, compaction: use free lists to quickly locate a migration source
The migration scanner is a linear scan of a zone with a potentiall large search space. Furthermore, many pageblocks are unusable such as those filled with reserved pages or partially filled with pages that cannot migrate. These still get scanned in the common case of allocating a THP and the cost accumulates. The patch uses a partial search of the free lists to locate a migration source candidate that is marked as MOVABLE when allocating a THP. It prefers picking a block with a larger number of free pages already on the basis that there are fewer pages to migrate to free the entire block. The lowest PFN found during searches is tracked as the basis of the start for the linear search after the first search of the free list fails. After the search, the free list is shuffled so that the next search will not encounter the same page. If the search fails then the subsequent searches will be shorter and the linear scanner is used. If this search fails, or if the request is for a small or unmovable/reclaimable allocation then the linear scanner is still used. It is somewhat pointless to use the list search in those cases. Small free pages must be used for the search and there is no guarantee that movable pages are located within that block that are contiguous. 5.0.0-rc1 5.0.0-rc1 noboost-v3r10 findmig-v3r15 Amean fault-both-3 3771.41 ( 0.00%) 3390.40 ( 10.10%) Amean fault-both-5 5409.05 ( 0.00%) 5082.28 ( 6.04%) Amean fault-both-7 7040.74 ( 0.00%) 7012.51 ( 0.40%) Amean fault-both-12 11887.35 ( 0.00%) 11346.63 ( 4.55%) Amean fault-both-18 16718.19 ( 0.00%) 15324.19 ( 8.34%) Amean fault-both-24 21157.19 ( 0.00%) 16088.50 * 23.96%* Amean fault-both-30 21175.92 ( 0.00%) 18723.42 * 11.58%* Amean fault-both-32 21339.03 ( 0.00%) 18612.01 * 12.78%* 5.0.0-rc1 5.0.0-rc1 noboost-v3r10 findmig-v3r15 Percentage huge-3 86.50 ( 0.00%) 89.83 ( 3.85%) Percentage huge-5 92.52 ( 0.00%) 91.96 ( -0.61%) Percentage huge-7 92.44 ( 0.00%) 92.85 ( 0.44%) Percentage huge-12 92.98 ( 0.00%) 92.74 ( -0.25%) Percentage huge-18 91.70 ( 0.00%) 91.71 ( 0.02%) Percentage huge-24 91.59 ( 0.00%) 92.13 ( 0.60%) Percentage huge-30 90.14 ( 0.00%) 93.79 ( 4.04%) Percentage huge-32 90.03 ( 0.00%) 91.27 ( 1.37%) This shows an improvement in allocation latencies with similar allocation success rates. While not presented, there was a 31% reduction in migration scanning and a 8% reduction on system CPU usage. A 2-socket machine showed similar benefits. [mgorman@techsingularity.net: several fixes] Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net [vbabka@suse.cz: migrate block that was found-fast, some optimisations] Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <Vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
fd1444b272 |
mm, compaction: ignore the fragmentation avoidance boost for isolation and compaction
When pageblocks get fragmented, watermarks are artifically boosted to reclaim pages to avoid further fragmentation events. However, compaction is often either fragmentation-neutral or moving movable pages away from unmovable/reclaimable pages. As the true watermarks are preserved, allow compaction to ignore the boost factor. The expected impact is very slight as the main benefit is that compaction is slightly more likely to succeed when the system has been fragmented very recently. On both 1-socket and 2-socket machines for THP-intensive allocation during fragmentation the success rate was increased by less than 1% which is marginal. However, detailed tracing indicated that failure of migration due to a premature ENOMEM triggered by watermark checks were eliminated. Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
efe771c760 |
mm, compaction: always finish scanning of a full pageblock
When compaction is finishing, it uses a flag to ensure the pageblock is complete but it makes sense to always complete migration of a pageblock. Minimally, skip information is based on a pageblock and partially scanned pageblocks may incur more scanning in the future. The pageblock skip handling also becomes more strict later in the series and the hint is more useful if a complete pageblock was always scanned. The potentially impacts latency as more scanning is done but it's not a consistent win or loss as the scanning is not always a high percentage of the pageblock and sometimes it is offset by future reductions in scanning. Hence, the results are not presented this time due to a misleading mix of gains/losses without any clear pattern. However, full scanning of the pageblock is important for later patches. Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
806031bb5e |
mm, migrate: immediately fail migration of a page with no migration handler
Pages with no migration handler use a fallback handler which sometimes works and sometimes persistently retries. A historical example was blockdev pages but there are others such as odd refcounting when page->private is used. These are retried multiple times which is wasteful during compaction so this patch will fail migration faster unless the caller specifies MIGRATE_SYNC. This is not expected to help THP allocation success rates but it did reduce latencies very slightly in some cases. 1-socket thpfioscale 4.20.0 4.20.0 noreserved-v2r15 failfast-v2r15 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 3839.67 ( 0.00%) 3833.72 ( 0.15%) Amean fault-both-5 5177.47 ( 0.00%) 4967.15 ( 4.06%) Amean fault-both-7 7245.03 ( 0.00%) 7139.19 ( 1.46%) Amean fault-both-12 11534.89 ( 0.00%) 11326.30 ( 1.81%) Amean fault-both-18 16241.10 ( 0.00%) 16270.70 ( -0.18%) Amean fault-both-24 19075.91 ( 0.00%) 19839.65 ( -4.00%) Amean fault-both-30 22712.11 ( 0.00%) 21707.05 ( 4.43%) Amean fault-both-32 21692.92 ( 0.00%) 21968.16 ( -1.27%) The 2-socket results are not materially different. Scan rates are similar as expected. Link: http://lkml.kernel.org/r/20190118175136.31341-7-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4469ab9847 |
mm, compaction: rename map_pages to split_map_pages
It's non-obvious that high-order free pages are split into order-0 pages from the function name. Fix it. Link: http://lkml.kernel.org/r/20190118175136.31341-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
40cacbcb32 |
mm, compaction: remove unnecessary zone parameter in some instances
A zone parameter is passed into a number of top-level compaction functions despite the fact that it's already in compact_control. This is harmless but it did need an audit to check if zone actually ever changes meaningfully. This patches removes the parameter in a number of top-level functions. The change could be much deeper but this was enough to briefly clarify the flow. No functional change. Link: http://lkml.kernel.org/r/20190118175136.31341-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
566e54e113 |
mm, compaction: remove last_migrated_pfn from compact_control
The last_migrated_pfn field is a bit dubious as to whether it really helps but either way, the information from it can be inferred without increasing the size of compact_control so remove the field. Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c5943b9c53 |
mm, compaction: rearrange compact_control
compact_control spans two cache lines with write-intensive lines on both. Rearrange so the most write-intensive fields are in the same cache line. This has a negligible impact on the overall performance of compaction and is more a tidying exercise than anything. Link: http://lkml.kernel.org/r/20190118175136.31341-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c5fbd937b6 |
mm, compaction: shrink compact_control
Patch series "Increase success rates and reduce latency of compaction", v3. This series reduces scan rates and success rates of compaction, primarily by using the free lists to shorten scans, better controlling of skip information and whether multiple scanners can target the same block and capturing pageblocks before being stolen by parallel requests. The series is based on mmotm from January 9th, 2019 with the previous compaction series reverted. I'm mostly using thpscale to measure the impact of the series. The benchmark creates a large file, maps it, faults it, punches holes in the mapping so that the virtual address space is fragmented and then tries to allocate THP. It re-executes for different numbers of threads. From a fragmentation perspective, the workload is relatively benign but it does stress compaction. The overall impact on latencies for a 1-socket machine is baseline patches Amean fault-both-3 3832.09 ( 0.00%) 2748.56 * 28.28%* Amean fault-both-5 4933.06 ( 0.00%) 4255.52 ( 13.73%) Amean fault-both-7 7017.75 ( 0.00%) 6586.93 ( 6.14%) Amean fault-both-12 11610.51 ( 0.00%) 9162.34 * 21.09%* Amean fault-both-18 17055.85 ( 0.00%) 11530.06 * 32.40%* Amean fault-both-24 19306.27 ( 0.00%) 17956.13 ( 6.99%) Amean fault-both-30 22516.49 ( 0.00%) 15686.47 * 30.33%* Amean fault-both-32 23442.93 ( 0.00%) 16564.83 * 29.34%* The allocation success rates are much improved baseline patches Percentage huge-3 85.99 ( 0.00%) 97.96 ( 13.92%) Percentage huge-5 88.27 ( 0.00%) 96.87 ( 9.74%) Percentage huge-7 85.87 ( 0.00%) 94.53 ( 10.09%) Percentage huge-12 82.38 ( 0.00%) 98.44 ( 19.49%) Percentage huge-18 83.29 ( 0.00%) 99.14 ( 19.04%) Percentage huge-24 81.41 ( 0.00%) 97.35 ( 19.57%) Percentage huge-30 80.98 ( 0.00%) 98.05 ( 21.08%) Percentage huge-32 80.53 ( 0.00%) 97.06 ( 20.53%) That's a nearly perfect allocation success rate. The biggest impact is on the scan rates Compaction migrate scanned 55893379 19341254 Compaction free scanned 474739990 11903963 The number of pages scanned for migration was reduced by 65% and the free scanner was reduced by 97.5%. So much less work in exchange for lower latency and better success rates. The series was also evaluated using a workload that heavily fragments memory but the benefits there are also significant, albeit not presented. It was commented that we should be rethinking scanning entirely and to a large extent I agree. However, to achieve that you need a lot of this series in place first so it's best to make the linear scanners as best as possible before ripping them out. This patch (of 22): The isolate and migrate scanners should never isolate more than a pageblock of pages so unsigned int is sufficient saving 8 bytes on a 64-bit build. Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
35f12f0f5c |
mm/filemap: pass inclusive 'end_byte' parameter to filemap_range_has_page
The 'end_byte' parameter of filemap_range_has_page is required to be
inclusive, so follow the rule.
Link: http://lkml.kernel.org/r/1548678679-18122-1-git-send-email-zhengbin13@huawei.com
Fixes:
|
||
|
e9f598730e |
mm: swap: add comment for swap_vma_readahead
swap_vma_readahead()'s comment is missing, just add it. Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Hugh Dickins <hughd@google.com Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
8fd2e0b505 |
mm: swap: check if swap backing device is congested or not
Swap readahead would read in a few pages regardless if the underlying device is busy or not. It may incur long waiting time if the device is congested, and it may also exacerbate the congestion. Use inode_read_congested() to check if the underlying device is busy or not like what file page readahead does. Get inode from swap_info_struct. Although we can add inode information in swap_address_space (address_space->host), it may lead some unexpected side effect, i.e. it may break mapping_cap_account_dirty(). Using inode from swap_info_struct seems simple and good enough. Just does the check in vma_cluster_readahead() since swap_vma_readahead() is just used for non-rotational device which much less likely has congestion than traditional HDD. Although swap slots may be consecutive on swap partition, it still may be fragmented on swap file. This check would help to reduce excessive stall for such case. The test with page_fault1 of will-it-scale (sometimes tracing may just show runtest.py that is the wrapper script of page_fault1), which basically launches NR_CPU threads to generate 128MB anonymous pages for each thread, on my virtual machine with congested HDD shows long tail latency is reduced significantly. Without the patch page_fault1_thr-1490 [023] 129.311706: funcgraph_entry: #57377.796 us | do_swap_page(); page_fault1_thr-1490 [023] 129.369103: funcgraph_entry: 5.642us | do_swap_page(); page_fault1_thr-1490 [023] 129.369119: funcgraph_entry: #1289.592 us | do_swap_page(); page_fault1_thr-1490 [023] 129.370411: funcgraph_entry: 4.957us | do_swap_page(); page_fault1_thr-1490 [023] 129.370419: funcgraph_entry: 1.940us | do_swap_page(); page_fault1_thr-1490 [023] 129.378847: funcgraph_entry: #1411.385 us | do_swap_page(); page_fault1_thr-1490 [023] 129.380262: funcgraph_entry: 3.916us | do_swap_page(); page_fault1_thr-1490 [023] 129.380275: funcgraph_entry: #4287.751 us | do_swap_page(); With the patch runtest.py-1417 [020] 301.925911: funcgraph_entry: #9870.146 us | do_swap_page(); runtest.py-1417 [020] 301.935785: funcgraph_entry: 9.802us | do_swap_page(); runtest.py-1417 [020] 301.935799: funcgraph_entry: 3.551us | do_swap_page(); runtest.py-1417 [020] 301.935806: funcgraph_entry: 2.142us | do_swap_page(); runtest.py-1417 [020] 301.935853: funcgraph_entry: 6.938us | do_swap_page(); runtest.py-1417 [020] 301.935864: funcgraph_entry: 3.765us | do_swap_page(); runtest.py-1417 [020] 301.935871: funcgraph_entry: 3.600us | do_swap_page(); runtest.py-1417 [020] 301.935878: funcgraph_entry: 7.202us | do_swap_page(); [akpm@linux-foundation.org: code cleanup] [yang.shi@linux.alibaba.com: add comment] Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Tim Chen <tim.c.chen@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Hugh Dickins <hughd@google.com Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
14ef1fc72a |
mm/filemap.c: remove redundant test from find_get_pages_contig
After we establish a reference on the page, we check the pointer
continues to be in the correct position in i_pages. Checking
page->index afterwards is unnecessary; if it were to change, then the
pointer to it from the page cache would also move. The check used to be
done before grabbing a reference on the page which was racy (see commit
|
||
|
67b8046f42 |
mm/memcontrol.c: use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Link: http://lkml.kernel.org/r/20190104183726.GA6374@embeddedor Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c52e75935f |
mm: remove extra drain pages on pcp list
In the current implementation, there are two places to isolate a range of page: __offline_pages() and alloc_contig_range(). During this procedure, it will drain pages on pcp list. Below is a brief call flow: __offline_pages()/alloc_contig_range() start_isolate_page_range() set_migratetype_isolate() drain_all_pages() drain_all_pages() <--- A This snippet shows the current logic is isolate and drain pcp list for each pageblock and drain pcp list again for the whole range. start_isolate_page_range is responsible for isolating the given pfn range. One part of that job is to make sure that also pages that are on the allocator pcp lists are properly isolated. Otherwise they could be reused and the range wouldn't be completely isolated until the memory is freed back. While there is no strict guarantee here because pages might get allocated at any time before drain_all_pages is called there doesn't seem to be any strong demand for such a guarantee. In any case, draining is already done at the isolation level and there is no need to do it again later by start_isolate_page_range callers (memory hotplug and CMA allocator currently). Therefore remove pointless draining in existing callers to make the code more clear and functionally correct. [mhocko@suse.com: provide a clearer changelog for the last two paragraphs] Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7ed2c31dab |
mm/hugetlb: distinguish between migratability and movability
Patch series "arm64/mm: Enable HugeTLB migration", v4. This patch series enables HugeTLB migration support for all supported huge page sizes at all levels including contiguous bit implementation. Following HugeTLB migration support matrix has been enabled with this patch series. All permutations have been tested except for the 16GB. CONT PTE PMD CONT PMD PUD -------- --- -------- --- 4K: 64K 2M 32M 1G 16K: 2M 32M 1G 64K: 2M 512M 16G First the series adds migration support for PUD based huge pages. It then adds a platform specific hook to query an architecture if a given huge page size is supported for migration while also providing a default fallback option preserving the existing semantics which just checks for (PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables HugeTLB migration on arm64 and subscribe to this new platform specific hook by defining an override. The second patch differentiates between movability and migratability aspects of huge pages and implements hugepage_movable_supported() which can then be used during allocation to decide whether to place the huge page in movable zone or not. This patch (of 5): During huge page allocation it's migratability is checked to determine if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE. But the movability aspect of the huge page could depend on other factors than just migratability. Movability in itself is a distinct property which should not be tied with migratability alone. This differentiates these two and implements an enhanced movability check which also considers huge page size to determine if it is feasible to be placed under a movable zone. At present it just checks for gigantic pages but going forward it can incorporate other enhanced checks. Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Steve Capper <steve.capper@arm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6b7e5cad65 |
mm: remove sysctl_extfrag_handler()
sysctl_extfrag_handler() neglects to propagate the return value from proc_dointvec_minmax() to its caller. It's a wrapper that doesn't need to exist, so just use proc_dointvec_minmax() directly. Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reported-by: Aditya Pakki <pakki001@umn.edu> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
153178edc7 |
vmalloc: export __vmalloc_node_range for CONFIG_TEST_VMALLOC_MODULE
Export __vmaloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE is enabled. Some test cases in vmalloc test suite module require and make use of that function. Please note, that it is not supposed to be used for other purposes. We need it only for performance analysis, stressing and stability check of vmalloc allocator. Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
bc84c53525 |
mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range()
vmalloc_user*() calls differ from normal vmalloc() only in that they set VM_USERMAP flags for the area. During the whole history of vmalloc.c changes now it is possible simply to pass VM_USERMAP flags directly to __vmalloc_node_range() call instead of finding the area (which obviously takes time) after the allocation. Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c67dc62475 |
mm/vmalloc: do not call kmemleak_free() on not yet accounted memory
__vmalloc_area_node() calls vfree() on error path, which in turn calls kmemleak_free(), but area is not yet accounted by kmemleak_vmalloc(). Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
401592d2e0 |
mm/vmalloc: fix size check for remap_vmalloc_range_partial()
When VM_NO_GUARD is not set area->size includes adjacent guard page, thus for correct size checking get_vm_area_size() should be used, but not area->size. This fixes possible kernel oops when userspace tries to mmap an area on 1 page bigger than was allocated by vmalloc_user() call: the size check inside remap_vmalloc_range_partial() accounts non-existing guard page also, so check successfully passes but vmalloc_to_page() returns NULL (guard page does not physically exist). The following code pattern example should trigger an oops: static int oops_mmap(struct file *file, struct vm_area_struct *vma) { void *mem; mem = vmalloc_user(4096); BUG_ON(!mem); /* Do not care about mem leak */ return remap_vmalloc_range(vma, mem, 0); } And userspace simply mmaps size + PAGE_SIZE: mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0); Possible candidates for oops which do not have any explicit size checks: *** drivers/media/usb/stkwebcam/stk-webcam.c: v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0); Or the following one: *** drivers/video/fbdev/core/fbmem.c static int fb_mmap(struct file *file, struct vm_area_struct * vma) ... res = fb->fb_mmap(info, vma); Where fb_mmap callback calls remap_vmalloc_range() directly without any explicit checks: *** drivers/video/fbdev/vfb.c static int vfb_mmap(struct fb_info *info, struct vm_area_struct *vma) { return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff); } Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5a82ac715d |
mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBA
This patch repeats the original one from David S Miller:
|
||
|
60cd4bcd62 |
memcg: localize memcg_kmem_enabled() check
Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge functions, so, the users don't have to explicitly check that condition. This is purely code cleanup patch without any functional change. Only the order of checks in memcg_charge_slab() can potentially be changed but the functionally it will be same. This should not matter as memcg_charge_slab() is not in the hot path. Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9234bae9b2 |
mm, slub: make the comment of put_cpu_partial() complete
There are two cases when put_cpu_partial() is invoked. * __slab_free * get_partial_node This patch just makes it cover these two cases. Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
52d1e606ee |
mm: reuse only-pte-mapped KSM page in do_wp_page()
Add an optimization for KSM pages almost in the same way that we have for ordinary anonymous pages. If there is a write fault in a page, which is mapped to an only pte, and it is not related to swap cache; the page may be reused without copying its content. [ Note that we do not consider PageSwapCache() pages at least for now, since we don't want to complicate __get_ksm_page(), which has nice optimization based on this (for the migration case). Currenly it is spinning on PageSwapCache() pages, waiting for when they have unfreezed counters (i.e., for the migration finish). But we don't want to make it also spinning on swap cache pages, which we try to reuse, since there is not a very high probability to reuse them. So, for now we do not consider PageSwapCache() pages at all. ] So in reuse_ksm_page() we check for 1) PageSwapCache() and 2) page_stable_node(), to skip a page, which KSM is currently trying to link to stable tree. Then we do page_ref_freeze() to prohibit KSM to merge one more page into the page, we are reusing. After that, nobody can refer to the reusing page: KSM skips !PageSwapCache() pages with zero refcount; and the protection against of all other participants is the same as for reused ordinary anon pages pte lock, page lock and mmap_sem. [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s] Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
98fa15f34c |
mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx] Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6ade20327d |
mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap()
find_vmap_area() can return a NULL pointer and we're going to
dereference it without checking it first. Use the existing
find_vm_area() function which does exactly what we want and checks for
the NULL pointer.
Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk
Fixes:
|
||
|
a9cd410a3d |
mm/page_alloc.c: memory hotplug: free pages as higher order
When freeing pages are done with higher order, time spent on coalescing pages by buddy allocator can be reduced. With section size of 256MB, hot add latency of a single section shows improvement from 50-60 ms to less than 1 ms, hence improving the hot add latency by 60 times. Modify external providers of online callback to align with the change. [arunks@codeaurora.org: v11] Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org [akpm@linux-foundation.org: remove unused local, per Arun] [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar] [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch] [arunks@codeaurora.org: v8] Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org [arunks@codeaurora.org: v9] Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
278d7756df |
mm/slub.c: remove an unused addr argument
"addr" function argument is not used in alloc_consistency_checks() at
all, so remove it.
Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw
Fixes:
|
||
|
92d1d07daa |
mm/slab.c: kmemleak no scan alien caches
Kmemleak throws endless warnings during boot due to in
__alloc_alien_cache(),
alc = kmalloc_node(memsize, gfp, node);
init_arraycache(&alc->ac, entries, batch);
kmemleak_no_scan(ac);
Kmemleak does not track the array cache (alc->ac) but the alien cache
(alc) instead, so let it track the latter by lifting kmemleak_no_scan()
out of init_arraycache().
There is another place that calls init_arraycache(), but
alloc_kmem_cache_cpus() uses the percpu allocation where will never be
considered as a leak.
kmemleak: Found object by alias at 0xffff8007b9aa7e38
CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
Call trace:
dump_backtrace+0x0/0x168
show_stack+0x24/0x30
dump_stack+0x88/0xb0
lookup_object+0x84/0xac
find_and_get_object+0x84/0xe4
kmemleak_no_scan+0x74/0xf4
setup_kmem_cache_node+0x2b4/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
ret_from_fork+0x10/0x18
kmemleak: Object 0xffff8007b9aa7e00 (size 256):
kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
kmemleak: min_count = 1
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
kmemleak_alloc+0x84/0xb8
kmem_cache_alloc_node_trace+0x31c/0x3a0
__kmalloc_node+0x58/0x78
setup_kmem_cache_node+0x26c/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
Call trace:
dump_backtrace+0x0/0x168
show_stack+0x24/0x30
dump_stack+0x88/0xb0
kmemleak_no_scan+0x90/0xf4
setup_kmem_cache_node+0x2b4/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
ret_from_fork+0x10/0x18
Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
Fixes:
|
||
|
edde82b6df |
mm/slub.c: freelist is ensured to be NULL when new_slab() fails
new_slab_objects() will return immediately if freelist is not NULL. if (freelist) return freelist; One more assignment operation could be avoided. Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cn Signed-off-by: Peng Wang <rocking@whu.edu.cn> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5c0198b6fb |
kasan: fix coccinelle warnings in kasan_p*_table
kasan_p4d_table(), kasan_pmd_table() and kasan_pud_table() are declared
as returning bool, but return 0 instead of false, which produces a
coccinelle warning. Fix it.
Link: http://lkml.kernel.org/r/1fa6fadf644859e8a6a8ecce258444b49be8c7ee.1551716733.git.andreyknvl@google.com
Fixes:
|
||
|
bcf6f55a0d |
kasan: fix kasan_check_read/write definitions
Building little-endian allmodconfig kernels on arm64 started failing with the generated atomic.h implementation, since we now try to call kasan helpers from the EFI stub: aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set': include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write' I suspect that we get similar problems in other files that explicitly disable KASAN for some reason but call atomic_t based helper functions. We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro that the compiler sets instead of checking CONFIG_KASAN, but this in turn requires a small hack in mm/kasan/common.c so we do see the extern declaration there instead of the inline function. Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de Fixes: b1864b828644 ("locking/atomics: build atomic headers as required") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Anders Roxell <anders.roxell@linaro.org> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4117992df6 |
page_poison: play nicely with KASAN
KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING). It triggers false positives in the allocation path: BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330 Read of size 8 at addr ffff88881f800000 by task swapper/0 CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54 Call Trace: dump_stack+0xe0/0x19a print_address_description.cold.2+0x9/0x28b kasan_report.cold.3+0x7a/0xb5 __asan_report_load8_noabort+0x19/0x20 memchr_inv+0x2ea/0x330 kernel_poison_pages+0x103/0x3d5 get_page_from_freelist+0x15e7/0x4d90 because KASAN has not yet unpoisoned the shadow page for allocation before it checks memchr_inv() but only found a stale poison pattern. Also, false positives in free path, BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5 Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1 CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55 Call Trace: dump_stack+0xe0/0x19a print_address_description.cold.2+0x9/0x28b kasan_report.cold.3+0x7a/0xb5 check_memory_region+0x22d/0x250 memset+0x28/0x40 kernel_poison_pages+0x29e/0x3d5 __free_pages_ok+0x75f/0x13e0 due to KASAN adds poisoned redzones around slab objects, but the page poisoning needs to poison the whole page. Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7771bdbbfd |
kasan: remove use after scope bugs detection.
Use after scope bugs detector seems to be almost entirely useless for the linux kernel. It exists over two years, but I've seen only one valid bug so far [1]. And the bug was fixed before it has been reported. There were some other use-after-scope reports, but they were false-positives due to different reasons like incompatibility with structleak plugin. This feature significantly increases stack usage, especially with GCC < 9 version, and causes a 32K stack overflow. It probably adds performance penalty too. Given all that, let's remove use-after-scope detector entirely. While preparing this patch I've noticed that we mistakenly enable use-after-scope detection for clang compiler regardless of CONFIG_KASAN_EXTRA setting. This is also fixed now. [1] http://lkml.kernel.org/r/<20171129052106.rhgbjhhis53hkgfn@wfg-t540p.sh.intel.com> Link: http://lkml.kernel.org/r/20190111185842.13978-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Will Deacon <will.deacon@arm.com> [arm64] Cc: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
46612b751c |
mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
When soft_offline_in_use_page() runs on a thp tail page after pmd is split, we trigger the following VM_BUG_ON_PAGE(): Memory failure: 0x3755ff: non anonymous thp __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000 Soft offlining pfn 0x34d805 at process virtual address 0x20fff000 page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1 flags: 0x2fffff80000000() raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) ------------[ cut here ]------------ kernel BUG at ./include/linux/mm.h:519! soft_offline_in_use_page() passed refcount and page lock from tail page to head page, which is not needed because we can pass any subpage to split_huge_page(). Naoya had fixed a similar issue in |
||
|
2d28e01dca |
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton: "2 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: hugetlbfs: fix races and page leaks during migration kasan: turn off asan-stack for clang-8 and earlier |
||
|
cb6acd01e2 |
hugetlbfs: fix races and page leaks during migration
hugetlb pages should only be migrated if they are 'active'. The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.
When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked. Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code. This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.
To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.
Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available. A program mmaps a 2G file in a hugetlbfs filesystem. It
then migrates the pages associated with the file from one node to
another. When the program exits, huge page counts are as follows:
node0
1024 free_hugepages
1024 nr_hugepages
node1
0 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
That is as expected. 2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem. If the file is then removed, the counts become:
node0
1024 free_hugepages
1024 nr_hugepages
node1
1024 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use. The only way to 'fix' the filesystem
accounting is to unmount the filesystem
If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field. At migration
time, this information is not preserved. To fix, simply transfer
page_private from old to new page at migration time if necessary.
There is a related race with removing a huge page from a file and
migration. When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page(). A page could be migrated
while in this state. However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.
To fix that, check for this condition before migrating a huge page. If
the condition is detected, return EBUSY for the page.
Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes:
|
||
|
2794129e90 |
mm/memory-hotplug: Allow memory resources to be children
The mm/resource.c code is used to manage the physical address space. The current resource configuration can be viewed in /proc/iomem. An example of this is at the bottom of this description. The nvdimm subsystem "owns" the physical address resources which map to persistent memory and has resources inserted for them as "Persistent Memory". The best way to repurpose this for volatile use is to leave the existing resource in place, but add a "System RAM" resource underneath it. This clearly communicates the ownership relationship of this memory. The request_resource_conflict() API only deals with the top-level resources. Replace it with __request_region() which will search for !IORESOURCE_BUSY areas lower in the resource tree than the top level. We *could* also simply truncate the existing top-level "Persistent Memory" resource and take over the released address space. But, this means that if we ever decide to hot-unplug the "RAM" and give it back, we need to recreate the original setup, which may mean going back to the BIOS tables. This should have no real effect on the existing collision detection because the areas that truly conflict should be marked IORESOURCE_BUSY. 00000000-00000fff : Reserved 00001000-0009fbff : System RAM 0009fc00-0009ffff : Reserved 000a0000-000bffff : PCI Bus 0000:00 000c0000-000c97ff : Video ROM 000c9800-000ca5ff : Adapter ROM 000f0000-000fffff : Reserved 000f0000-000fffff : System ROM 00100000-9fffffff : System RAM 01000000-01e071d0 : Kernel code 01e071d1-027dfdff : Kernel data 02dc6000-0305dfff : Kernel bss a0000000-afffffff : Persistent Memory (legacy) a0000000-a7ffffff : System RAM b0000000-bffdffff : System RAM bffe0000-bfffffff : Reserved c0000000-febfffff : PCI Bus 0000:00 Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying <ying.huang@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> |
||
|
b926b7f3ba |
mm/resource: Move HMM pr_debug() deeper into resource code
HMM consumes physical address space for its own use, even though nothing is mapped or accessible there. It uses a special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY) to uniquely identify these areas. When HMM consumes address space, it makes a best guess about what to consume. However, it is possible that a future memory or device hotplug can collide with the reserved area. In the case of these conflicts, there is an error message in register_memory_resource(). Later patches in this series move register_memory_resource() from using request_resource_conflict() to __request_region(). Unfortunately, __request_region() does not return the conflict like the previous function did, which makes it impossible to check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting resource. Instead of warning in register_memory_resource(), move the check into the core resource code itself (__request_region()) where the conflicting resource _is_ available. This has the added bonus of producing a warning in case of HMM conflicts with devices *or* RAM address space, as opposed to the RAM- only warnings that were there previously. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying <ying.huang@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> |
||
|
0a1d52994d |
mm: enforce min addr even if capable() in expand_downwards()
security_mmap_addr() does a capability check with current_cred(), but
we can reach this code from contexts like a VFS write handler where
current_cred() must not be used.
This can be abused on systems without SMAP to make NULL pointer
dereferences exploitable again.
Fixes:
|
||
|
1b046b445c |
percpu: km: no need to consider pcpu_group_offsets[0]
percpu-km is used on UP systems which only has one group, so the group offset will be always 0, there is no need to subtract pcpu_group_offsets[0] when assigning chunk->base_addr Signed-off-by: Peng Fan <peng.fan@nxp.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> |
||
|
29b00e6099 |
tmpfs: fix uninitialized return value in shmem_link
When we made the shmem_reserve_inode call in shmem_link conditional, we
forgot to update the declaration for ret so that it always has a known
value. Dan Carpenter pointed out this deficiency in the original patch.
Fixes:
|
||
|
53a41cb7ed |
Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses"
This reverts commit
|
||
|
2de7852fe9 |
percpu: use nr_groups as check condition
group_cnt array is defined with NR_CPUS entries, but normally nr_groups will not reach up to NR_CPUS. So there is no issue to the current code. Checking other parts of pcpu_build_alloc_info, use nr_groups as check condition, so make it consistent to use 'group < nr_groups' as for loop check. In case we do have nr_groups equals with NR_CPUS, we could also avoid memory access out of bounds. Signed-off-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> |
||
|
891cb2a72d |
mm, memory_hotplug: fix off-by-one in is_pageblock_removable
Rong Chen has reported the following boot crash: PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e4 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 RIP: 0010:page_mapping+0x12/0x80 Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da <48> 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48 RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202 RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000 RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13 R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000 R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001 FS: 00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0 Call Trace: __dump_page+0x14/0x2c0 is_mem_section_removable+0x24c/0x2c0 removable_show+0x87/0xa0 dev_attr_show+0x25/0x60 sysfs_kf_seq_show+0xba/0x110 seq_read+0x196/0x3f0 __vfs_read+0x34/0x180 vfs_read+0xa0/0x150 ksys_read+0x44/0xb0 do_syscall_64+0x5e/0x4a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe and bisected it down to commit |
||
|
6c8fcc096b |
mm: don't let userspace spam allocations warnings
memdump_user usually gets fed unchecked userspace input. Blasting a full backtrace into dmesg every time is a bit excessive - I'm not sure on the kernel rule in general, but at least in drm we're trying not to let unpriviledge userspace spam the logs freely. Definitely not entire warning backtraces. It also means more filtering for our CI, because our testsuite exercises these corner cases and so hits these a lot. Link: http://lkml.kernel.org/r/20190220204058.11676-1-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Roman Gushchin <guro@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Stancek <jstancek@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6373dca16c |
slub: fix a crash with SLUB_DEBUG + KASAN_SW_TAGS
In process_slab(), "p = get_freepointer()" could return a tagged pointer, but "addr = page_address()" always return a native pointer. As the result, slab_index() is messed up here, return (p - addr) / s->size; All other callers of slab_index() have the same situation where "addr" is from page_address(), so just need to untag "p". # cat /sys/kernel/slab/hugetlbfs_inode_cache/alloc_calls Unable to handle kernel paging request at virtual address 2bff808aa4856d48 Mem abort info: ESR = 0x96000007 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000007 CM = 0, WnR = 0 swapper pgtable: 64k pages, 48-bit VAs, pgdp = 0000000002498338 [2bff808aa4856d48] pgd=00000097fcfd0003, pud=00000097fcfd0003, pmd=00000097fca30003, pte=00e8008b24850712 Internal error: Oops: 96000007 [#1] SMP CPU: 3 PID: 79210 Comm: read_all Tainted: G L 5.0.0-rc7+ #84 Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 pstate: 00400089 (nzcv daIf +PAN -UAO) pc : get_map+0x78/0xec lr : get_map+0xa0/0xec sp : aeff808989e3f8e0 x29: aeff808989e3f940 x28: ffff800826200000 x27: ffff100012d47000 x26: 9700000000002500 x25: 0000000000000001 x24: 52ff8008200131f8 x23: 52ff8008200130a0 x22: 52ff800820013098 x21: ffff800826200000 x20: ffff100013172ba0 x19: 2bff808a8971bc00 x18: ffff1000148f5538 x17: 000000000000001b x16: 00000000000000ff x15: ffff1000148f5000 x14: 00000000000000d2 x13: 0000000000000001 x12: 0000000000000000 x11: 0000000020000002 x10: 2bff808aa4856d48 x9 : 0000020000000000 x8 : 68ff80082620ebb0 x7 : 0000000000000000 x6 : ffff1000105da1dc x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000010 x2 : 2bff808a8971bc00 x1 : ffff7fe002098800 x0 : ffff80082620ceb0 Process read_all (pid: 79210, stack limit = 0x00000000f65b9361) Call trace: get_map+0x78/0xec process_slab+0x7c/0x47c list_locations+0xb0/0x3c8 alloc_calls_show+0x34/0x40 slab_attr_show+0x34/0x48 sysfs_kf_seq_show+0x2e4/0x570 kernfs_seq_show+0x12c/0x1a0 seq_read+0x48c/0xf84 kernfs_fop_read+0xd4/0x448 __vfs_read+0x94/0x5d4 vfs_read+0xcc/0x194 ksys_read+0x6c/0xe8 __arm64_sys_read+0x68/0xb0 el0_svc_handler+0x230/0x3bc el0_svc+0x8/0xc Code: d3467d2a 9ac92329 8b0a0e6a f9800151 (c85f7d4b) ---[ end trace a383a9a44ff13176 ]--- Kernel panic - not syncing: Fatal exception SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 1-7,32,40,127 Kernel Offset: disabled CPU features: 0x002,20000c18 Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception ]--- Link: http://lkml.kernel.org/r/20190220020251.82039-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
557ea25383 |
kasan, slab: remove redundant kasan_slab_alloc hooks
kasan_slab_alloc() calls in kmem_cache_alloc() and kmem_cache_alloc_node() are redundant as they are already called via slab_alloc/slab_alloc_node()-> slab_post_alloc_hook()->kasan_slab_alloc(). Remove them. Link: http://lkml.kernel.org/r/4ca1655cdcfc4379c49c50f7bf80f81c4ad01485.1550602886.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
51dedad06b |
kasan, slab: make freelist stored without tags
Similarly to "kasan, slub: move kasan_poison_slab hook before page_address", move kasan_poison_slab() before alloc_slabmgmt(), which calls page_address(), to make page_address() return value to be non-tagged. This, combined with calling kasan_reset_tag() for off-slab slab management object, leads to freelist being stored non-tagged. Link: http://lkml.kernel.org/r/dfb53b44a4d00de3879a05a9f04c1f55e584f7a1.1550602886.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
219667c23c |
kasan, slab: fix conflicts with CONFIG_HARDENED_USERCOPY
Similarly to commit
|
||
|
dc15a8a254 |
kasan: prevent tracing of tags.c
Similarly to commit
|
||
|
3f41b60938 |
kasan: fix random seed generation for tag-based mode
There are two issues with assigning random percpu seeds right now: 1. We use for_each_possible_cpu() to iterate over cpus, but cpumask is not set up yet at the moment of kasan_init(), and thus we only set the seed for cpu #0. 2. A call to get_random_u32() always returns the same number and produces a message in dmesg, since the random subsystem is not yet initialized. Fix 1 by calling kasan_init_tags() after cpumask is set up. Fix 2 by using get_cycles() instead of get_random_u32(). This gives us lower quality random numbers, but it's good enough, as KASAN is meant to be used as a debugging tool and not a mitigation. Link: http://lkml.kernel.org/r/1f815cc914b61f3516ed4cc9bfd9eeca9bd5d9de.1550677973.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1062af920c |
tmpfs: fix link accounting when a tmpfile is linked in
tmpfs has a peculiarity of accounting hard links as if they were
separate inodes: so that when the number of inodes is limited, as it is
by default, a user cannot soak up an unlimited amount of unreclaimable
dcache memory just by repeatedly linking a file.
But when v3.11 added O_TMPFILE, and the ability to use linkat() on the
fd, we missed accommodating this new case in tmpfs: "df -i" shows that
an extra "inode" remains accounted after the file is unlinked and the fd
closed and the actual inode evicted. If a user repeatedly links
tmpfiles into a tmpfs, the limit will be hit (ENOSPC) even after they
are deleted.
Just skip the extra reservation from shmem_link() in this case: there's
a sense in which this first link of a tmpfile is then cheaper than a
hard link of another file, but the accounting works out, and there's
still good limiting, so no need to do anything more complicated.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182134370.7035@eggly.anvils
Fixes:
|
||
|
6ea183d60c |
mm: handle lru_add_drain_all for UP properly
Since for_each_cpu(cpu, mask) added by commit |
||
|
94b3334cbe |
mm, page_alloc: fix a division by zero error when boosting watermarks v2
Yury Norov reported that an arm64 KVM instance could not boot since after v5.0-rc1 and could addressed by reverting the patches |
||
|
311ade0eab |
mm/debug.c: fix __dump_page() for poisoned pages
Evaluating page_mapping() on a poisoned page ends up dereferencing junk
and making PF_POISONED_CHECK() considerably crashier than intended:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
Mem abort info:
ESR = 0x96000005
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000005
CM = 0, WnR = 0
user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000c2f6ac38
[0000000000000006] pgd=0000000000000000, pud=0000000000000000
Internal error: Oops: 96000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 491 Comm: bash Not tainted 5.0.0-rc1+ #1
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
pstate: 00000005 (nzcv daif -PAN -UAO)
pc : page_mapping+0x18/0x118
lr : __dump_page+0x1c/0x398
Process bash (pid: 491, stack limit = 0x000000004ebd4ecd)
Call trace:
page_mapping+0x18/0x118
__dump_page+0x1c/0x398
dump_page+0xc/0x18
remove_store+0xbc/0x120
dev_attr_store+0x18/0x28
sysfs_kf_write+0x40/0x50
kernfs_fop_write+0x130/0x1d8
__vfs_write+0x30/0x180
vfs_write+0xb4/0x1a0
ksys_write+0x60/0xd0
__arm64_sys_write+0x18/0x20
el0_svc_common+0x94/0xf8
el0_svc_handler+0x68/0x70
el0_svc+0x8/0xc
Code: f9400401 d1000422 f240003f 9a801040 (f9400402)
---[ end trace cdb5eb5bf435cecb ]---
Fix that by not inspecting the mapping until we've determined that it's
likely to be valid. Now the above condition still ends up stopping the
kernel, but in the correct manner:
page:ffffffbf20000000 is uninitialized and poisoned
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:1006!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 483 Comm: bash Not tainted 5.0.0-rc1+ #3
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : remove_store+0xbc/0x120
lr : remove_store+0xbc/0x120
...
Link: http://lkml.kernel.org/r/03b53ee9d7e76cda4b9b5e1e31eea080db033396.1550071778.git.robin.murphy@arm.com
Fixes:
|
||
|
338cfaad49 |
slub: fix SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS
Enabling SLUB_DEBUG's SLAB_CONSISTENCY_CHECKS with KASAN_SW_TAGS triggers endless false positives during boot below due to check_valid_pointer() checks tagged pointers which have no addresses that is valid within slab pages: BUG radix_tree_node (Tainted: G B ): Freelist Pointer check fails ----------------------------------------------------------------------------- INFO: Slab objects=69 used=69 fp=0x (null) flags=0x7ffffffc000200 INFO: Object @offset=15060037153926966016 fp=0x Redzone: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 18 6b 06 00 08 80 ff d0 .........k...... Object : 18 6b 06 00 08 80 ff d0 00 00 00 00 00 00 00 00 .k.............. Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Redzone: bb bb bb bb bb bb bb bb ........ Padding: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 5.0.0-rc5+ #18 Call trace: dump_backtrace+0x0/0x450 show_stack+0x20/0x2c __dump_stack+0x20/0x28 dump_stack+0xa0/0xfc print_trailer+0x1bc/0x1d0 object_err+0x40/0x50 alloc_debug_processing+0xf0/0x19c ___slab_alloc+0x554/0x704 kmem_cache_alloc+0x2f8/0x440 radix_tree_node_alloc+0x90/0x2fc idr_get_free+0x1e8/0x6d0 idr_alloc_u32+0x11c/0x2a4 idr_alloc+0x74/0xe0 worker_pool_assign_id+0x5c/0xbc workqueue_init_early+0x49c/0xd50 start_kernel+0x52c/0xac4 FIX radix_tree_node: Marking all objects used Link: http://lkml.kernel.org/r/20190209044128.3290-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d36a63a943 |
kasan, slub: fix more conflicts with CONFIG_SLAB_FREELIST_HARDENED
When CONFIG_KASAN_SW_TAGS is enabled, ptr_addr might be tagged. Normally, this doesn't cause any issues, as both set_freepointer() and get_freepointer() are called with a pointer with the same tag. However, there are some issues with CONFIG_SLUB_DEBUG code. For example, when __free_slub() iterates over objects in a cache, it passes untagged pointers to check_object(). check_object() in turns calls get_freepointer() with an untagged pointer, which causes the freepointer to be restored incorrectly. Add kasan_reset_tag to freelist_ptr(). Also add a detailed comment. Link: http://lkml.kernel.org/r/bf858f26ef32eb7bd24c665755b3aee4bc58d0e4.1550103861.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Tested-by: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
18e5066102 |
kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED
CONFIG_SLAB_FREELIST_HARDENED hashes freelist pointer with the address of the object where the pointer gets stored. With tag based KASAN we don't account for that when building freelist, as we call set_freepointer() with the first argument untagged. This patch changes the code to properly propagate tags throughout the loop. Link: http://lkml.kernel.org/r/3df171559c52201376f246bf7ce3184fe21c1dc7.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a710122428 |
kasan, slub: move kasan_poison_slab hook before page_address
With tag based KASAN page_address() looks at the page flags to see whether the resulting pointer needs to have a tag set. Since we don't want to set a tag when page_address() is called on SLAB pages, we call page_kasan_tag_reset() in kasan_poison_slab(). However in allocate_slab() page_address() is called before kasan_poison_slab(). Fix it by changing the order. [andreyknvl@google.com: fix compilation error when CONFIG_SLUB_DEBUG=n] Link: http://lkml.kernel.org/r/ac27cc0bbaeb414ed77bcd6671a877cf3546d56e.1550066133.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/cd895d627465a3f1c712647072d17f10883be2a1.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a2f775751d |
kmemleak: account for tagged pointers when calculating pointer range
kmemleak keeps two global variables, min_addr and max_addr, which store the range of valid (encountered by kmemleak) pointer values, which it later uses to speed up pointer lookup when scanning blocks. With tagged pointers this range will get bigger than it needs to be. This patch makes kmemleak untag pointers before saving them to min_addr and max_addr and when performing a lookup. Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Qian Cai <cai@lca.pw> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
53128245b4 |
kasan, kmemleak: pass tagged pointers to kmemleak
Right now we call kmemleak hooks before assigning tags to pointers in KASAN hooks. As a result, when an objects gets allocated, kmemleak sees a differently tagged pointer, compared to the one it sees when the object gets freed. Fix it by calling KASAN hooks before kmemleak's ones. Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
e1db95befb |
kasan: fix assigning tags twice
When an object is kmalloc()'ed, two hooks are called: kasan_slab_alloc() and kasan_kmalloc(). Right now we assign a tag twice, once in each of the hooks. Fix it by assigning a tag only in the former hook. Link: http://lkml.kernel.org/r/ce8c6431da735aa7ec051fd6497153df690eb021.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
050c17f239 |
numa: change get_mempolicy() to use nr_node_ids instead of MAX_NUMNODES
The system call, get_mempolicy() [1], passes an unsigned long *nodemask pointer and an unsigned long maxnode argument which specifies the length of the user's nodemask array in bits (which is rounded up). The manual page says that if the maxnode value is too small, get_mempolicy will return EINVAL but there is no system call to return this minimum value. To determine this value, some programs search /proc/<pid>/status for a line starting with "Mems_allowed:" and use the number of digits in the mask to determine the minimum value. A recent change to the way this line is formatted [2] causes these programs to compute a value less than MAX_NUMNODES so get_mempolicy() returns EINVAL. Change get_mempolicy(), the older compat version of get_mempolicy(), and the copy_nodes_to_user() function to use nr_node_ids instead of MAX_NUMNODES, thus preserving the defacto method of computing the minimum size for the nodemask array and the maxnode argument. [1] http://man7.org/linux/man-pages/man2/get_mempolicy.2.html [2] https://lore.kernel.org/lkml/1545405631-6808-1-git-send-email-longman@redhat.com Link: http://lkml.kernel.org/r/20190211180245.22295-1-rcampbell@nvidia.com Fixes: 4fb8e5b89bcbbbb ("include/linux/nodemask.h: use nr_node_ids (not MAX_NUMNODES) in __nodemask_pr_numnodes()") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
40e196a906 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller: 1) Fix suspend and resume in mt76x0u USB driver, from Stanislaw Gruszka. 2) Missing memory barriers in xsk, from Magnus Karlsson. 3) rhashtable fixes in mac80211 from Herbert Xu. 4) 32-bit MIPS eBPF JIT fixes from Paul Burton. 5) Fix for_each_netdev_feature() on big endian, from Hauke Mehrtens. 6) GSO validation fixes from Willem de Bruijn. 7) Endianness fix for dwmac4 timestamp handling, from Alexandre Torgue. 8) More strict checks in tcp_v4_err(), from Eric Dumazet. 9) af_alg_release should NULL out the sk after the sock_put(), from Mao Wenan. 10) Missing unlock in mac80211 mesh error path, from Wei Yongjun. 11) Missing device put in hns driver, from Salil Mehta. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits) sky2: Increase D3 delay again vhost: correctly check the return value of translate_desc() in log_used() net: netcp: Fix ethss driver probe issue net: hns: Fixes the missing put_device in positive leg for roce reset net: stmmac: Fix a race in EEE enable callback qed: Fix iWARP syn packet mac address validation. qed: Fix iWARP buffer size provided for syn packet processing. r8152: Add support for MAC address pass through on RTL8153-BD mac80211: mesh: fix missing unlock on error in table_path_del() net/mlx4_en: fix spelling mistake: "quiting" -> "quitting" net: crypto set sk to NULL when af_alg_release. net: Do not allocate page fragments that are not skb aligned mm: Use fixed constant in page_frag_alloc instead of size + 1 tcp: tcp_v4_err() should be more careful tcp: clear icsk_backoff in tcp_write_queue_purge() net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe() qmi_wwan: apply SET_DTR quirk to Sierra WP7607 net: stmmac: handle endianness in dwmac4_get_timestamp doc: Mention MSG_ZEROCOPY implementation for UDP mlxsw: __mlxsw_sp_port_headroom_set(): Fix a use of local variable ... |
||
|
357b4da50a |
x86: respect memory size limiting via mem= parameter
When limiting memory size via kernel parameter "mem=" this should be respected even in case of memory made accessible via a PCI card. Today this kind of memory won't be made usable in initial memory setup as the memory won't be visible in E820 map, but it might be added when adding PCI devices due to corresponding ACPI table entries. Not respecting "mem=" can be corrected by adding a global max_mem_size variable set by parse_memopt() which will result in rejecting adding memory areas resulting in a memory size above the allowed limit. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com> |
||
|
8644772637 |
mm: Use fixed constant in page_frag_alloc instead of size + 1
This patch replaces the size + 1 value introduced with the recent fix for 1
byte allocs with a constant value.
The idea here is to reduce code overhead as the previous logic would have
to read size into a register, then increment it, and write it back to
whatever field was being used. By using a constant we can avoid those
memory reads and arithmetic operations in favor of just encoding the
maximum value into the operation itself.
Fixes:
|
||
|
8a5b403d71 |
arm64, mm, efi: Account for GICv3 LPI tables in static memblock reserve table
In the irqchip and EFI code, we have what basically amounts to a quirk
to work around a peculiarity in the GICv3 architecture, which permits
the system memory address of LPI tables to be programmable only once
after a CPU reset. This means kexec kernels must use the same memory
as the first kernel, and thus ensure that this memory has not been
given out for other purposes by the time the ITS init code runs, which
is not very early for secondary CPUs.
On systems with many CPUs, these reservations could overflow the
memblock reservation table, and this was addressed in commit:
|
||
|
6e7bd3b549 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller: 1) Fix MAC address setting in mac80211 pmsr code, from Johannes Berg. 2) Probe SFP modules after being attached, from Russell King. 3) Byte ordering bug in SMC rx_curs_confirmed code, from Ursula Braun. 4) Revert some r8169 changes that are causing regressions, from Heiner Kallweit. 5) Fix spurious connection timeouts in netfilter nat code, from Florian Westphal. 6) SKB leak in tipc, from Hoang Le. 7) Short packet checkum issue in mlx4, similar to a previous mlx5 change, from Saeed Mahameed. The issue is that whilst padding bytes are usually zero, it is not guarateed and the hardware doesn't take the padding bytes into consideration when generating the checksum. 8) Fix various races in cls_tcindex, from Cong Wang. 9) Need to set stream ext to NULL before freeing in SCTP code, from Xin Long. 10) Fix locking in phy_is_started, from Heiner Kallweit. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits) net: ethernet: freescale: set FEC ethtool regs version net: hns: Fix object reference leaks in hns_dsaf_roce_reset() mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs net: phy: fix potential race in the phylib state machine net: phy: don't use locking in phy_is_started selftests: fix timestamping Makefile net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend() net: fix possible overflow in __sk_mem_raise_allocated() dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit net: phy: fix interrupt handling in non-started states sctp: set stream ext to NULL after freeing it in sctp_stream_outq_migrate sctp: call gso_reset_checksum when computing checksum in sctp_gso_segment net/mlx5e: XDP, fix redirect resources availability check net/mlx5: Fix a compilation warning in events.c net/mlx5: No command allowed when command interface is not ready net/mlx5e: Fix NULL pointer derefernce in set channels error flow netfilter: nft_compat: use-after-free when deleting targets team: avoid complex list operations in team_nl_cmd_options_set() net_sched: fix two more memory leaks in cls_tcindex net_sched: fix a memory leak in cls_tcindex ... |
||
|
2c2ade8174 |
mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum number of references that we might need to create in the fastpath later, the bump-allocation fastpath only has to modify the non-atomic bias value that tracks the number of extra references we hold instead of the atomic refcount. The maximum number of allocations we can serve (under the assumption that no allocation is made with size 0) is nc->size, so that's the bias used. However, even when all memory in the allocation has been given away, a reference to the page is still held; and in the `offset < 0` slowpath, the page may be reused if everyone else has dropped their references. This means that the necessary number of references is actually `nc->size+1`. Luckily, from a quick grep, it looks like the only path that can call page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which requires CAP_NET_ADMIN in the init namespace and is only intended to be used for kernel testing and fuzzing. To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the `offset < 0` path, below the virt_to_page() call, and then repeatedly call writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI, with a vector consisting of 15 elements containing 1 byte each. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
2f1ee0913c |
Revert "mm: use early_pfn_to_nid in page_ext_init"
This reverts commit |
||
|
414fd080d1 |
mm/gup: fix gup_pmd_range() for dax
For dax pmd, pmd_trans_huge() returns false but pmd_huge() returns true on x86. So the function works as long as hugetlb is configured. However, dax doesn't depend on hugetlb. Link: http://lkml.kernel.org/r/20190111034033.601-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Keith Busch <keith.busch@intel.com> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a9a238e83f |
Revert "mm: slowly shrink slabs with a relatively small number of objects"
This reverts commit |
||
|
ad8cfb9c42 |
mm/gup: Remove the 'write' parameter from gup_fast_permitted()
The 'write' parameter is unused in gup_fast_permitted() so remove it. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20190210223424.13934-1-ira.weiny@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
70f8a3ca68 |
mm: make mm->pinned_vm an atomic64 counter
Taking a sleeping lock to _only_ increment a variable is quite the overkill, and pretty much all users do this. Furthermore, some drivers (ie: infiniband and scif) that need pinned semantics can go to quite some trouble to actually delay via workqueue (un)accounting for pinned pages when not possible to acquire it. By making the counter atomic we no longer need to hold the mmap_sem and can simply some code around it for pinned_vm users. The counter is 64-bit such that we need not worry about overflows such as rdma user input controlled from userspace. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> |
||
|
e0a352fabc |
mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
We had a race in the old balloon compaction code before |
||
|
e3df4c6e48 |
mm, memory_hotplug: __offline_pages fix wrong locking
Jan has noticed that we do double unlock on some failure paths when
offlining a page range. This is indeed the case when
test_pages_in_a_zone respp. start_isolate_page_range fail. This was an
omission when forward porting the debugging patch from an older kernel.
Fix the issue by dropping mem_hotplug_done from the failure condition
and keeping the single unlock in the catch all failure path.
Link: http://lkml.kernel.org/r/20190115120307.22768-1-mhocko@kernel.org
Fixes:
|
||
|
6376360ecb |
mm: hwpoison: use do_send_sig_info() instead of force_sig()
Currently memory_failure() is racy against process's exiting, which
results in kernel crash by null pointer dereference.
The root cause is that memory_failure() uses force_sig() to forcibly
kill asynchronous (meaning not in the current context) processes. As
discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
fixes, this is not a right thing to do. OOM solves this issue by using
do_send_sig_info() as done in commit
|
||
|
0d0c8de878 |
kasan: mark file common so ftrace doesn't trace it
When option CONFIG_KASAN is enabled toghether with ftrace, function ftrace_graph_caller() gets in to a recursion, via functions kasan_check_read() and kasan_check_write(). Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 179 mcount_get_pc x0 // function's pc (gdb) bt #0 ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 #1 0xffffff90101406c8 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151 #2 0xffffff90106fd084 in kasan_check_write (p=0xffffffc06c170878, size=4) at ../mm/kasan/common.c:105 #3 0xffffff90104a2464 in atomic_add_return (v=<optimized out>, i=<optimized out>) at ./include/generated/atomic-instrumented.h:71 #4 atomic_inc_return (v=<optimized out>) at ./include/generated/atomic-fallback.h:284 #5 trace_graph_entry (trace=0xffffffc03f5ff380) at ../kernel/trace/trace_functions_graph.c:441 #6 0xffffff9010481774 in trace_graph_entry_watchdog (trace=<optimized out>) at ../kernel/trace/trace_selftest.c:741 #7 0xffffff90104a185c in function_graph_enter (ret=<optimized out>, func=<optimized out>, frame_pointer=18446743799894897728, retp=<optimized out>) at ../kernel/trace/trace_functions_graph.c:196 #8 0xffffff9010140628 in prepare_ftrace_return (self_addr=18446743592948977792, parent=0xffffffc03f5ff418, frame_pointer=18446743799894897728) at ../arch/arm64/kernel/ftrace.c:231 #9 0xffffff90101406f4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182 Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) Rework so that the kasan implementation isn't traced. Link: http://lkml.kernel.org/r/20181212183447.15890-1-anders.roxell@linaro.org Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
cefc7ef3c8 |
mm, oom: fix use-after-free in oom_kill_process
Syzbot instance running on upstream kernel found a use-after-free bug in
oom_kill_process. On further inspection it seems like the process
selected to be oom-killed has exited even before reaching
read_lock(&tasklist_lock) in oom_kill_process(). More specifically the
tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task()
and the put_task_struct within for_each_thread() frees the tsk and
for_each_thread() tries to access the tsk. The easiest fix is to do
get/put across the for_each_thread() on the selected task.
Now the next question is should we continue with the oom-kill as the
previously selected task has exited? However before adding more
complexity and heuristics, let's answer why we even look at the children
of oom-kill selected task? The select_bad_process() has already selected
the worst process in the system/memcg. Due to race, the selected
process might not be the worst at the kill time but does that matter?
The userspace can use the oom_score_adj interface to prefer children to
be killed before the parent. I looked at the history but it seems like
this is there before git history.
Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
Fixes:
|
||
|
eeb0efd071 |
mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
This is the same sort of error we saw in commit
|
||
|
24feb47c5f |
mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
If memory end is not aligned with the sparse memory section boundary, the mapping of such a section is only partly initialized. This may lead to VM_BUG_ON due to uninitialized struct pages access from test_pages_in_a_zone() function triggered by memory_hotplug sysfs handlers. Here are the the panic examples: CONFIG_DEBUG_VM_PGFLAGS=y kernel parameter mem=2050M -------------------------- page:000003d082008000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: test_pages_in_a_zone+0xde/0x160 show_valid_zones+0x5c/0x190 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: test_pages_in_a_zone+0xde/0x160 Kernel panic - not syncing: Fatal exception: panic_on_oops Fix this by checking whether the pfn to check is within the zone. [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com] Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org [mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com] Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
efad4e475c |
mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.
Mikhail Zaslonko has posted fixes for the two bugs quite some time ago
[1]. I have pushed back on those fixes because I believed that it is
much better to plug the problem at the initialization time rather than
play whack-a-mole all over the hotplug code and find all the places
which expect the full memory section to be initialized.
We have ended up with commit
|
||
|
9bcdeb51bd |
oom, oom_reaper: do not enqueue same task twice
Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit |
||
|
80409c65e2 |
mm: migrate: make buffer_migrate_page_norefs() actually succeed
Currently, buffer_migrate_page_norefs() was constantly failing because
buffer_migrate_lock_buffers() grabbed reference on each buffer. In
fact, there's no reason for buffer_migrate_lock_buffers() to grab any
buffer references as the page is locked during all our operation and
thus nobody can reclaim buffers from the page.
So remove grabbing of buffer references which also makes
buffer_migrate_page_norefs() succeed.
Link: http://lkml.kernel.org/r/20190116131217.7226-1-jack@suse.cz
Fixes:
|
||
|
1ac25013fb |
mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT
hugetlb needs the same fix as faultin_nopage (which was applied in commit |
||
|
1723058eab |
mm, memory_hotplug: don't bail out in do_migrate_range() prematurely
do_migrate_range() takes a memory range and tries to isolate the pages to put them into a list. This list will be later on used in migrate_pages() to know the pages we need to migrate. Currently, if we fail to isolate a single page, we put all already isolated pages back to their LRU and we bail out from the function. This is quite suboptimal, as this will force us to start over again because scan_movable_pages will give us the same range. If there is no chance that we can isolate that page, we will loop here forever. Issue debugged in [1] has proved that. During the debugging of that issue, it was noticed that if do_migrate_ranges() fails to isolate a single page, we will just discard the work we have done so far and bail out, which means that scan_movable_pages() will find again the same set of pages. Instead, we can just skip the error, keep isolating as much pages as possible and then proceed with the call to migrate_pages(). This will allow us to do as much work as possible at once. [1] https://lkml.org/lkml/2018/12/6/324 Michal said: : I still think that this doesn't give us a whole picture. Looping for : ever is a bug. Failing the isolation is quite possible and it should : be a ephemeral condition (e.g. a race with freeing the page or : somebody else isolating the page for whatever reason). And here comes : the disadvantage of the current implementation. We simply throw : everything on the floor just because of a ephemeral condition. The : racy page_count check is quite dubious to prevent from that. Link: http://lkml.kernel.org/r/20181211135312.27034-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4aa9fc2a43 |
Revert "mm, memory_hotplug: initialize struct pages for the full memory section"
This reverts commit |
||
|
6b8f915916 |
for-linus-20190125
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxLdgsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoVGD/4sGYQqfiXogQIJYbPH2RRPrJuLIIITjiAv lPXX1wx/tvz/ktwKJE2OiWTES0JjdH1HlC+7/0L/fLb8DXiKBmFuUHlwhureoL9Y o//BIQKuaje35kyHITTy2UAJOOqnNtJTaAP2AfkL+eOcj/V/G5rJIfLGs9QtuAR7 sRJ+uhg1EbW/+uO0bULDmG6WUjxFu8mqcw3i6g0VVLVOnXB2EKcZTl3KPrdAXrUp XtmouERga6OfAUSJyZmTSV136mL+opRB2WFFVeIzjQfLmyItDGbSX/YPS8oJ2pow v7630F+CMrd4aKpqqtnAhfWpGqd0Xw7cYfZ9MKTJmZPmGzf9a1fQFpmgZosD4Dh3 7MrhboU4TUt9PdXESA7CmE7LkTp99ghfj5/ysKrSV5h3HsH2RbLxJk91Rx3vmAWD u1xWRYL+GYLH6ZwOLvM1iqBrrLN3mUyrx98SaMgoXuqNzmQmgz9LPeA0Gt09FJbo uj+ebg4dRwuThjni4xQhl3zL2RQy7nlTDFDdKOz/XoiYk2NUVksss+sxGjNarHj0 b5pCD4HOp57OreGExaOARpBRah5HSNdQtBRsIOnbyEq6f/e1LsIY23Z9nNF0deGO sZzgsbnsn+zg8bC6T/Gk4UY6XdCcgaS3SL04SVKAE3lO6A4C/Awo8DgD9Bl1zpC1 HQlNkl5fBg== =iucY -----END PGP SIGNATURE----- Merge tag 'for-linus-20190125' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A collection of fixes for this release. This contains: - Silence sparse rightfully complaining about non-static wbt functions (Bart) - Fixes for the zoned comments/ioctl documentation (Damien) - direct-io fix that's been lingering for a while (Ernesto) - cgroup writeback fix (Tejun) - Set of NVMe patches for nvme-rdma/tcp (Sagi, Hannes, Raju) - Block recursion tracking fix (Ming) - Fix debugfs command flag naming for a few flags (Jianchao)" * tag 'for-linus-20190125' of git://git.kernel.dk/linux-block: block: Fix comment typo uapi: fix ioctl documentation blk-wbt: Declare local functions static blk-mq: fix the cmd_flag_name array nvme-multipath: drop optimization for static ANA group IDs nvmet-rdma: fix null dereference under heavy load nvme-rdma: rework queue maps handling nvme-tcp: fix timeout handler nvme-rdma: fix timeout handler writeback: synchronize sync(2) against cgroup writeback membership switches block: cover another queue enter recursion via BIO_QUEUE_ENTERED direct-io: allow direct writes to empty inodes |
||
|
30bac164ac |
Revert "Change mincore() to count "mapped" pages rather than "cached" pages"
This reverts commit
|
||
|
7fc5854f8c |
writeback: synchronize sync(2) against cgroup writeback membership switches
sync_inodes_sb() can race against cgwb (cgroup writeback) membership switches and fail to writeback some inodes. For example, if an inode switches to another wb while sync_inodes_sb() is in progress, the new wb might not be visible to bdi_split_work_to_wbs() at all or the inode might jump from a wb which hasn't issued writebacks yet to one which already has. This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb switch path against sync_inodes_sb() so that sync_inodes_sb() is guaranteed to see all the target wbs and inodes can't jump wbs to escape syncing. v2: Fixed misplaced rwsem init. Spotted by Jiufei. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jiufei Xue <xuejiufei@gmail.com> Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
ba42273131 |
mm/mmu_notifier: mm/rmap.c: Fix a mmu_notifier range bug in try_to_unmap_one
The conversion to use a structure for mmu_notifier_invalidate_range_*()
unintentionally changed the usage in try_to_unmap_one() to init the
'struct mmu_notifier_range' with vma->vm_start instead of @address,
i.e. it invalidates the wrong address range. Revert to the correct
address range.
Manifests as KVM use-after-free WARNINGs and subsequent "BUG: Bad page
state in process X" errors when reclaiming from a KVM guest due to KVM
removing the wrong pages from its own mappings.
Reported-by: leozinho29_eu@hotmail.com
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Pankaj gupta <pagupta@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes:
|
||
|
73444bc4d8 |
mm, page_alloc: do not wake kswapd with zone lock held
syzbot reported the following regression in the latest merge window and
it was confirmed by Qian Cai that a similar bug was visible from a
different context.
======================================================
WARNING: possible circular locking dependency detected
4.20.0+ #297 Not tainted
------------------------------------------------------
syz-executor0/8529 is trying to acquire lock:
000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
__wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120
but task is already holding lock:
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
include/linux/spinlock.h:329 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
mm/page_alloc.c:2548 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
mm/page_alloc.c:3021 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
mm/page_alloc.c:3050 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
mm/page_alloc.c:3072 [inline]
000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491
It appears to be a false positive in that the only way the lock ordering
should be inverted is if kswapd is waking itself and the wakeup
allocates debugging objects which should already be allocated if it's
kswapd doing the waking. Nevertheless, the possibility exists and so
it's best to avoid the problem.
This patch flags a zone as needing a kswapd using the, surprisingly,
unused zone flag field. The flag is read without the lock held to do
the wakeup. It's possible that the flag setting context is not the same
as the flag clearing context or for small races to occur. However, each
race possibility is harmless and there is no visible degredation in
fragmentation treatment.
While zone->flag could have continued to be unused, there is potential
for moving some existing fields into the flags field instead.
Particularly read-mostly ones like zone->initialized and
zone->contiguous.
Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
Fixes:
|
||
|
ddeaab32a8 |
hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"
This reverts
|
||
|
e7c5809779 |
hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race"
This reverts
|
||
|
8ab88c7169 |
mm: page_mapped: don't assume compound page is huge or THP
LTP proc01 testcase has been observed to rarely trigger crashes
on arm64:
page_mapped+0x78/0xb4
stable_page_flags+0x27c/0x338
kpageflags_read+0xfc/0x164
proc_reg_read+0x7c/0xb8
__vfs_read+0x58/0x178
vfs_read+0x90/0x14c
SyS_read+0x60/0xc0
The issue is that page_mapped() assumes that if compound page is not
huge, then it must be THP. But if this is 'normal' compound page
(COMPOUND_PAGE_DTOR), then following loop can keep running (for
HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
mapped and triggers a panic:
for (i = 0; i < hpage_nr_pages(page); i++) {
if (atomic_read(&page[i]._mapcount) >= 0)
return true;
}
I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
with a custom kernel module [1] which:
- allocates compound page (PAGEC) of order 1
- allocates 2 normal pages (COPY), which are initialized to 0xff (to
satisfy _mapcount >= 0)
- 2 PAGEC page structs are copied to address of first COPY page
- second page of COPY is marked as not present
- call to page_mapped(COPY) now triggers fault on access to 2nd COPY
page at offset 0x30 (_mapcount)
[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c
Fix the loop to iterate for "1 << compound_order" pages.
Kirrill said "IIRC, sound subsystem can producuce custom mapped compound
pages".
Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.1543577156.git.jstancek@redhat.com
Fixes:
|
||
|
1ed7293ac4 |
mm/memory.c: initialise mmu_notifier_range correctly
One of the paths in follow_pte_pmd() initialised the mmu_notifier_range
incorrectly.
Link: http://lkml.kernel.org/r/20190103002126.GM6310@bombadil.infradead.org
Fixes:
|
||
|
a3fe7cdf02 |
kasan: fix krealloc handling for tag-based mode
Right now tag-based KASAN can retag the memory that is reallocated via krealloc and return a differently tagged pointer even if the same slab object gets used and no reallocated technically happens. There are a few issues with this approach. One is that krealloc callers can't rely on comparing the return value with the passed argument to check whether reallocation happened. Another is that if a caller knows that no reallocation happened, that it can access object memory through the old pointer, which leads to false positives. Look at nf_ct_ext_add() to see an example. Fix this by keeping the same tag if the memory don't actually gets reallocated during krealloc. Link: http://lkml.kernel.org/r/bb2a71d17ed072bcc528cbee46fcbd71a6da3be4.1546540962.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
96fedce27e |
kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY
With CONFIG_HARDENED_USERCOPY enabled __check_heap_object() compares and then subtracts a potentially tagged pointer with a non-tagged address of the page that this pointer belongs to, which leads to unexpected behavior. Untag the pointer in __check_heap_object() before doing any of these operations. Link: http://lkml.kernel.org/r/7e756a298d514c4482f52aea6151db34818d395d.1546540962.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
eb214f2dda |
kasan, arm64: use ARCH_SLAB_MINALIGN instead of manual aligning
Instead of changing cache->align to be aligned to KASAN_SHADOW_SCALE_SIZE in kasan_cache_create() we can reuse the ARCH_SLAB_MINALIGN macro. Link: http://lkml.kernel.org/r/52ddd881916bcc153a9924c154daacde78522227.1546540962.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
63f3655f95 |
mm, memcg: fix reclaim deadlock with writeback
Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback
task1:
wait_on_page_bit+0x82/0xa0
shrink_page_list+0x907/0x960
shrink_inactive_list+0x2c7/0x680
shrink_node_memcg+0x404/0x830
shrink_node+0xd8/0x300
do_try_to_free_pages+0x10d/0x330
try_to_free_mem_cgroup_pages+0xd5/0x1b0
try_charge+0x14d/0x720
memcg_kmem_charge_memcg+0x3c/0xa0
memcg_kmem_charge+0x7e/0xd0
__alloc_pages_nodemask+0x178/0x260
alloc_pages_current+0x95/0x140
pte_alloc_one+0x17/0x40
__pte_alloc+0x1e/0x110
alloc_set_pte+0x5fe/0xc20
do_fault+0x103/0x970
handle_mm_fault+0x61e/0xd10
__do_page_fault+0x252/0x4d0
do_page_fault+0x30/0x80
page_fault+0x28/0x30
task2:
__lock_page+0x86/0xa0
mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
ext4_writepages+0x479/0xd60
do_writepages+0x1e/0x30
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x272/0x600
__writeback_inodes_wb+0x92/0xc0
wb_writeback+0x268/0x300
wb_workfn+0xb4/0x390
process_one_work+0x189/0x420
worker_thread+0x4e/0x4b0
kthread+0xe6/0x100
ret_from_fork+0x41/0x50
He adds
"task1 is waiting for the PageWriteback bit of the page that task2 has
collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
LOCKED bit the page which tasks1 has locked"
More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a
memory limit reclaim and the memcg reclaim for legacy controller is
waiting on the writeback but that is never going to finish because the
writeback itself is waiting for the page locked in the #PF path. So
this is essentially ABBA deadlock:
lock_page(A)
SetPageWriteback(A)
unlock_page(A)
lock_page(B)
lock_page(B)
pte_alloc_pne
shrink_page_list
wait_on_page_writeback(A)
SetPageWriteback(B)
unlock_page(B)
# flush A, B to clear the writeback
This accumulating of more pages to flush is used by several filesystems
to generate a more optimal IO patterns.
Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We have that handy
infrastructure for that already so simply reuse the fault-around pattern
which already does this.
There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare. I am not
aware of a better solution unfortunately.
[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes:
|
||
|
7bff3c0699 |
mm/usercopy.c: no check page span for stack objects
It is easy to trigger this with CONFIG_HARDENED_USERCOPY_PAGESPAN=y, usercopy: Kernel memory overwrite attempt detected to spans multiple pages (offset 0, size 23)! kernel BUG at mm/usercopy.c:102! For example, print_worker_info char name[WQ_NAME_LEN] = { }; char desc[WORKER_DESC_LEN] = { }; probe_kernel_read(name, wq->name, sizeof(name) - 1); probe_kernel_read(desc, worker->desc, sizeof(desc) - 1); __copy_from_user_inatomic check_object_size check_heap_object check_page_span This is because on-stack variables could cross PAGE_SIZE boundary, and failed this check, if (likely(((unsigned long)ptr & (unsigned long)PAGE_MASK) == ((unsigned long)end & (unsigned long)PAGE_MASK))) ptr = FFFF889007D7EFF8 end = FFFF889007D7F00E Hence, fix it by checking if it is a stack object first. [keescook@chromium.org: improve comments after reorder] Link: http://lkml.kernel.org/r/20190103165151.GA32845@beast Link: http://lkml.kernel.org/r/20181231030254.99441-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
09c2e76ed7 |
slab: alien caches must not be initialized if the allocation of the alien cache failed
Callers of __alloc_alien() check for NULL. We must do the same check in __alloc_alien_cache to avoid NULL pointer dereferences on allocation failures. Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e98974906aa-000000@email.amazonses.com Fixes: |
||
|
574823bfab |
Change mincore() to count "mapped" pages rather than "cached" pages
The semantics of what "in core" means for the mincore() system call are somewhat unclear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache" rather than "page is mapped in the mapping". The problem with that traditional semantic is that it exposes a lot of system cache state that it really probably shouldn't, and that users shouldn't really even care about. So let's try to avoid that information leak by simply changing the semantics to be that mincore() counts actual mapped pages, not pages that might be cheaply mapped if they were faulted (note the "might be" part of the old semantics: being in the cache doesn't actually guarantee that you can access them without IO anyway, since things like network filesystems may have to revalidate the cache before use). In many ways the old semantics were somewhat insane even aside from the information leak issue. From the very beginning (and that beginning is a long time ago: 2.3.52 was released in March 2000, I think), the code had a comment saying Later we can get more picky about what "in core" means precisely. and this is that "later". Admittedly it is much later than is really comfortable. NOTE! This is a real semantic change, and it is for example known to change the output of "fincore", since that program literally does a mmmap without populating it, and then doing "mincore()" on that mapping that doesn't actually have any pages in it. I'm hoping that nobody actually has any workflow that cares, and the info leak is real. We may have to do something different if it turns out that people have valid reasons to want the old semantics, and if we can limit the information leak sanely. Cc: Kevin Easton <kevin@guarana.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Masatake YAMATO <yamato@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a65981109f |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: - procfs updates - various misc bits - lib/ updates - epoll updates - autofs - fatfs - a few more MM bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits) mm/page_io.c: fix polled swap page in checkpatch: add Co-developed-by to signature tags docs: fix Co-Developed-by docs drivers/base/platform.c: kmemleak ignore a known leak fs: don't open code lru_to_page() fs/: remove caller signal_pending branch predictions mm/: remove caller signal_pending branch predictions arch/arc/mm/fault.c: remove caller signal_pending_branch predictions kernel/sched/: remove caller signal_pending branch predictions kernel/locking/mutex.c: remove caller signal_pending branch predictions mm: select HAVE_MOVE_PMD on x86 for faster mremap mm: speed up mremap by 20x on large regions mm: treewide: remove unused address argument from pte_alloc functions initramfs: cleanup incomplete rootfs scripts/gdb: fix lx-version string output kernel/kcov.c: mark write_comp_data() as notrace kernel/sysctl: add panic_print into sysctl panic: add options to print system info when panic happens bfs: extra sanity checking and static inode bitmap exec: separate MM_ANONPAGES and RLIMIT_STACK accounting ... |
||
|
b685a7350a |
mm/page_io.c: fix polled swap page in
swap_readpage() wants to do polling to bring in pages if asked to, but it doesn't mark the bio as being polled. Additionally, the looping around the blk_poll() check isn't correct - if we get a zero return, we should call io_schedule(), we can't just assume that the bio has completed. The regular bio->bi_private check should be used for that. Link: http://lkml.kernel.org/r/e15243a8-2cdf-c32c-ecee-f289377c8ef9@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f86196ea87 |
fs: don't open code lru_to_page()
Multiple filesystems open code lru_to_page(). Rectify this by moving the macro from mm_inline (which is specific to lru stuff) to the more generic mm.h header and start using the macro where appropriate. No functional changes. Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Acked-by: "Yan, Zheng" <zyan@redhat.com> [ceph] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
fa45f1162f |
mm/: remove caller signal_pending branch predictions
This is already done for us internally by the signal machinery. Link: http://lkml.kernel.org/r/20181116002713.8474-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2c91bd4a4e |
mm: speed up mremap by 20x on large regions
Android needs to mremap large regions of memory during memory management related operations. The mremap system call can be really slow if THP is not enabled. The bottleneck is move_page_tables, which is copying each pte at a time, and can be really slow across a large map. Turning on THP may not be a viable option, and is not for us. This patch speeds up the performance for non-THP system by copying at the PMD level when possible. The speedup is an order of magnitude on x86 (~20x). On a 1GB mremap, the mremap completion times drops from 3.4-3.6 milliseconds to 144-160 microseconds. Before: Total mremap time for 1GB data: 3521942 nanoseconds. Total mremap time for 1GB data: 3449229 nanoseconds. Total mremap time for 1GB data: 3488230 nanoseconds. After: Total mremap time for 1GB data: 150279 nanoseconds. Total mremap time for 1GB data: 144665 nanoseconds. Total mremap time for 1GB data: 158708 nanoseconds. If THP is enabled the optimization is mostly skipped except in certain situations. [joel@joelfernandes.org: fix 'move_normal_pmd' unused function warning] Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Michal Hocko <mhocko@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4cf5892495 |
mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap". This series speeds up the mremap(2) syscall by copying page tables at the PMD level even for non-THP systems. There is concern that the extra 'address' argument that mremap passes to pte_alloc may do something subtle architecture related in the future that may make the scheme not work. Also we find that there is no point in passing the 'address' to pte_alloc since its unused. This patch therefore removes this argument tree-wide resulting in a nice negative diff as well. Also ensuring along the way that the enabled architectures do not do anything funky with the 'address' argument that goes unnoticed by the optimization. Build and boot tested on x86-64. Build tested on arm64. The config enablement patch for arm64 will be posted in the future after more testing. The changes were obtained by applying the following Coccinelle script. (thanks Julia for answering all Coccinelle questions!). Following fix ups were done manually: * Removal of address argument from pte_fragment_alloc * Removal of pte_alloc_one_fast definitions from m68k and microblaze. // Options: --include-headers --no-includes // Note: I split the 'identifier fn' line, so if you are manually // running it, please unsplit it so it runs for you. virtual patch @pte_alloc_func_def depends on patch exists@ identifier E2; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; type T2; @@ fn(... - , T2 E2 ) { ... } @pte_alloc_func_proto_noarg depends on patch exists@ type T1, T2, T3, T4; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ ( - T3 fn(T1, T2); + T3 fn(T1); | - T3 fn(T1, T2, T4); + T3 fn(T1, T2); ) @pte_alloc_func_proto depends on patch exists@ identifier E1, E2, E4; type T1, T2, T3, T4; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ ( - T3 fn(T1 E1, T2 E2); + T3 fn(T1 E1); | - T3 fn(T1 E1, T2 E2, T4 E4); + T3 fn(T1 E1, T2 E2); ) @pte_alloc_func_call depends on patch exists@ expression E2; identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; @@ fn(... -, E2 ) @pte_alloc_macro depends on patch exists@ identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$"; identifier a, b, c; expression e; position p; @@ ( - #define fn(a, b, c) e + #define fn(a, b) e | - #define fn(a, b) e + #define fn(a) e ) Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
96d4f267e4 |
Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1ac5cd4978 |
block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE)
This mostly reverts commit
|
||
|
3868772b99 |
A fairly normal cycle for documentation stuff. We have a new
document on perf security, more Italian translations, more improvements to the memory-management docs, improvements to the pathname lookup documentation, and the usual array of smaller fixes. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAlwmSPkPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5Y9ZoH/joPnMFykOxS0SmdfI7Z+F4EiJct/ZwF9bHx T673T0RC30IgnUXGmBl5OtktfWqVh9aGqHOGwgh65ybp2QvzemdP0k6Lu6RtwNk9 6LfkpvuUb8FzaQmCHnSMzMSDmXtZUw3Z/mOjCBcQtfGAsUULNT08xl+Dr+gwWIWt H+gPEEP+MCXTOQO1jm2dHOHW8NGm6XOijMTpOxp/pkoEY5tUxkVB1T//8EeX7LVh c1QHzFrufE3bmmubCLtIuyVqZbm/V5l6rHREDQ46fnH/G9fM4gojzsrAL/Y2m4bt E4y0XJHycjLMRDimAnYhbPm1ryTFAX1lNzHP3M/EF6Heqx8YHAk= =vtwu -----END PGP SIGNATURE----- Merge tag 'docs-5.0' of git://git.lwn.net/linux Pull documentation update from Jonathan Corbet: "A fairly normal cycle for documentation stuff. We have a new document on perf security, more Italian translations, more improvements to the memory-management docs, improvements to the pathname lookup documentation, and the usual array of smaller fixes. As is often the case, there are a few reaches outside of Documentation/ to adjust kerneldoc comments" * tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits) docs: improve pathname-lookup document structure configfs: fix wrong name of struct in documentation docs/mm-api: link slab_common.c to "The Slab Cache" section slab: make kmem_cache_create{_usercopy} description proper kernel-doc doc:process: add links where missing docs/core-api: make mm-api.rst more structured x86, boot: documentation whitespace fixup Documentation: devres: note checking needs when converting doc🇮🇹 add some process/* translations doc🇮🇹 fixes in process/1.Intro Documentation: convert path-lookup from markdown to resturctured text Documentation/admin-guide: update admin-guide index.rst Documentation/admin-guide: introduce perf-security.rst file scripts/kernel-doc: Fix struct and struct field attribute processing Documentation: dev-tools: Fix typos in index.rst Correct gen_init_cpio tool's documentation Document /proc/pid PID reuse behavior Documentation: update path-lookup.md for parallel lookups Documentation: Use "while" instead of "whilst" dmaengine: Add mailing list address to the documentation ... |
||
|
55db91fbaa |
Merge branch 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu update from Dennis Zhou: "Michael Cree noted generic UP Alpha has been broken since v3.18. This is a small fix for locking in UP percpu code that fixes the issue" * 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: convert spin_lock_irq to spin_lock_irqsave. |
||
|
f346b0becb |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: - large KASAN update to use arm's "software tag-based mode" - a few misc things - sh updates - ocfs2 updates - just about all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits) kernel/fork.c: mark 'stack_vm_area' with __maybe_unused memcg, oom: notify on oom killer invocation from the charge path mm, swap: fix swapoff with KSM pages include/linux/gfp.h: fix typo mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization memory_hotplug: add missing newlines to debugging output mm: remove __hugepage_set_anon_rmap() include/linux/vmstat.h: remove unused page state adjustment macro mm/page_alloc.c: allow error injection mm: migrate: drop unused argument of migrate_page_move_mapping() blkdev: avoid migration stalls for blkdev pages mm: migrate: provide buffer_migrate_page_norefs() mm: migrate: move migrate_page_lock_buffers() mm: migrate: lock buffers before migrate_page_move_mapping() mm: migration: factor out code to compute expected number of page references mm, page_alloc: enable pcpu_drain with zone capability kmemleak: add config to select auto scan mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init ... |
||
|
0e9da3fbf7 |
for-4.21/block-20181221
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7R8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjiID/97oDjMhNT7rwpuMbHw855h62j1hEN/m+N3 FI0uxivYoYZLD+eJRnMcBwHlKjrCX8iJQAcv9ffI3ThtFW7dnZT3atUacaZVR/Dt IrxdymdBP3qsmuaId5NYBug7rJ+AiqFJKjEvCcSPu5X397J4I3SEbzhfvYLJ/aZX 16o0HJlVVIrcbmq1IP4HwiIIOaKXvPaw04L4z4fpeynRSWG7EAi8NLSnhlR4Rxbb BTiMkCTsjRCFdyO6da4fvNQKWmPGPa3bJkYy3qR99cvJCeIbQjRyCloQlWNJRRgi 3eJpCHVxqFmN0/+DNTJVQEEr4H8o0AVucrLVct1Jc4pessenkpoUniP8vELqwlng Z2VHLkhTfCEmvFlk82grrYdNvGATRsrbswt/PlP4T7rBfr1IpDk8kXDWF59EL2dy ly35Sk3wJGHBl8qa+vEPXOAnaWdqJXuVGpwB4ifOIatOls8mOxwfZjiRc7x05/fC 1O4rR2IfLwRqwoYHs0AJ+h6ohOSn1mkGezl2Tch1VSFcJUOHmuYvraTaUi6hblpA SslaAoEhO39hRBL0HsvsMeqVWM9uzqvFkLDCfNPdiA81H1258CIbo4vF8z6czCIS eeXnTJxVhPVbZgb3a1a93SPwM6KIDZFoIijyd+NqjpU94thlnhYD0QEcKJIKH7os 2p4aHs6ktw== =TRdW -----END PGP SIGNATURE----- Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "This is the main pull request for block/storage for 4.21. Larger than usual, it was a busy round with lots of goodies queued up. Most notable is the removal of the old IO stack, which has been a long time coming. No new features for a while, everything coming in this week has all been fixes for things that were previously merged. This contains: - Use atomic counters instead of semaphores for mtip32xx (Arnd) - Cleanup of the mtip32xx request setup (Christoph) - Fix for circular locking dependency in loop (Jan, Tetsuo) - bcache (Coly, Guoju, Shenghui) * Optimizations for writeback caching * Various fixes and improvements - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith) * host and target support for NVMe over TCP * Error log page support * Support for separate read/write/poll queues * Much improved polling * discard OOM fallback * Tracepoint improvements - lightnvm (Hans, Hua, Igor, Matias, Javier) * Igor added packed metadata to pblk. Now drives without metadata per LBA can be used as well. * Fix from Geert on uninitialized value on chunk metadata reads. * Fixes from Hans and Javier to pblk recovery and write path. * Fix from Hua Su to fix a race condition in the pblk recovery code. * Scan optimization added to pblk recovery from Zhoujie. * Small geometry cleanup from me. - Conversion of the last few drivers that used the legacy path to blk-mq (me) - Removal of legacy IO path in SCSI (me, Christoph) - Removal of legacy IO stack and schedulers (me) - Support for much better polling, now without interrupts at all. blk-mq adds support for multiple queue maps, which enables us to have a map per type. This in turn enables nvme to have separate completion queues for polling, which can then be interrupt-less. Also means we're ready for async polled IO, which is hopefully coming in the next release. - Killing of (now) unused block exports (Christoph) - Unification of the blk-rq-qos and blk-wbt wait handling (Josef) - Support for zoned testing with null_blk (Masato) - sx8 conversion to per-host tag sets (Christoph) - IO priority improvements (Damien) - mq-deadline zoned fix (Damien) - Ref count blkcg series (Dennis) - Lots of blk-mq improvements and speedups (me) - sbitmap scalability improvements (me) - Make core inflight IO accounting per-cpu (Mikulas) - Export timeout setting in sysfs (Weiping) - Cleanup the direct issue path (Jianchao) - Export blk-wbt internals in block debugfs for easier debugging (Ming) - Lots of other fixes and improvements" * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits) kyber: use sbitmap add_wait_queue/list_del wait helpers sbitmap: add helpers for add/del wait queue handling block: save irq state in blkg_lookup_create() dm: don't reuse bio for flushes nvme-pci: trace SQ status on completions nvme-rdma: implement polling queue map nvme-fabrics: allow user to pass in nr_poll_queues nvme-fabrics: allow nvmf_connect_io_queue to poll nvme-core: optionally poll sync commands block: make request_to_qc_t public nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" nvme-tcp: fix endianess annotations nvmet-tcp: fix endianess annotations nvme-pci: refactor nvme_poll_irqdisable to make sparse happy nvme-pci: only set nr_maps to 2 if poll queues are supported nvmet: use a macro for default error location nvmet: fix comparison of a u16 with -1 blk-mq: enable IO poll if .nr_queues of type poll > 0 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() blk-mq: skip zero-queue maps in blk_mq_map_swqueue ... |
||
|
7056d3a37d |
memcg, oom: notify on oom killer invocation from the charge path
Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via eventfd anymore. The reason is that |
||
|
7af7a8e19f |
mm, swap: fix swapoff with KSM pages
KSM pages may be mapped to the multiple VMAs that cannot be reached from one anon_vma. So during swapin, a new copy of the page need to be generated if a different anon_vma is needed, please refer to comments of ksm_might_need_to_copy() for details. During swapoff, unuse_vma() uses anon_vma (if available) to locate VMA and virtual address mapped to the page, so not all mappings to a swapped out KSM page could be found. So in try_to_unuse(), even if the swap count of a swap entry isn't zero, the page needs to be deleted from swap cache, so that, in the next round a new page could be allocated and swapin for the other mappings of the swapped out KSM page. But this contradicts with the THP swap support. Where the THP could be deleted from swap cache only after the swap count of every swap entry in the huge swap cluster backing the THP has reach 0. So try_to_unuse() is changed in commit |
||
|
063a7d1d36 |
mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
The kbuild robot reported the following on a development branch that used
memremap.h in a new path:
In file included from arch/m68k/include/asm/pgtable_mm.h:148:0,
from arch/m68k/include/asm/pgtable.h:5,
from include/linux/memremap.h:7,
from drivers//dax/bus.c:3:
arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset':
>> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct'
return mm->pgd + pgd_index(address);
^~
The ->page_fault() callback is specific to HMM. Move it to 'struct
hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in
include/linux/hmm.h. Longer term refactoring this dependency out of HMM
is recommended, but in the meantime memremap.h remains generic.
Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes:
|
||
|
c86aa7bbfd |
hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in truncation
and hold punch code to cover the call to remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for races
becomes 'dead' as it can not longer happen. Remove the dead code and
expand comments to explain reasoning. Similarly, checks for races with
truncation in the page fault path can be simplified and removed.
[mike.kravetz@oracle.com: incorporat suggestions from Kirill]
Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes:
|
||
|
b43a999005 |
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.
[mike.kravetz@oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes:
|
||
|
451b9514a5 |
mm: remove __hugepage_set_anon_rmap()
This function is identical to __page_set_anon_rmap() since the time, when it was introduced (8 years ago). The patch removes the function, and makes its users to use __page_set_anon_rmap() instead. Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
af3b854492 |
mm/page_alloc.c: allow error injection
Model call chain after should_failslab(). Likewise, we can now use a kprobe to override the return value of should_fail_alloc_page() and inject allocation failures into alloc_page*(). This will allow injecting allocation failures using the BCC tools even without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a fail_page_alloc= parameter, which incurs some overhead even when failures are not being injected. On the other hand, this patch adds an unconditional call to should_fail_alloc_page() from page allocation hotpath. That overhead should be rather negligible with CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though. [vbabka@suse.cz: changelog addition] Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ab41ee6879 |
mm: migrate: drop unused argument of migrate_page_move_mapping()
All callers of migrate_page_move_mapping() now pass NULL for 'head' argument. Drop it. Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
89cb0888ca |
mm: migrate: provide buffer_migrate_page_norefs()
Provide a variant of buffer_migrate_page() that also checks whether there are no unexpected references to buffer heads. This function will then be safe to use for block device pages. [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)] Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
84ade7c15c |
mm: migrate: move migrate_page_lock_buffers()
buffer_migrate_page() is the only caller of migrate_page_lock_buffers() move it close to it and also drop the now unused stub for !CONFIG_BLOCK. Link: http://lkml.kernel.org/r/20181211172143.7358-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
cc4f11e69f |
mm: migrate: lock buffers before migrate_page_move_mapping()
Lock buffers before calling into migrate_page_move_mapping() so that that function doesn't have to know about buffers (which is somewhat unexpected anyway) and all the buffer head logic is in buffer_migrate_page(). Link: http://lkml.kernel.org/r/20181211172143.7358-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0b3901b38d |
mm: migration: factor out code to compute expected number of page references
Patch series "mm: migrate: Fix page migration stalls for blkdev pages". This patchset deals with page migration stalls that were reported by our customer due to a block device page that had a bufferhead that was in the bh LRU cache. The patchset modifies the page migration code so that bufferheads are completely handled inside buffer_migrate_page() and then provides a new migration helper for pages with buffer heads that is safe to use even for block device pages and that also deals with bh lrus. This patch (of 6): Factor out function to compute number of expected page references in migrate_page_move_mapping(). Note that we move hpage_nr_pages() and page_has_private() checks from under xas_lock_irq() however this is safe since we hold page lock. [jack@suse.cz: fix expected_page_refs()] Link: http://lkml.kernel.org/r/20181217131710.GB8611@quack2.suse.cz Link: http://lkml.kernel.org/r/20181211172143.7358-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d9367bd06f |
mm, page_alloc: enable pcpu_drain with zone capability
drain_all_pages is documented to drain per-cpu pages for a given zone (if non-NULL). The current implementation doesn't match the description though. It will drain all pcp pages for all zones that happen to have cached pages on the same cpu as the given zone. This will lead to premature pcp cache draining for zones that are not of any interest to the caller - e.g. compaction, hwpoison or memory offline. This forces the page allocator to take locks and potential lock contention as a result. There is no real reason for this sub-optimal implementation. Replace per-cpu work item with a dedicated structure which contains a pointer to the zone and pass it over to the worker. This will get the zone information all the way down to the worker function and do the right job. [akpm@linux-foundation.org: avoid 80-col tricks] [mhocko@suse.com: refactor the whole changelog] Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d53ce04227 |
kmemleak: add config to select auto scan
Kmemleak scan can be cpu intensive and can stall user tasks at times. To prevent this, add config DEBUG_KMEMLEAK_AUTO_SCAN to enable/disable auto scan on boot up. Also protect first_run with DEBUG_KMEMLEAK_AUTO_SCAN as this is meant for only first automatic scan. Link: http://lkml.kernel.org/r/1540231723-7087-1-git-send-email-prpatel@nvidia.com Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com> Signed-off-by: Sachin Nikam <snikam@nvidia.com> Signed-off-by: Prateek <prpatel@nvidia.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
3c0c12cc8f |
mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
When CONFIG_KASAN is enabled on large memory SMP systems, the deferrred pages initialization can take a long time. Below were the reported init times on a 8-socket 96-core 4TB IvyBridge system. 1) Non-debug kernel without CONFIG_KASAN [ 8.764222] node 1 initialised, 132086516 pages in 7027ms 2) Debug kernel with CONFIG_KASAN [ 146.288115] node 1 initialised, 132075466 pages in 143052ms So the page init time in a debug kernel was 20X of the non-debug kernel. The long init time can be problematic as the page initialization is done with interrupt disabled. In this particular case, it caused the appearance of following warning messages as well as NMI backtraces of all the cores that were doing the initialization. [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252 [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253 [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253 [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253 [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253 [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253 [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253 [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2) The long init time was mainly caused by the call to kasan_free_pages() to poison the newly initialized pages. On a 4TB system, we are talking about almost 500GB of memory probably on the same node. In reality, we may not need to poison the newly initialized pages before they are ever allocated. So KASAN poisoning of freed pages before the completion of deferred memory initialization is now disabled. Those pages will be properly poisoned when they are allocated or freed after deferred pages initialization is done. With this change, the new page initialization time became: [ 21.948010] node 1 initialised, 132075466 pages in 18702ms This was still about double the non-debug kernel time, but was much better than before. Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
125b860b25 |
mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES
Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code. If someone adds extra migrate type, then he may forget to enlarge the NR_PAGEBLOCK_BITS. Hence it requires some way to fix. NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, while these macro spread on two different .h file with reverse dependency, it is a little hard to refer to MIGRATE_TYPES in pageblock-flag.h. This patch tries to remind such relation in compiling-time. Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
fcf9a0ef8d |
ksm: react on changing "sleep_millisecs" parameter faster
ksm thread unconditionally sleeps in ksm_scan_thread() after each iteration: schedule_timeout_interruptible( msecs_to_jiffies(ksm_thread_sleep_millisecs)) The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs. In case of user writes a big value by a mistake, and the thread enters into schedule_timeout_interruptible(), it's not possible to cancel the sleep by writing a new smaler value; the thread is just sleeping till timeout expires. The patch fixes the problem by waking the thread each time after the value is updated. This also may be useful for debug purposes; and also for userspace daemons, which change sleep_millisecs value in dependence of system load. Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
e0975b2aae |
mm, fault_around: do not take a reference to a locked page
filemap_map_pages takes a speculative reference to each page in the range before it tries to lock that page. While this is correct it also can influence page migration which will bail out when seeing an elevated reference count. The faultaround code would bail on seeing a locked page so we can pro-actively check the PageLocked bit before page_cache_get_speculative and prevent from pointless reference count churn. Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
bb8965bd82 |
mm, memory_hotplug: deobfuscate migration part of offlining
Memory migration might fail during offlining and we keep retrying in that case. This is currently obfuscated by goto retry loop. The code is hard to follow and as a result it is even suboptimal becase each retry round scans the full range from start_pfn even though we have successfully scanned/migrated [start_pfn, pfn] range already. This is all only because check_pages_isolated failure has to rescan the full range again. De-obfuscate the migration retry loop by promoting it to a real for loop. In fact remove the goto altogether by making it a proper double loop (yeah, gotos are nasty in this specific case). In the end we will get a slightly more optimal code which is better readable. [akpm@linux-foundation.org: reflow comments to 80 cols] Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a85009c377 |
mm, memory_hotplug: try to migrate full pfn range
Patch series "few memory offlining enhancements". I have been chasing memory offlining not making progress recently. On the way I have noticed few weird decisions in the code. The migration itself is restricted without a reasonable justification and the retry loop around the migration is quite messy. This is addressed by patch 1 and patch 2. Patch 3 is targeting on the faultaround code which has been a hot candidate for the initial issue reported upstream [2] and that I am debugging internally. It turned out to be not the main contributor in the end but I believe we should address it regardless. See the patch description for more details. [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv This patch (of 3): do_migrate_range has been limiting the number of pages to migrate to 256 for some reason which is not documented. Even if the limit made some sense back then when it was introduced it doesn't really serve a good purpose these days. If the range contains huge pages then we break out of the loop too early and go through LRU and pcp caches draining and scan_movable_pages is quite suboptimal. The only reason to limit the number of pages I can think of is to reduce the potential time to react on the fatal signal. But even then the number of pages is a questionable metric because even a single page migration might block in a non-killable state (e.g. __unmap_and_move). Remove the limit and offline the full requested range (this is one memblock worth of pages with the current code). Should we ever get a report that offlining takes too long to react on fatal signal then we should rather fix the core migration to use killable waits and bailout on a signal. Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7635d9cbe8 |
mm, thp, proc: report THP eligibility for each vma
Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ac46d4f3c4 |
mm/mmu_notifier: use structure for invalidate_range_start/end calls v2
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5d6527a784 |
mm/mmu_notifier: use structure for invalidate_range_start/end callback
Patch series "mmu notifier contextual informations", v2. This patchset adds contextual information, why an invalidation is happening, to mmu notifier callback. This is necessary for user of mmu notifier that wish to maintains their own data structure without having to add new fields to struct vm_area_struct (vma). For instance device can have they own page table that mirror the process address space. When a vma is unmap (munmap() syscall) the device driver can free the device page table for the range. Today we do not have any information on why a mmu notifier call back is happening and thus device driver have to assume that it is always an munmap(). This is inefficient at it means that it needs to re-allocate device page table on next page fault and rebuild the whole device driver data structure for the range. Other use case beside munmap() also exist, for instance it is pointless for device driver to invalidate the device page table when the invalidation is for the soft dirtyness tracking. Or device driver can optimize away mprotect() that change the page table permission access for the range. This patchset enables all this optimizations for device drivers. I do not include any of those in this series but another patchset I am posting will leverage this. The patchset is pretty simple from a code point of view. The first two patches consolidate all mmu notifier arguments into a struct so that it is easier to add/change arguments. The last patch adds the contextual information (munmap, protection, soft dirty, clear, ...). This patch (of 3): To avoid having to change many callback definition everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end callback. No functional changes with this patch. [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc] Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jason Gunthorpe <jgg@mellanox.com> [infiniband] Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b15c87263a |
hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
We have received a bug report that an injected MCE about faulty memory prevents memory offline to succeed on 4.4 base kernel. The underlying reason was that the HWPoison page has an elevated reference count and the migration keeps failing. There are two problems with that. First of all it is dubious to migrate the poisoned page because we know that accessing that memory is possible to fail. Secondly it doesn't make any sense to migrate a potentially broken content and preserve the memory corruption over to a new location. Oscar has found out that 4.4 and the current upstream kernels behave slightly differently with his simply testcase === int main(void) { int ret; int i; int fd; char *array = malloc(4096); char *array_locked = malloc(4096); fd = open("/tmp/data", O_RDONLY); read(fd, array, 4095); for (i = 0; i < 4096; i++) array_locked[i] = 'd'; ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked)); if (ret) perror("mlock"); sleep (20); ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON); if (ret) perror("madvise"); for (i = 0; i < 4096; i++) array_locked[i] = 'd'; return 0; } === + offline this memory. In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU list kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340 kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170 kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40 kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0 kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160 kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100 kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0 kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70 kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs] kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200 kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0 kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660 kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140 kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120 kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50 kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0 kernel: [<ffffffff81215f08>] do_execve+0x28/0x30 kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130 kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80 And that latter confuses the hotremove path because an LRU page is attempted to be migrated and that fails due to an elevated reference count. It is quite possible that the reuse of the HWPoisoned page is some kind of fixed race condition but I am not really sure about that. With the upstream kernel the failure is slightly different. The page doesn't seem to have LRU bit set but isolate_movable_page simply fails and do_migrate_range simply puts all the isolated pages back to LRU and therefore no progress is made and scan_movable_pages finds same set of pages over and over again. Fix both cases by explicitly checking HWPoisoned pages before we even try to get reference on the page, try to unmap it if it is still mapped. As explained by Naoya: : Hwpoison code never unmapped those for no big reason because : Ksm pages never dominate memory, so we simply didn't have strong : motivation to save the pages. Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU HWPoison pages which shouldn't happen but I couldn't convince myself about that. Naoya has noted the following: : Theoretically no such gurantee, because try_to_unmap() doesn't have a : guarantee of success and then memory_failure() returns immediately : when hwpoison_user_mappings fails. : Or the following code (comes after hwpoison_user_mappings block) also impli= : es : that the target page can still have PageLRU flag. : : /* : * Torn down by someone else? : */ : if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) { : action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED); : res =3D -EBUSY; : goto out; : } : : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in : current version of your patch. Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.com> Debugged-by: Oscar Salvador <osalvador@suse.com> Tested-by: Oscar Salvador <osalvador@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9f1eb38e0e |
mm, kmemleak: little optimization while scanning
kmemleak_scan() goes through all online nodes and tries to scan all used pages. We can do better and use pfn_to_online_page(), so in case we have CONFIG_MEMORY_HOTPLUG, offlined pages will be skiped automatically. For boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page() will fallback to pfn_valid(). Another little optimization is to check if the page belongs to the node we are currently checking, so in case we have nodes interleaved we will not check the same pfn multiple times. I ran some tests: Add some memory to node1 and node2 making it interleaved: (qemu) object_add memory-backend-ram,id=ram0,size=1G (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1 (qemu) object_add memory-backend-ram,id=ram1,size=1G (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2 (qemu) object_add memory-backend-ram,id=ram2,size=1G (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1 Then, we offline that memory: # for i in {32..39} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done # for i in {48..55} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;don # for i in {40..47} ; do echo "offline" > /sys/devices/system/node/node2/memory$i/state;done And we run kmemleak_scan: # echo "scan" > /sys/kernel/debug/kmemleak before the patch: kmemleak: time spend: 41596 us after the patch: kmemleak: time spend: 34899 us [akpm@linux-foundation.org: remove stray newline, per Oscar] Link: http://lkml.kernel.org/r/20181206131918.25099-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c16eb000ca |
mm/filemap.c: remove useless check in pagecache_get_page()
page always is not NULL, so we may remove this useless check. Link: http://lkml.kernel.org/r/154419752044.18559.2452963074922917720.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
bbe5d9939e |
mm/page_alloc.c: drop uneeded __meminit and __meminitdata
Since commit
|
||
|
3fa750dcf2 |
mm/page-writeback.c: don't break integrity writeback on ->writepage() error
write_cache_pages() is used in both background and integrity writeback scenarios by various filesystems. Background writeback is mostly concerned with cleaning a certain number of dirty pages based on various mm heuristics. It may not write the full set of dirty pages or wait for I/O to complete. Integrity writeback is responsible for persisting a set of dirty pages before the writeback job completes. For example, an fsync() call must perform integrity writeback to ensure data is on disk before the call returns. write_cache_pages() unconditionally breaks out of its processing loop in the event of a ->writepage() error. This is fine for background writeback, which had no strict requirements and will eventually come around again. This can cause problems for integrity writeback on filesystems that might need to clean up state associated with failed page writeouts. For example, XFS performs internal delayed allocation accounting before returning a ->writepage() error, where applicable. If the current writeback happens to be associated with an unmount and write_cache_pages() completes the writeback prematurely due to error, the filesystem is unmounted in an inconsistent state if dirty+delalloc pages still exist. To handle this problem, update write_cache_pages() to always process the full set of pages for integrity writeback regardless of ->writepage() errors. Save the first encountered error and return it to the caller once complete. This facilitates XFS (or any other fs that expects integrity writeback to process the entire set of dirty pages) to clean up its internal state completely in the event of persistent mapping errors. Background writeback continues to exit on the first error encountered. [akpm@linux-foundation.org: fix typo in comment] Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0ecea993d0 |
mm/hmm.c: remove set but not used variable 'devmem'
Fixes gcc '-Wunused-but-set-variable' warning: mm/hmm.c: In function 'hmm_devmem_ref_kill': mm/hmm.c:995:21: warning: variable 'devmem' set but not used [-Wunused-but-set-variable] It not used any more since 35d39f953d4e ("mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages()") Link: http://lkml.kernel.org/r/1543629971-128377-1-git-send-email-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
fa004ab736 |
mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection
During online_pages phase, pgdat->nr_zones will be updated in case this zone is empty. Currently the online_pages phase is protected by the global locks (device_device_hotplug_lock and mem_hotplug_lock), which ensures there is no contention during the update of nr_zones. These global locks introduces scalability issues (especially the second one), which slow down code relying on get_online_mems(). This is also a preparation for not having to rely on get_online_mems() but instead some more fine grained locks. The patch moves init_currently_empty_zone under both zone_span_writelock and pgdat_resize_lock because both the pgdat state is changed (nr_zones) and the zone's start_pfn. Also this patch changes the documentation of node_size_lock to include the protection of nr_zones. Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4e0d2e7ef1 |
mm, sparse: pass nid instead of pgdat to sparse_add_one_section()
Since the information needed in sparse_add_one_section() is node id to allocate proper memory, it is not necessary to pass its pgdat. This patch changes the prototype of sparse_add_one_section() to pass node id directly. This is intended to reduce misleading that sparse_add_one_section() would touch pgdat. Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
83af658898 |
mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section()
pgdat_resize_lock is used to protect pgdat's memory region information like: node_start_pfn, node_present_pages, etc. While in function sparse_add/remove_one_section(), pgdat_resize_lock is used to protect initialization/release of one mem_section. This looks not proper. These code paths are currently protected by mem_hotplug_lock currently but should there ever be any reason for locking at the sparse layer a dedicated lock should be introduced. Following is the current call trace of sparse_add/remove_one_section() mem_hotplug_begin() arch_add_memory() add_pages() __add_pages() __add_section() sparse_add_one_section() mem_hotplug_done() mem_hotplug_begin() arch_remove_memory() __remove_pages() __remove_section() sparse_remove_one_section() mem_hotplug_done() The comment above the pgdat_resize_lock also mentions "Holding this will also guarantee that any pfn_valid() stays that way.", which is true with the current implementation and false after this patch. But current implementation doesn't meet this comment. There isn't any pfn walkers to take the lock so this looks like a relict from the past. This patch also removes this comment. [richard.weiyang@gmail.com: v4] Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com [mhocko@suse.com: changelog suggestion] Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
fed84c7852 |
mm/memblock.c: skip kmemleak for kasan_init()
Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and Huawei TaiShan 2280 aarch64 servers). After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early log buffer went from something like 280 to 260000 which caused kmemleak disabled and crash dump memory reservation failed. The multitude of kmemleak_alloc() calls is from nested loops while KASAN is setting up full memory mappings, so let early kmemleak allocations skip those memblock_alloc_internal() calls came from kasan_init() given that those early KASAN memory mappings should not reference to other memory. Hence, no kmemleak false positives. kasan_init kasan_map_populate [1] kasan_pgd_populate [2] kasan_pud_populate [3] kasan_pmd_populate [4] kasan_pte_populate [5] kasan_alloc_zeroed_page memblock_alloc_try_nid memblock_alloc_internal kmemleak_alloc [1] for_each_memblock(memory, reg) [2] while (pgdp++, addr = next, addr != end) [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp))) [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp))) [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep))) Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us Signed-off-by: Qian Cai <cai@gmx.us> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2c2a5af6fe |
mm, memory_hotplug: add nid parameter to arch_remove_memory
Patch series "Do not touch pages in hot-remove path", v2. This patchset aims for two things: 1) A better definition about offline and hot-remove stage 2) Solving bugs where we can access non-initialized pages during hot-remove operations [2] [3]. This is achieved by moving all page/zone handling to the offline stage, so we do not need to access pages when hot-removing memory. [1] https://patchwork.kernel.org/cover/10691415/ [2] https://patchwork.kernel.org/patch/10547445/ [3] https://www.spinics.net/lists/linux-mm/msg161316.html This patch (of 5): This is a preparation for the following-up patches. The idea of passing the nid is that it will allow us to get rid of the zone parameter afterwards. Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
23b68cfaae |
mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init()
When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of each node's highest zone is initialized before defer stage. static_init_pgcnt is used to store the number of pages like this: pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, pgdat->node_spanned_pages); because we don't want to overflow zone's range. But this is not necessary, since defer_init() is called like this: memmap_init_zone() for pfn in [start_pfn, end_pfn) defer_init(pfn, end_pfn) In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would stop before calling defer_init(). BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct, since nr_initialised is zone based instead of node based. Even node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone would have pages less than PAGES_PER_SECTION. Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9a1ea439b1 |
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along since 2006: but you cannot safely wait_on_page_locked() without holding a reference to the page, and that extra reference is enough to make migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults on the entry before migrate_page_move_mapping() gets there. And that failure is retried nine times, amplifying the pain when trying to migrate a popular page. With a single persistent faulter, migration sometimes succeeds; with two or three concurrent faulters, success becomes much less likely (and the more the page was mapped, the worse the overhead of unmapping and remapping it on each try). This is especially a problem for memory offlining, where the outer level retries forever (or until terminated from userspace), because a heavy refault workload can trigger an endless loop of migration failures. wait_on_page_locked() is the wrong tool for the job. David Herrmann (but was he the first?) noticed this issue in 2014: https://marc.info/?l=linux-mm&m=140110465608116&w=2 Tim Chen started a thread in August 2017 which appears relevant: https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went on to implicate __migration_entry_wait(): https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended up with the v4.14 commits: |
||
|
f0c867d958 |
mm, oom: add oom victim's memcg to the oom context information
The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> - memcg oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ef8444ea01 |
mm, oom: reorganize the oom report in dump_header
OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
e5cb113f2d |
mm: make free_reserved_area() return "const char *"
and propagate through down the call stack. Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9a2f45ff32 |
mm/debug.c: make "migrate_reason_names[]" const char *
Those strings are immutable as well. Link: http://lkml.kernel.org/r/20181124090508.GB10877@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c999fbd3dc |
mm/mmzone.c: make "migratetype_names" const char *
Those strings are immutable in fact. Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1c30844d2d |
mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0a79cdad5e |
mm: use alloc_flags to record if kswapd can wake
This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM into alloc_flags. This is a preparation patch only that avoids having to pass gfp_mask through a long callchain in a future patch. Note that the setting in the fast path happens in alloc_flags_nofragment() and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT. That's true in this patch but is not true later so it's done now for easier review to show where the flag needs to be recorded. No functional change. [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case] Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a921444382 |
mm: move zone watermark accesses behind an accessor
This is a preparation patch only, no functional change. Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6bb154504f |
mm, page_alloc: spread allocations across zones before introducing fragmentation
Patch series "Fragmentation avoidance improvements", v5.
It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, a definition
of external fragmentation events (including serious) ones and a series
that reduces the level of those fragmentation events.
The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.
The broad details of the workload are as follows;
1. Create an XFS filesystem (not specified in the configuration but done
as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameterr create_on_open=1) and fallocate
is not used (fallocate=none). With multiple IO issuers this creates
a mix of slab and page cache allocations over time. The total size
of the files is 150% physical memory so that the slabs and page cache
pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
created in step 2. This part runs for the same length of time it
took to create the files. It'll fault back in old data and further
interleave slab and page cache allocations. As it's now low on
memory due to step 2, fragmentation occurs as pageblocks get
stolen.
4. While step 3 is still running, start a process that tries to allocate
75% of memory as huge pages with a number of threads. The number of
threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
threads contending with fio, any other threads or forcing cross-NUMA
scheduling. Note that the test has not been used on a machine with less
than 8 cores. The benchmark records whether huge pages were allocated
and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
6. Cleanup
Overall the series reduces external fragmentation causing events by over 94%
on 1 and 2 socket machines, which in turn impacts high-order allocation
success rates over the long term. There are differences in latencies and
high-order allocation success rates. Latencies are a mixed bag as they
are vulnerable to exact system state and whether allocations succeeded
so they are treated as a secondary metric.
Patch 1 uses lower zones if they are populated and have free memory
instead of fragmenting a higher zone. It's special cased to
handle a Normal->DMA32 fallback with the reasons explained
in the changelog.
Patch 2-4 boosts watermarks temporarily when an external fragmentation
event occurs. kswapd wakes to reclaim a small amount of old memory
and then wakes kcompactd on completion to recover the system
slightly. This introduces some overhead in the slowpath. The level
of boosting can be tuned or disabled depending on the tolerance
for fragmentation vs allocation latency.
Patch 5 stalls some movable allocation requests to let kswapd from patch 4
make some progress. The duration of the stalls is very low but it
is possible to tune the system to avoid fragmentation events if
larger stalls can be tolerated.
The bulk of the improvement in fragmentation avoidance is from patches
1-4 but patch 5 can deal with a rare corner case and provides the option
of tuning a system for THP allocation success rates in exchange for
some stalls to control fragmentation.
This patch (of 5):
The page allocator zone lists are iterated based on the watermarks of each
zone which does not take anti-fragmentation into account. On x86, node 0
may have multiple zones while other nodes have one zone. A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory. This patch special cases the
allocator fast path such that it'll try an allocation from a lower local
zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock are still allowed in the fast
path as they are uninteresting from a fragmentation point of view.
This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations. It's implemented in mmtests as the following
configurations
configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage
e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
The broad details of the workload are as follows;
1. Create an XFS filesystem (not specified in the configuration but done
as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameter create_on_open=1) and fallocate
is not used (fallocate=none). With multiple IO issuers this creates
a mix of slab and page cache allocations over time. The total size
of the files is 150% physical memory so that the slabs and page cache
pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files
created in step 2. This part runs for the same length of time it
took to create the files. It'll refault old data and further
interleave slab and page cache allocations. As it's now low on
memory due to step 2, fragmentation occurs as pageblocks get
stolen.
4. While step 3 is still running, start a process that tries to allocate
75% of memory as huge pages with a number of threads. The number of
threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
threads contending with fio, any other threads or forcing cross-NUMA
scheduling. Note that the test has not been used on a machine with less
than 8 cores. The benchmark records whether huge pages were allocated
and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
6. Cleanup the test files.
Note that due to the use of IO and page cache that this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation that it's not the only mix that generates fragmentation.
Differences in workload that are more slab-intensive or whether SLUB is
used with high-order pages may yield different results.
When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future.
Hence, the primary metric here is the number of external fragmentation
events that occur with order < 9. The secondary metric is allocation
latency and huge page allocation success rates but note that differences
in latencies and what the success rate also can affect the number of
external fragmentation event which is why it's a secondary metric.
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
4.20-rc3 extfrag events < order 9: 804694
4.20-rc3+patch: 408912 (49% reduction)
thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Amean fault-base-1 662.92 ( 0.00%) 653.58 * 1.41%*
Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)
Fault latencies are slightly reduced while allocation success rates remain
at zero as this configuration does not make any special effort to allocate
THP and fio is heavily active at the time and either filling memory or
keeping pages resident. However, a 49% reduction of serious fragmentation
events reduces the changes of external fragmentation being a problem in
the future.
Vlastimil asked during review for a breakdown of the allocation types
that are falling back.
vanilla
3816 MIGRATE_UNMOVABLE
800845 MIGRATE_MOVABLE
33 MIGRATE_UNRECLAIMABLE
patch
735 MIGRATE_UNMOVABLE
408135 MIGRATE_MOVABLE
42 MIGRATE_UNRECLAIMABLE
The majority of the fallbacks are due to movable allocations and this is
consistent for the workload throughout the series so will not be presented
again as the primary source of fallbacks are movable allocations.
Movable fallbacks are sometimes considered "ok" to fallback because they
can be migrated. The problem is that they can fill an
unmovable/reclaimable pageblock causing those allocations to fallback
later and polluting pageblocks with pages that cannot move. If there is a
movable fallback, it is pretty much guaranteed to affect an
unmovable/reclaimable pageblock and while it might not be enough to
actually cause a unmovable/reclaimable fallback in the future, we cannot
know that in advance so the patch takes the only option available to it.
Hence, it's important to control them. This point is also consistent
throughout the series and will not be repeated.
1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9: 291392
4.20-rc3+patch: 191187 (34% reduction)
thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Amean fault-base-1 1495.14 ( 0.00%) 1467.55 ( 1.85%)
Amean fault-huge-1 1098.48 ( 0.00%) 1127.11 ( -2.61%)
thpfioscale Percentage Faults Huge
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Percentage huge-1 78.57 ( 0.00%) 77.64 ( -1.18%)
Fragmentation events were reduced quite a bit although this is known
to be a little variable. The latencies and allocation success rates
are similar but they were already quite high.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
4.20-rc3 extfrag events < order 9: 215698
4.20-rc3+patch: 200210 (7% reduction)
thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Amean fault-base-5 1350.05 ( 0.00%) 1346.45 ( 0.27%)
Amean fault-huge-5 4181.01 ( 0.00%) 3418.60 ( 18.24%)
4.20.0-rc3 4.20.0-rc3
vanilla lowzone-v5r8
Percentage huge-5 1.15 ( 0.00%) 0.78 ( -31.88%)
The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit
|
||
|
f29d8e9c01 |
mm/memory_hotplug: drop "online" parameter from add_memory_resource()
Userspace should always be in charge of how to online memory and if memory should be onlined automatically in the kernel. Let's drop the parameter to overwrite this - XEN passes memhp_auto_online, just like add_memory(), so we can directly use that instead internally. Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4d72868c8f |
memblock: replace usage of __memblock_free_early() with memblock_free()
__memblock_free_early() is only used by the convenience wrappers, so essentially we wrap a call to memblock_free() twice. Replace calls of __memblock_free_early() with calls to memblock_free() and drop the former. Link: http://lkml.kernel.org/r/20181125102940.GE28634@rapoport-lnx Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Wentao Wang <witallwang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d31cfe7bff |
mm/page_alloc.c: deduplicate __memblock_free_early() and memblock_free()
Link: http://lkml.kernel.org/r/C8ECE1B7A767434691FEEFA3A01765D72AFB8E78@MX203CL03.corp.emc.com Signed-off-by: Wentao Wang <witallwang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
742aa7fb52 |
mm/page_alloc.c: use a single function to free page
There are multiple places of freeing a page, they all do the same things so a common function can be used to reduce code duplicate. It also avoids bug fixed in one function but left in another. Link: http://lkml.kernel.org/r/20181119134834.17765-3-aaron.lu@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pankaj gupta <pagupta@redhat.com> Cc: Pawel Staszewski <pstaszewski@itcare.pl> Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
65895b67ad |
mm/page_alloc.c: free order-0 pages through PCP in page_frag_free()
page_frag_free() calls __free_pages_ok() to free the page back to Buddy. This is OK for high order page, but for order-0 pages, it misses the optimization opportunity of using Per-Cpu-Pages and can cause zone lock contention when called frequently. Pawel Staszewski recently shared his result of 'how Linux kernel handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer found the lock contention comes from page allocator: mlx5e_poll_tx_cq | --16.34%--napi_consume_skb | |--12.65%--__free_pages_ok | | | --11.86%--free_one_page | | | |--10.10%--queued_spin_lock_slowpath | | | --0.65%--_raw_spin_lock | |--1.55%--page_frag_free | --1.44%--skb_release_data Jesper explained how it happened: mlx5 driver RX-page recycle mechanism is not effective in this workload and pages have to go through the page allocator. The lock contention happens during mlx5 DMA TX completion cycle. And the page allocator cannot keep up at these speeds.[2] I thought that __free_pages_ok() are mostly freeing high order pages and thought this is an lock contention for high order pages but Jesper explained in detail that __free_pages_ok() here are actually freeing order-0 pages because mlx5 is using order-0 pages to satisfy its page pool allocation request.[3] The free path as pointed out by Jesper is: skb_free_head() -> skb_free_frag() -> page_frag_free() And the pages being freed on this path are order-0 pages. Fix this by doing similar things as in __page_frag_cache_drain() - send the being freed page to PCP if it's an order-0 page, or directly to Buddy if it is a high order page. With this change, Paweł hasn't noticed lock contention yet in his workload and Jesper has noticed a 7% performance improvement using a micro benchmark and lock contention is gone. Ilias' test on a 'low' speed 1Gbit interface on an cortex-a53 shows ~11% performance boost testing with 64byte packets and __free_pages_ok() disappeared from perf top. [1]: https://www.spinics.net/lists/netdev/msg531362.html [2]: https://www.spinics.net/lists/netdev/msg531421.html [3]: https://www.spinics.net/lists/netdev/msg531556.html [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/20181120014544.GB10657@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
02917e9f86 |
mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
At Maintainer Summit, Greg brought up a topic I proposed around EXPORT_SYMBOL_GPL usage. The motivation was considerations for when EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional step of reclassifying an existing export. Specifically, I wanted to make the case that although the line is fuzzy and hard to specify in abstract terms, it is nonetheless clear that devm_memremap_pages() and HMM (Heterogeneous Memory Management) have crossed it. The devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the beginning, and HMM as a derivative of that functionality should have naturally picked up that designation as well. Contrary to typical rules, the HMM infrastructure was merged upstream with zero in-tree consumers. There was a promise at the time that those users would be merged "soon", but it has been over a year with no drivers arriving. While the Nouveau driver is about to belatedly make good on that promise it is clear that HMM was targeted first and foremost at an out-of-tree consumer. HMM is derived from devm_memremap_pages(), a facility Christoph and I spearheaded to support persistent memory. It combines a device lifetime model with a dynamically created 'struct page' / memmap array for any physical address range. It enables coordination and control of the many code paths in the kernel built to interact with memory via 'struct page' objects. With HMM the integration goes even deeper by allowing device drivers to hook and manipulate page fault and page free events. One interpretation of when EXPORT_SYMBOL is suitable is when it is exporting stable and generic leaf functionality. The devm_memremap_pages() facility continues to see expanding use cases, peer-to-peer DMA being the most recent, with no clear end date when it will stop attracting reworks and semantic changes. It is not suitable to export devm_memremap_pages() as a stable 3rd party driver API due to the fact that it is still changing and manipulates core behavior. Moreover, it is not in the best interest of the long term development of the core memory management subsystem to permit any external driver to effectively define its own system-wide memory management policies with no encouragement to engage with upstream. I am also concerned that HMM was designed in a way to minimize further engagement with the core-MM. That, with these hooks in place, device-drivers are free to implement their own policies without much consideration for whether and how the core-MM could grow to meet that need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the core-MM should be allowed the opportunity and stimulus to change and address these new use cases as first class functionality. Original changelog: hmm_devmem_add(), and hmm_devmem_add_resource() duplicated devm_memremap_pages() and are now simple now wrappers around the core facility to inject a dev_pagemap instance into the global pgmap_radix and hook page-idle events. The devm_memremap_pages() interface is base infrastructure for HMM. HMM has more and deeper ties into the kernel memory management implementation than base ZONE_DEVICE which is itself a EXPORT_SYMBOL_GPL facility. Originally, the HMM page structure creation routines copied the devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the implementations was discussed during the initial review: http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled this cleanup to move forward. In addition to the integration with devm_memremap_pages() HMM depends on other GPL-only symbols: mmu_notifier_unregister_no_release percpu_ref region_intersects __class_create It goes further to consume / indirectly expose functionality that is not exported to any other driver: alloc_pages_vma walk_page_range HMM is derived from devm_memremap_pages(), and extends deep core-kernel fundamentals. Similar to devm_memremap_pages(), mark its entry points EXPORT_SYMBOL_GPL(). [logang@deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()] Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com>, Cc: Michal Hocko <mhocko@suse.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
bbecd94e6c |
mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages()
Commit |
||
|
58ef15b765 |
mm, hmm: use devm semantics for hmm_devmem_{add, remove}
devm semantics arrange for resources to be torn down when device-driver-probe fails or when device-driver-release completes. Similar to devm_memremap_pages() there is no need to support an explicit remove operation when the users properly adhere to devm semantics. Note that devm_kzalloc() automatically handles allocating node-local memory. Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7ead334215 |
mm/page_alloc.c: change the order of MIGRATE_RECLAIMABLE/MIGRATE_MOVABLE in fallbacks
In the enum migratetype definition, MIGRATE_MOVABLE is before MIGRATE_RECLAIMABLE. Change the order of them to match the enumeration's order. Link: http://lkml.kernel.org/r/20181121085821.3442-1-sjhuang@iluvatar.ai Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
66f71da9dd |
mm/swap: use nr_node_ids for avail_lists in swap_info_struct
Since
|
||
|
8b09549c2b |
vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n
Commit
|
||
|
476567e873 |
mm: remove managed_page_count_lock spinlock
Now that totalram_pages and managed_pages are atomic varibles, no need of managed_page_count spinlock. The lock had really a weak consistency guarantee. It hasn't been used for anything but the update but no reader actually cares about all the values being updated to be in sync. Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ca79b0c211 |
mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9705bea5f8 |
mm: convert zone->managed_pages to atomic variable
totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. This patch converts zone->managed_pages. Subsequent patches will convert totalram_panges, totalhigh_pages and eventually managed_page_count_lock will be removed. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
3d6357de8a |
mm: reference totalram_pages and managed_pages once per function
Patch series "mm: convert totalram_pages, totalhigh_pages and managed pages to atomic", v5. This series converts totalram_pages, totalhigh_pages and zone->managed_pages to atomic variables. totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 It seemes better to remove the lock and convert variables to atomic. With the change, preventing poteintial store-to-read tearing comes as a bonus. This patch (of 4): This is in preparation to a later patch which converts totalram_pages and zone->managed_pages to atomic variables. Please note that re-reading the value might lead to a different value and as such it could lead to unexpected behavior. There are no known bugs as a result of the current code but it is better to prevent from them in principle. Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
fecd4a50ba |
mm: remove reset of pcp->counter in pageset_init()
per_cpu_pageset is cleared by memset, it is not necessary to reset it again. Link: http://lkml.kernel.org/r/20181021023920.5501-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
46a3679b81 |
mm, memory_hotplug: do not clear numa_node association after hot_remove
Per-cpu numa_node provides a default node for each possible cpu. The
association gets initialized during the boot when the architecture
specific code explores cpu->NUMA affinity. When the whole NUMA node is
removed though we are clearing this association
try_offline_node
check_and_unmap_cpu_on_node
unmap_cpu_on_node
numa_clear_node
numa_set_node(cpu, NUMA_NO_NODE)
This means that whoever calls cpu_to_node for a cpu associated with such a
node will get NUMA_NO_NODE. This is problematic for two reasons. First
it is fragile because __alloc_pages_node would simply blow up on an
out-of-bound access. We have encountered this when loading kvm module
BUG: unable to handle kernel paging request at 00000000000021c0
IP: __alloc_pages_nodemask+0x93/0xb70
PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
Oops: 0000 [#1] SMP
[...]
CPU: 88 PID: 1223749 Comm: modprobe Tainted: G W 4.4.156-94.64-default #1
RIP: __alloc_pages_nodemask+0x93/0xb70
RSP: 0018:ffff887354493b40 EFLAGS: 00010202
RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
FS: 00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
hardware_setup+0x781/0x849 [kvm_intel]
kvm_arch_hardware_setup+0x28/0x190 [kvm]
kvm_init+0x7c/0x2d0 [kvm]
vmx_init+0x1e/0x32c [kvm_intel]
do_one_initcall+0xca/0x1f0
do_init_module+0x5a/0x1d7
load_module+0x1393/0x1c90
SYSC_finit_module+0x70/0xa0
entry_SYSCALL_64_fastpath+0x1e/0xb7
DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7
on an older kernel but the code is basically the same in the current Linus
tree as well. alloc_vmcs_cpu could use alloc_pages_nodemask which would
recognize NUMA_NO_NODE and use alloc_pages_node which would translate it
to numa_mem_id but that is wrong as well because it would use a cpu
affinity of the local CPU which might be quite far from the original node.
It is also reasonable to expect that cpu_to_node will provide a sane
value and there might be many more callers like that.
The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined. We do not
want to lose that link as there is no arch independent way to get it from
the early boot time AFAICS.
Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues. The NODE_DATA(nid) is not deallocated so
it will stay in place and if anybody wants to allocate from that node then
a fallback node will be used.
Thanks to Vlastimil Babka for his live system debugging skills that helped
debugging the issue.
Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
Fixes:
|
||
|
9cabf929e7 |
mm/mmap.c: remove verify_mm_writelocked()
We should get rid of this function. It no longer serves its purpose. This is a historical artifact from 2005 where do_brk was called outside of the core mm. We do have a proper abstraction in vm_brk_flags and that one does the locking properly so there is no need to use this function. Link: http://lkml.kernel.org/r/20181108174856.10811-1-tiny.windzz@gmail.com Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
59e1a2f4bf |
ksm: replace jhash2 with xxhash
Replace jhash2 with xxhash. Perf numbers: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz ksm: crc32c hash() 12081 MB/s ksm: xxh64 hash() 8770 MB/s ksm: xxh32 hash() 4529 MB/s ksm: jhash2 hash() 1569 MB/s Sioh Lee did some testing: crc32c_intel: 1084.10ns crc32c (no hardware acceleration): 7012.51ns xxhash32: 2227.75ns xxhash64: 1413.16ns jhash2: 5128.30ns As jhash2 always will be slower (for data size like PAGE_SIZE). Don't use it in ksm at all. Use only xxhash for now, because for using crc32c, cryptoapi must be initialized first - that requires some tricky solution to work well in all situations. Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: leesioh <solee@os.korea.ac.kr> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d381c54760 |
mm: only report isolation failures when offlining memory
Heiko has complained that his log is swamped by warnings from has_unmovable_pages [ 20.536664] page dumped because: has_unmovable_pages [ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0 [ 20.536794] flags: 0x3fffe0000010200(slab|head) [ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600 [ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000 [ 20.536797] page dumped because: has_unmovable_pages [ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0 [ 20.536815] flags: 0x7fffe0000000000() [ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000 [ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000 which are not triggered by the memory hotplug but rather CMA allocator. The original idea behind dumping the page state for all call paths was that these messages will be helpful debugging failures. From the above it seems that this is not the case for the CMA path because we are lacking much more context. E.g the second reported page might be a CMA allocated page. It is still interesting to see a slab page in the CMA area but it is hard to tell whether this is bug from the above output alone. Address this issue by dumping the page state only on request. Both start_isolate_page_range and has_unmovable_pages already have an argument to ignore hwpoison pages so make this argument more generic and turn it into flags and allow callers to combine non-default modes into a mask. While we are at it, has_unmovable_pages call from is_pageblock_removable_nolock (sysfs removable file) is questionable to report the failure so drop it from there as well. Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2932c8b050 |
mm, memory_hotplug: be more verbose for memory offline failures
There is only very limited information printed when the memory offlining fails: [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff This tells us that the failure is triggered by the userspace intervention but it doesn't tell us much more about the underlying reason. It might be that the page migration failes repeatedly and the userspace timeout expires and send a signal or it might be some of the earlier steps (isolation, memory notifier) takes too long. If the migration failes then it would be really helpful to see which page that and its state. The same applies to the isolation phase. If we fail to isolate a page from the allocator then knowing the state of the page would be helpful as well. Dump the page state that fails to get isolated or migrated. This will tell us more about the failure and what to focus on during debugging. [akpm@linux-foundation.org: add missing printk arg] [mhocko@suse.com: tweak dump_page() `reason' text] Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7960509329 |
mm, memory_hotplug: print reason for the offlining failure
The memory offlining failure reporting is inconsistent and insufficient. Some error paths simply do not report the failure to the log at all. When we do report there are no details about the reason of the failure and there are several of them which makes memory offlining failures hard to debug. Make sure that the memory offlining [mem %#010llx-%#010llx] failed message is printed for all failures and also provide a short textual reason for the failure e.g. [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff this tells us that the offlining has failed because of a signal pending aka user intervention. [akpm@linux-foundation.org: tweak messages a bit] Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6cc2baf600 |
mm, memory_hotplug: drop pointless block alignment checks from __offline_pages
This function is never called from a context which would provide misaligned pfn range so drop the pointless check. Link: http://lkml.kernel.org/r/20181107101830.17405-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
e0392cf7c5 |
mm: lower the printk loglevel for __dump_page messages
__dump_page messages use KERN_EMERG resp. KERN_ALERT loglevel (this is the case since 2004). Most callers of this function are really detecting a critical page state and BUG right after. On the other hand the function is called also from contexts which just want to inform about the page state and those would rather not disrupt logs that much (e.g. some systems route these messages to the normal console). Reduce the loglevel to KERN_WARNING to make dump_page easier to reuse for other contexts while those messages will still make it to the kernel log in most setups. Even if the loglevel setup filters warnings away those paths that are really critical already print the more targeted error or panic and that should make it to the kernel log. [mhocko@kernel.org: fix __dump_page()] Link: http://lkml.kernel.org/r/20181212142540.GA7378@dhcp22.suse.cz [akpm@linux-foundation.org: s/KERN_WARN/KERN_WARNING/, per Michal] Link: http://lkml.kernel.org/r/20181107101830.17405-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1c6fb1d89e |
mm: print more information about mapping in __dump_page
I have been promissing to improve memory offlining failures debugging for quite some time. As things stand now we get only very limited information in the kernel log when the offlining fails. It is usually only [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed with no further details. We do not know what exactly fails and for what reason. Whenever I was forced to debug such a failure I've always had to do a debugging patch to tell me more. We can enable some tracepoints but it would be much better to get a better picture without using them. This patch series does 2 things. The first one is to make dump_page more usable by printing more information about the mapping patch 1. Then it reduces the log level from emerg to warning so that this function is usable from less critical context patch 2. Then I have added more detailed information about the offlining failure patch 4 and finally add dump_page to isolation and offlining migration paths. Patch 3 is a trivial cleanup. This patch (of 6): __dump_page prints the mapping pointer but that is quite unhelpful for many reports because the pointer itself only helps to distinguish anon/ksm mappings from other ones (because of lowest bits set). Sometimes it would be much more helpful to know what kind of mapping that is actually and if we know this is a file mapping then also try to resolve the dentry name. [dan.carpenter@oracle.com: fix a width vs precision bug in printk] Link: http://lkml.kernel.org/r/20181123072135.gqvblm2vdujbvfjs@kili.mountain [mhocko@kernel.org: use %dp to print dentry] Link: http://lkml.kernel.org/r/20181125080834.GB12455@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181107101830.17405-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
20ff1c9505 |
mm/readahead.c: simplify get_next_ra_size()
It's a trivial simplification for get_next_ra_size() and clear enough for humans to understand. It also fixes potential overflow if ra->size(< ra_pages) is too large. Link: http://lkml.kernel.org/r/1540707206-19649-1-git-send-email-hsiangkao@aol.com Signed-off-by: Gao Xiang <hsiangkao@aol.com> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6a90a83f1d |
mm/mmu_notifier.c: remove mmu_notifier_synchronize()
Contrary to its name, mmu_notifier_synchronize() does not synchronize the notifier's SRCU instance, but rather waits for RCU callbacks to finish. i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on this matter, explicitly calling out that rcu_barrier() does not imply synchronize_rcu(). As there are no callers of mmu_notifier_synchronize() and it's unclear whether any user of mmu_notifier_call_srcu() will ever want to barrier on their callbacks, simply remove the function. Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5eb570a8d9 |
mm/hotplug: optimize clear_hwpoisoned_pages()
In hot remove, we try to clear poisoned pages, but a small optimization to check if num_poisoned_pages is 0 helps remove the iteration through nr_pages. [akpm@linux-foundation.org: tweak comment text] Link: http://lkml.kernel.org/r/20181102120001.4526-1-bsingharora@gmail.com Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c8f61cfc87 |
mm/page_owner: clamp read count to PAGE_SIZE
The (root-only) page owner read might allocate a large size of memory with a large read count. Allocation fails can easily occur when doing high order allocations. Clamp buffer size to PAGE_SIZE to avoid arbitrary size allocation and avoid allocation fails due to high order allocation. [akpm@linux-foundation.org: use min_t()] Link: http://lkml.kernel.org/r/1541091607-27402-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
88349a2837 |
mm/slub.c: record final state of slub action in deactivate_slab()
If __cmpxchg_double_slab() fails and (l != m), current code records transition states of slub action. Update the action after __cmpxchg_double_slab() success to record the final state. [akpm@linux-foundation.org: more whitespace cleanup] Link: http://lkml.kernel.org/r/20181107013119.3816-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6159d0f5c0 |
mm/slub.c: page is always non-NULL in node_match()
node_match() is a static function and is only invoked in slub.c. In all three places, `page' is ensured to be valid. Link: http://lkml.kernel.org/r/20181106150245.1668-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1265ef2de4 |
mm/slub.c: remove validation on cpu_slab in __flush_cpu_slab()
cpu_slab is a per cpu variable which is allocated in all or none. If a cpu_slab failed to be allocated, the slub is not usable. We could use cpu_slab without validation in __flush_cpu_slab(). Link: http://lkml.kernel.org/r/20181103141218.22844-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
221d7da66c |
mm, slab: remove unnecessary unlikely()
WARN_ON() already contains an unlikely(), so it's not necessary to use unlikely. Also change WARN_ON() back to WARN_ON_ONCE() to avoid potentially spamming dmesg with user-triggerable large allocations. [akpm@linux-foundation.org: s/WARN_ON/WARN_ON_ONCE/, per Vlastimil] Link: http://lkml.kernel.org/r/20181104125028.3572-1-tiny.windzz@gmail.com Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
e886bf9d9a |
kasan: add SPDX-License-Identifier mark to source files
This patch adds a "SPDX-License-Identifier: GPL-2.0" mark to all source files under mm/kasan. Link: http://lkml.kernel.org/r/bce2d1e618afa5142e81961ab8fa4b4165337380.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
66afc7f1e0 |
kasan: add __must_check annotations to kasan hooks
This patch adds __must_check annotations to kasan hooks that return a pointer to make sure that a tagged pointer always gets propagated. Link: http://lkml.kernel.org/r/03b269c5e453945f724bfca3159d4e1333a8fb1c.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2813b9c029 |
kasan, mm, arm64: tag non slab memory allocated via pagealloc
Tag-based KASAN doesn't check memory accesses through pointers tagged with 0xff. When page_address is used to get pointer to memory that corresponds to some page, the tag of the resulting pointer gets set to 0xff, even though the allocated memory might have been tagged differently. For slab pages it's impossible to recover the correct tag to return from page_address, since the page might contain multiple slab objects tagged with different values, and we can't know in advance which one of them is going to get accessed. For non slab pages however, we can recover the tag in page_address, since the whole page was marked with the same tag. This patch adds tagging to non slab memory allocated with pagealloc. To set the tag of the pointer returned from page_address, the tag gets stored to page->flags when the memory gets allocated. Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7f94ffbc4c |
kasan: add hooks implementation for tag-based mode
This commit adds tag-based KASAN specific hooks implementation and adjusts common generic and tag-based KASAN ones. 1. When a new slab cache is created, tag-based KASAN rounds up the size of the objects in this cache to KASAN_SHADOW_SCALE_SIZE (== 16). 2. On each kmalloc tag-based KASAN generates a random tag, sets the shadow memory, that corresponds to this object to this tag, and embeds this tag value into the top byte of the returned pointer. 3. On each kfree tag-based KASAN poisons the shadow memory with a random tag to allow detection of use-after-free bugs. The rest of the logic of the hook implementation is very much similar to the one provided by generic KASAN. Tag-based KASAN saves allocation and free stack metadata to the slab object the same way generic KASAN does. Link: http://lkml.kernel.org/r/bda78069e3b8422039794050ddcb2d53d053ed41.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
5b7c414822 |
mm: move obj_to_index to include/linux/slab_def.h
While with SLUB we can actually preassign tags for caches with contructors and store them in pointers in the freelist, SLAB doesn't allow that since the freelist is stored as an array of indexes, so there are no pointers to store the tags. Instead we compute the tag twice, once when a slab is created before calling the constructor and then again each time when an object is allocated with kmalloc. Tag is computed simply by taking the lowest byte of the index that corresponds to the object. However in kasan_kmalloc we only have access to the objects pointer, so we need a way to find out which index this object corresponds to. This patch moves obj_to_index from slab.c to include/linux/slab_def.h to be reused by KASAN. Link: http://lkml.kernel.org/r/c02cd9e574cfd93858e43ac94b05e38f891fef64.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
121e8f81d3 |
kasan: add bug reporting routines for tag-based mode
This commit adds rountines, that print tag-based KASAN error reports. Those are quite similar to generic KASAN, the difference is: 1. The way tag-based KASAN finds the first bad shadow cell (with a mismatching tag). Tag-based KASAN compares memory tags from the shadow memory to the pointer tag. 2. Tag-based KASAN reports all bugs with the "KASAN: invalid-access" header. Also simplify generic KASAN find_first_bad_addr. Link: http://lkml.kernel.org/r/aee6897b1bd077732a315fd84c6b4f234dbfdfcb.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
11cd3cd69a |
kasan: split out generic_report.c from report.c
Move generic KASAN specific error reporting routines to generic_report.c without any functional changes, leaving common error reporting code in report.c to be later reused by tag-based KASAN. Link: http://lkml.kernel.org/r/ba48c32f8e5aefedee78998ccff0413bee9e0f5b.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
772a2fa50f |
kasan, mm: perform untagged pointers comparison in krealloc
The krealloc function checks where the same buffer was reused or a new one allocated by comparing kernel pointers. Tag-based KASAN changes memory tag on the krealloc'ed chunk of memory and therefore also changes the pointer tag of the returned pointer. Therefore we need to perform comparison on untagged (with tags reset) pointers to check whether it's the same memory region or not. Link: http://lkml.kernel.org/r/14f6190d7846186a3506cd66d82446646fe65090.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4d176711ea |
kasan: preassign tags to objects with ctors or SLAB_TYPESAFE_BY_RCU
An object constructor can initialize pointers within this objects based on the address of the object. Since the object address might be tagged, we need to assign a tag before calling constructor. The implemented approach is to assign tags to objects with constructors when a slab is allocated and call constructors once as usual. The downside is that such object would always have the same tag when it is reallocated, so we won't catch use-after-frees on it. Also pressign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since they can be validy accessed after having been freed. Link: http://lkml.kernel.org/r/f158a8a74a031d66f0a9398a5b0ed453c37ba09a.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
3c9e3aa110 |
kasan: add tag related helper functions
This commit adds a few helper functions, that are meant to be used to work with tags embedded in the top byte of kernel pointers: to set, to get or to reset the top byte. Link: http://lkml.kernel.org/r/f6c6437bb8e143bc44f42c3c259c62e734be7935.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
080eb83f54 |
kasan: initialize shadow to 0xff for tag-based mode
A tag-based KASAN shadow memory cell contains a memory tag, that corresponds to the tag in the top byte of the pointer, that points to that memory. The native top byte value of kernel pointers is 0xff, so with tag-based KASAN we need to initialize shadow memory to 0xff. [cai@lca.pw: arm64: skip kmemleak for KASAN again\ Link: http://lkml.kernel.org/r/20181226020550.63712-1-cai@lca.pw Link: http://lkml.kernel.org/r/5cc1b789aad7c99cf4f3ec5b328b147ad53edb40.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9577dd7486 |
kasan: rename kasan_zero_page to kasan_early_shadow_page
With tag based KASAN mode the early shadow value is 0xff and not 0x00, so this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion. Link: http://lkml.kernel.org/r/3fed313280ebf4f88645f5b89ccbc066d320e177.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2bd926b439 |
kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS
This commit splits the current CONFIG_KASAN config option into two: 1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one that exists now); 2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode. The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have another hardware tag-based KASAN mode, that will rely on hardware memory tagging support in arm64. With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to instrument kernel files with -fsantize=kernel-hwaddress (except the ones for which KASAN_SANITIZE := n is set). Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes. This commit also adds empty placeholder (for now) implementation of tag-based KASAN specific hooks inserted by the compiler and adjusts common hooks implementation. While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will enable once all the infrastracture code has been added. Link: http://lkml.kernel.org/r/b2550106eb8a68b10fefbabce820910b115aa853.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b938fcf427 |
kasan: rename source files to reflect the new naming scheme
We now have two KASAN modes: generic KASAN and tag-based KASAN. Rename kasan.c to generic.c to reflect that. Also rename kasan_init.c to init.c as it contains initialization code for both KASAN modes. Link: http://lkml.kernel.org/r/88c6fd2a883e459e6242030497230e5fb0d44d44.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
bffa986c6f |
kasan: move common generic and tag-based code to common.c
Tag-based KASAN reuses a significant part of the generic KASAN code, so move the common parts to common.c without any functional changes. Link: http://lkml.kernel.org/r/114064d002356e03bb8cc91f7835e20dc61b51d9.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
12b2238699 |
kasan, slub: handle pointer tags in early_kmem_cache_node_alloc
The previous patch updated KASAN hooks signatures and their usage in SLAB and SLUB code, except for the early_kmem_cache_node_alloc function. This patch handles that function separately, as it requires to reorder some of the initialization code to correctly propagate a tagged pointer in case a tag is assigned by kasan_kmalloc. Link: http://lkml.kernel.org/r/fc8d0fdcf733a7a52e8d0daaa650f4736a57de8c.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0116523cff |
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13. This patchset adds a new software tag-based mode to KASAN [1]. (Initially this mode was called KHWASAN, but it got renamed, see the naming rationale at the end of this section). The plan is to implement HWASan [2] for the kernel with the incentive, that it's going to have comparable to KASAN performance, but in the same time consume much less memory, trading that off for somewhat imprecise bug detection and being supported only for arm64. The underlying ideas of the approach used by software tag-based KASAN are: 1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store pointer tags in the top byte of each kernel pointer. 2. Using shadow memory, we can store memory tags for each chunk of kernel memory. 3. On each memory allocation, we can generate a random tag, embed it into the returned pointer and set the memory tags that correspond to this chunk of memory to the same value. 4. By using compiler instrumentation, before each memory access we can add a check that the pointer tag matches the tag of the memory that is being accessed. 5. On a tag mismatch we report an error. With this patchset the existing KASAN mode gets renamed to generic KASAN, with the word "generic" meaning that the implementation can be supported by any architecture as it is purely software. The new mode this patchset adds is called software tag-based KASAN. The word "tag-based" refers to the fact that this mode uses tags embedded into the top byte of kernel pointers and the TBI arm64 CPU feature that allows to dereference such pointers. The word "software" here means that shadow memory manipulation and tag checking on pointer dereference is done in software. As it is the only tag-based implementation right now, "software tag-based" KASAN is sometimes referred to as simply "tag-based" in this patchset. A potential expansion of this mode is a hardware tag-based mode, which would use hardware memory tagging support (announced by Arm [3]) instead of compiler instrumentation and manual shadow memory manipulation. Same as generic KASAN, software tag-based KASAN is strictly a debugging feature. [1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html [3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a ====== Rationale On mobile devices generic KASAN's memory usage is significant problem. One of the main reasons to have tag-based KASAN is to be able to perform a similar set of checks as the generic one does, but with lower memory requirements. Comment from Vishwath Mohan <vishwath@google.com>: I don't have data on-hand, but anecdotally both ASAN and KASAN have proven problematic to enable for environments that don't tolerate the increased memory pressure well. This includes (a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go, (c) Connected components like Pixel's visual core [1]. These are both places I'd love to have a low(er) memory footprint option at my disposal. Comment from Evgenii Stepanov <eugenis@google.com>: Looking at a live Android device under load, slab (according to /proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's overhead of 2x - 3x on top of it is not insignificant. Not having this overhead enables near-production use - ex. running KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do not reproduce in test configuration. These are the ones that often cost the most engineering time to track down. CPU overhead is bad, but generally tolerable. RAM is critical, in our experience. Once it gets low enough, OOM-killer makes your life miserable. [1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/ ====== Technical details Software tag-based KASAN mode is implemented in a very similar way to the generic one. This patchset essentially does the following: 1. TCR_TBI1 is set to enable Top Byte Ignore. 2. Shadow memory is used (with a different scale, 1:16, so each shadow byte corresponds to 16 bytes of kernel memory) to store memory tags. 3. All slab objects are aligned to shadow scale, which is 16 bytes. 4. All pointers returned from the slab allocator are tagged with a random tag and the corresponding shadow memory is poisoned with the same value. 5. Compiler instrumentation is used to insert tag checks. Either by calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE flags are reused). 6. When a tag mismatch is detected in callback instrumentation mode KASAN simply prints a bug report. In case of inline instrumentation, clang inserts a brk instruction, and KASAN has it's own brk handler, which reports the bug. 7. The memory in between slab objects is marked with a reserved tag, and acts as a redzone. 8. When a slab object is freed it's marked with a reserved tag. Bug detection is imprecise for two reasons: 1. We won't catch some small out-of-bounds accesses, that fall into the same shadow cell, as the last byte of a slab object. 2. We only have 1 byte to store tags, which means we have a 1/256 probability of a tag match for an incorrect access (actually even slightly less due to reserved tag values). Despite that there's a particular type of bugs that tag-based KASAN can detect compared to generic KASAN: use-after-free after the object has been allocated by someone else. ====== Testing Some kernel developers voiced a concern that changing the top byte of kernel pointers may lead to subtle bugs that are difficult to discover. To address this concern deliberate testing has been performed. It doesn't seem feasible to do some kind of static checking to find potential issues with pointer tagging, so a dynamic approach was taken. All pointer comparisons/subtractions have been instrumented in an LLVM compiler pass and a kernel module that would print a bug report whenever two pointers with different tags are being compared/subtracted (ignoring comparisons with NULL pointers and with pointers obtained by casting an error code to a pointer type) has been used. Then the kernel has been booted in QEMU and on an Odroid C2 board and syzkaller has been run. This yielded the following results. The two places that look interesting are: is_vmalloc_addr in include/linux/mm.h is_kernel_rodata in mm/util.c Here we compare a pointer with some fixed untagged values to make sure that the pointer lies in a particular part of the kernel address space. Since tag-based KASAN doesn't add tags to pointers that belong to rodata or vmalloc regions, this should work as is. To make sure debug checks to those two functions that check that the result doesn't change whether we operate on pointers with or without untagging has been added. A few other cases that don't look that interesting: Comparing pointers to achieve unique sorting order of pointee objects (e.g. sorting locks addresses before performing a double lock): tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c pipe_double_lock in fs/pipe.c unix_state_double_lock in net/unix/af_unix.c lock_two_nondirectories in fs/inode.c mutex_lock_double in kernel/events/core.c ep_cmp_ffd in fs/eventpoll.c fsnotify_compare_groups fs/notify/mark.c Nothing needs to be done here, since the tags embedded into pointers don't change, so the sorting order would still be unique. Checks that a pointer belongs to some particular allocation: is_sibling_entry in lib/radix-tree.c object_is_on_stack in include/linux/sched/task_stack.h Nothing needs to be done here either, since two pointers can only belong to the same allocation if they have the same tag. Overall, since the kernel boots and works, there are no critical bugs. As for the rest, the traditional kernel testing way (use until fails) is the only one that looks feasible. Another point here is that tag-based KASAN is available under a separate config option that needs to be deliberately enabled. Even though it might be used in a "near-production" environment to find bugs that are not found during fuzzing or running tests, it is still a debug tool. ====== Benchmarks The following numbers were collected on Odroid C2 board. Both generic and tag-based KASAN were used in inline instrumentation mode. Boot time [1]: * ~1.7 sec for clean kernel * ~5.0 sec for generic KASAN * ~5.0 sec for tag-based KASAN Network performance [2]: * 8.33 Gbits/sec for clean kernel * 3.17 Gbits/sec for generic KASAN * 2.85 Gbits/sec for tag-based KASAN Slab memory usage after boot [3]: * ~40 kb for clean kernel * ~105 kb (~260% overhead) for generic KASAN * ~47 kb (~20% overhead) for tag-based KASAN KASAN memory overhead consists of three main parts: 1. Increased slab memory usage due to redzones. 2. Shadow memory (the whole reserved once during boot). 3. Quaratine (grows gradually until some preset limit; the more the limit, the more the chance to detect a use-after-free). Comparing tag-based vs generic KASAN for each of these points: 1. 20% vs 260% overhead. 2. 1/16th vs 1/8th of physical memory. 3. Tag-based KASAN doesn't require quarantine. [1] Time before the ext4 driver is initialized. [2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`. [3] Measured as `cat /proc/meminfo | grep Slab`. ====== Some notes A few notes: 1. The patchset can be found here: https://github.com/xairy/kasan-prototype/tree/khwasan 2. Building requires a recent Clang version (7.0.0 or later). 3. Stack instrumentation is not supported yet and will be added later. This patch (of 25): Tag-based KASAN changes the value of the top byte of pointers returned from the kernel allocation functions (such as kmalloc). This patch updates KASAN hooks signatures and their usage in SLAB and SLUB code to reflect that. Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
792bf4d871 |
Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar: "The biggest RCU changes in this cycle were: - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar. - Replace calls of RCU-bh and RCU-sched update-side functions to their vanilla RCU counterparts. This series is a step towards complete removal of the RCU-bh and RCU-sched update-side functions. ( Note that some of these conversions are going upstream via their respective maintainers. ) - Documentation updates, including a number of flavor-consolidation updates from Joel Fernandes. - Miscellaneous fixes. - Automate generation of the initrd filesystem used for rcutorture testing. - Convert spin_is_locked() assertions to instead use lockdep. ( Note that some of these conversions are going upstream via their respective maintainers. ) - SRCU updates, especially including a fix from Dennis Krein for a bag-on-head-class bug. - RCU torture-test updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits) rcutorture: Don't do busted forward-progress testing rcutorture: Use 100ms buckets for forward-progress callback histograms rcutorture: Recover from OOM during forward-progress tests rcutorture: Print forward-progress test age upon failure rcutorture: Print time since GP end upon forward-progress failure rcutorture: Print histogram of CB invocation at OOM time rcutorture: Print GP age upon forward-progress failure rcu: Print per-CPU callback counts for forward-progress failures rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings rcutorture: Dump grace-period diagnostics upon forward-progress OOM rcutorture: Prepare for asynchronous access to rcu_fwd_startat torture: Remove unnecessary "ret" variables rcutorture: Affinity forward-progress test to avoid housekeeping CPUs rcutorture: Break up too-long rcu_torture_fwd_prog() function rcutorture: Remove cbflood facility torture: Bring any extra CPUs online during kernel startup rcutorture: Add call_rcu() flooding forward-progress tests rcutorture/formal: Replace synchronize_sched() with synchronize_rcu() tools/kernel.h: Replace synchronize_sched() with synchronize_rcu() net/decnet: Replace rcu_barrier_bh() with rcu_barrier() ... |
||
|
5694cecdb0 |
arm64 festive updates for 4.21
In the end, we ended up with quite a lot more than I expected: - Support for ARMv8.3 Pointer Authentication in userspace (CRIU and kernel-side support to come later) - Support for per-thread stack canaries, pending an update to GCC that is currently undergoing review - Support for kexec_file_load(), which permits secure boot of a kexec payload but also happens to improve the performance of kexec dramatically because we can avoid the sucky purgatory code from userspace. Kdump will come later (requires updates to libfdt). - Optimisation of our dynamic CPU feature framework, so that all detected features are enabled via a single stop_machine() invocation - KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that they can benefit from global TLB entries when KASLR is not in use - 52-bit virtual addressing for userspace (kernel remains 48-bit) - Patch in LSE atomics for per-cpu atomic operations - Custom preempt.h implementation to avoid unconditional calls to preempt_schedule() from preempt_enable() - Support for the new 'SB' Speculation Barrier instruction - Vectorised implementation of XOR checksumming and CRC32 optimisations - Workaround for Cortex-A76 erratum #1165522 - Improved compatibility with Clang/LLD - Support for TX2 system PMUS for profiling the L3 cache and DMC - Reflect read-only permissions in the linear map by default - Ensure MMIO reads are ordered with subsequent calls to Xdelay() - Initial support for memory hotplug - Tweak the threshold when we invalidate the TLB by-ASID, so that mremap() performance is improved for ranges spanning multiple PMDs. - Minor refactoring and cleanups -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABCgAGBQJcE4TmAAoJELescNyEwWM0Nr0H/iaU7/wQSzHyNXtZoImyKTul Blu2ga4/EqUrTU7AVVfmkl/3NBILWlgQVpY6tH6EfXQuvnxqD7CizbHyLdyO+z0S B5PsFUH2GLMNAi48AUNqGqkgb2knFbg+T+9IimijDBkKg1G/KhQnRg6bXX32mLJv Une8oshUPBVJMsHN1AcQknzKariuoE3u0SgJ+eOZ9yA2ZwKxP4yy1SkDt3xQrtI0 lojeRjxcyjTP1oGRNZC+BWUtGOT35p7y6cGTnBd/4TlqBGz5wVAJUcdoxnZ6JYVR O8+ob9zU+4I0+SKt80s7pTLqQiL9rxkKZ5joWK1pr1g9e0s5N5yoETXKFHgJYP8= =sYdt -----END PGP SIGNATURE----- Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 festive updates from Will Deacon: "In the end, we ended up with quite a lot more than I expected: - Support for ARMv8.3 Pointer Authentication in userspace (CRIU and kernel-side support to come later) - Support for per-thread stack canaries, pending an update to GCC that is currently undergoing review - Support for kexec_file_load(), which permits secure boot of a kexec payload but also happens to improve the performance of kexec dramatically because we can avoid the sucky purgatory code from userspace. Kdump will come later (requires updates to libfdt). - Optimisation of our dynamic CPU feature framework, so that all detected features are enabled via a single stop_machine() invocation - KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that they can benefit from global TLB entries when KASLR is not in use - 52-bit virtual addressing for userspace (kernel remains 48-bit) - Patch in LSE atomics for per-cpu atomic operations - Custom preempt.h implementation to avoid unconditional calls to preempt_schedule() from preempt_enable() - Support for the new 'SB' Speculation Barrier instruction - Vectorised implementation of XOR checksumming and CRC32 optimisations - Workaround for Cortex-A76 erratum #1165522 - Improved compatibility with Clang/LLD - Support for TX2 system PMUS for profiling the L3 cache and DMC - Reflect read-only permissions in the linear map by default - Ensure MMIO reads are ordered with subsequent calls to Xdelay() - Initial support for memory hotplug - Tweak the threshold when we invalidate the TLB by-ASID, so that mremap() performance is improved for ranges spanning multiple PMDs. - Minor refactoring and cleanups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (125 commits) arm64: kaslr: print PHYS_OFFSET in dump_kernel_offset() arm64: sysreg: Use _BITUL() when defining register bits arm64: cpufeature: Rework ptr auth hwcaps using multi_entry_cap_matches arm64: cpufeature: Reduce number of pointer auth CPU caps from 6 to 4 arm64: docs: document pointer authentication arm64: ptr auth: Move per-thread keys from thread_info to thread_struct arm64: enable pointer authentication arm64: add prctl control for resetting ptrauth keys arm64: perf: strip PAC when unwinding userspace arm64: expose user PAC bit positions via ptrace arm64: add basic pointer authentication support arm64/cpufeature: detect pointer authentication arm64: Don't trap host pointer auth use to EL2 arm64/kvm: hide ptrauth from guests arm64/kvm: consistently handle host HCR_EL2 flags arm64: add pointer authentication register bits arm64: add comments about EC exception levels arm64: perf: Treat EXCLUDE_EL* bit definitions as unsigned arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field arm64: enable per-task stack canaries ... |
||
|
4971f090aa |
drm pull request for 4.21-rc1
-----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJcExwOAAoJEAx081l5xIa+euIP/1NZZvSB+bsCtOwDG8I6uWsS OU5JUZ8q2dqyyFagRxzlkeSt3uWJqKp5NyNwuc9z/5u6AGF+3/97D0J1lG6Os/st 4abF6NadivYJ4cXhJ1ddIHOFMVDcAsyMWNDb93NwPwncCsQ0jt5FFOsrCyj6BGY+ ihHFlHrIyDrbBGDHz+u1E/EO5WkNnaLDoC+/k2fTRWCNI3bQL3O+orsYTI6S2uvU lQJnRfYAllgLD2p1k/rrBHcHXBv50roR0e8uhGmbdhGdp5bEW30UGBLHXxQjjSVy fQCwFwTO8X6zoxU53Zbbk+MVrp+jkTHcGKViHRuLkaHzE5mX26UXDwlXdN32ZUbK yHOJp+uDaWXX7MIz0LsB9Iqj2+eIUoFaIJMoZTMGVTNvqnTxKnoHnjAtbTH2u258 teFgmy4BIgPgo2kwEnBEZjCapou0Eivyut2wq8bTAB2Fe8LwURJpr3cioTtMLlUO L5/PoD27eFvBCAeFrQIwF3b2XiQEnBpXocmilEwP1xDMPgoyeePAfIF2iEpDvi0U jce3rLd2yVvo92xYUgoHkVTD8si/pKKnZ1D0U3+RI6pxK6s0HJEHjcNEMdvdm+2S 4qgvBQV3wlWFkXEK8PR5BHPoLntg18tKon/BTLBjgGkN9E1o9fWs1/s6KQGY4xdo l3Vvfx2LTdkgEoBssSwB =wh4W -----END PGP SIGNATURE----- Merge tag 'drm-next-2018-12-14' of git://anongit.freedesktop.org/drm/drm Pull drm updates from Dave Airlie: "Core: - shared fencing staging removal - drop transactional atomic helpers and move helpers to new location - DP/MST atomic cleanup - Leasing cleanups and drop EXPORT_SYMBOL - Convert drivers to atomic helpers and generic fbdev. - removed deprecated obj_ref/unref in favour of get/put - Improve dumb callback documentation - MODESET_LOCK_BEGIN/END helpers panels: - CDTech panels, Banana Pi Panel, DLC1010GIG, - Olimex LCD-O-LinuXino, Samsung S6D16D0, Truly NT35597 WQXGA, - Himax HX8357D, simulated RTSM AEMv8. - GPD Win2 panel - AUO G101EVN010 vgem: - render node support ttm: - move global init out of drivers - fix LRU handling for ghost objects - Support for simultaneous submissions to multiple engines scheduler: - timeout/fault handling changes to help GPU recovery - helpers for hw with preemption support i915: - Scaler/Watermark fixes - DP MST + powerwell fixes - PSR fixes - Break long get/put shmemfs pages - Icelake fixes - Icelake DSI video mode enablement - Engine workaround improvements amdgpu: - freesync support - GPU reset enabled on CI, VI, SOC15 dGPUs - ABM support in DC - KFD support for vega12/polaris12 - SDMA paging queue on vega - More amdkfd code sharing - DCC scanout on GFX9 - DC kerneldoc - Updated SMU firmware for GFX8 chips - XGMI PSP + hive reset support - GPU reset - DC trace support - Powerplay updates for newer Polaris - Cursor plane update fast path - kfd dma-buf support virtio-gpu: - add EDID support vmwgfx: - pageflip with damage support nouveau: - Initial Turing TU104/TU106 modesetting support msm: - a2xx gpu support for apq8060 and imx5 - a2xx gpummu support - mdp4 display support for apq8060 - DPU fixes and cleanups - enhanced profiling support - debug object naming interface - get_iova/page pinning decoupling tegra: - Tegra194 host1x, VIC and display support enabled - Audio over HDMI for Tegra186 and Tegra194 exynos: - DMA/IOMMU refactoring - plane alpha + blend mode support - Color format fixes for mixer driver rcar-du: - R8A7744 and R8A77470 support - R8A77965 LVDS support imx: - fbdev emulation fix - multi-tiled scalling fixes - SPDX identifiers rockchip - dw_hdmi support - dw-mipi-dsi + dual dsi support - mailbox read size fix qxl: - fix cursor pinning vc4: - YUV support (scaling + cursor) v3d: - enable TFU (Texture Formatting Unit) mali-dp: - add support for linear tiled formats sun4i: - Display Engine 3 support - H6 DE3 mixer 0 support - H6 display engine support - dw-hdmi support - H6 HDMI phy support - implicit fence waiting - BGRX8888 support meson: - Overlay plane support - implicit fence waiting - HDMI 1.4 4k modes bridge: - i2c fixes for sii902x" * tag 'drm-next-2018-12-14' of git://anongit.freedesktop.org/drm/drm: (1403 commits) drm/amd/display: Add fast path for cursor plane updates drm/amdgpu: Enable GPU recovery by default for CI drm/amd/display: Fix duplicating scaling/underscan connector state drm/amd/display: Fix unintialized max_bpc state values Revert "drm/amd/display: Set RMX_ASPECT as default" drm/amdgpu: Fix stub function name drm/msm/dpu: Fix clock issue after bind failure drm/msm/dpu: Clean up dpu_media_info.h static inline functions drm/msm/dpu: Further cleanups for static inline functions drm/msm/dpu: Cleanup the debugfs functions drm/msm/dpu: Remove dpu_irq and unused functions drm/msm: Make irq_postinstall optional drm/msm/dpu: Cleanup callers of dpu_hw_blk_init drm/msm/dpu: Remove unused functions drm/msm/dpu: Remove dpu_crtc_is_enabled() drm/msm/dpu: Remove dpu_crtc_get_mixer_height drm/msm/dpu: Remove dpu_dbg drm/msm: dpu: Remove crtc_lock drm/msm: dpu: Remove vblank_requested flag from dpu_crtc drm/msm: dpu: Separate crtc assignment from vblank enable ... |
||
|
17e2e7d7e1 |
mm, page_alloc: fix has_unmovable_pages for HugePages
While playing with gigantic hugepages and memory_hotplug, I triggered the following #PF when "cat memoryX/removable": BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1481 Comm: cat Tainted: G E 4.20.0-rc6-mm1-1-default+ #18 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:has_unmovable_pages+0x154/0x210 Call Trace: is_mem_section_removable+0x7d/0x100 removable_show+0x90/0xb0 dev_attr_show+0x1c/0x50 sysfs_kf_seq_show+0xca/0x1b0 seq_read+0x133/0x380 __vfs_read+0x26/0x180 vfs_read+0x89/0x140 ksys_read+0x42/0x90 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is we do not pass the Head to page_hstate(), and so, the call to compound_order() in page_hstate() returns 0, so we end up checking all hstates's size to match PAGE_SIZE. Obviously, we do not find any hstate matching that size, and we return NULL. Then, we dereference that NULL pointer in hugepage_migration_supported() and we got the #PF from above. Fix that by getting the head page before calling page_hstate(). Also, since gigantic pages span several pageblocks, re-adjust the logic for skipping pages. While are it, we can also get rid of the round_up(). [osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal] Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2e83ee1d86 |
mm: thp: fix flags for pmd migration when split
When splitting a huge migrating PMD, we'll transfer all the existing PMD bits and apply them again onto the small PTEs. However we are fetching the bits unconditionally via pmd_soft_dirty(), pmd_write() or pmd_yound() while actually they don't make sense at all when it's a migration entry. Fix them up. Since at it, drop the ifdef together as not needed. Note that if my understanding is correct about the problem then if without the patch there is chance to lose some of the dirty bits in the migrating pmd pages (on x86_64 we're fetching bit 11 which is part of swap offset instead of bit 2) and it could potentially corrupt the memory of an userspace program which depends on the dirty bit. Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2830bf6f05 |
mm, memory_hotplug: initialize struct pages for the full memory section
If memory end is not aligned with the sparse memory section boundary, the mapping of such a section is only partly initialized. This may lead to VM_BUG_ON due to uninitialized struct page access from is_mem_section_removable() or test_pages_in_a_zone() function triggered by memory_hotplug sysfs handlers: Here are the the panic examples: CONFIG_DEBUG_VM=y CONFIG_DEBUG_VM_PGFLAGS=y kernel parameter mem=2050M -------------------------- page:000003d082008000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: ( test_pages_in_a_zone+0xde/0x160) show_valid_zones+0x5c/0x190 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: test_pages_in_a_zone+0xde/0x160 Kernel panic - not syncing: Fatal exception: panic_on_oops kernel parameter mem=3075M -------------------------- page:000003d08300c000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: ( is_mem_section_removable+0xb4/0x190) show_mem_removable+0x9a/0xd8 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: is_mem_section_removable+0xb4/0x190 Kernel panic - not syncing: Fatal exception: panic_on_oops Fix the problem by initializing the last memory section of each zone in memmap_init_zone() till the very end, even if it goes beyond the zone end. Michal said: : This has alwways been problem AFAIU. It just went unnoticed because we : have zeroed memmaps during allocation before |
||
|
f496990f1f |
slab: make kmem_cache_create{_usercopy} description proper kernel-doc
Add the description for kmem_cache_create, fixup the return value paragraph and make both kmem_cache_create and add the second '*' to the comment opening. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> |
||
|
6ab7d47bcb |
percpu: convert spin_lock_irq to spin_lock_irqsave.
From Michael Cree: "Bisection lead to commit |
||
|
8ace22bce8 |
hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page()
A stack trace was triggered by VM_BUG_ON_PAGE(page_mapcount(page), page) in free_huge_page(). Unfortunately, the page->mapping field was set to NULL before this test. This made it more difficult to determine the root cause of the problem. Move the VM_BUG_ON_PAGE tests earlier in the function so that if they do trigger more information is present in the page struct. Link: http://lkml.kernel.org/r/1543491843-23438-1-git-send-email-nic_w@163.com Signed-off-by: Yongkai Wu <nic_w@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f5a222dc2f |
memblock: annotate memblock_is_reserved() with __init_memblock
Found warning: WARNING: EXPORT symbol "gsi_write_channel_scratch" [vmlinux] version generation failed, symbol will not be versioned. WARNING: vmlinux.o(.text+0x1e0a0): Section mismatch in reference from the function valid_phys_addr_range() to the function .init.text:memblock_is_reserved() The function valid_phys_addr_range() references the function __init memblock_is_reserved(). This is often because valid_phys_addr_range lacks a __init annotation or the annotation of memblock_is_reserved is wrong. Use __init_memblock instead of __init. Link: http://lkml.kernel.org/r/BLUPR13MB02893411BF12EACB61888E80DFAE0@BLUPR13MB0289.namprd13.prod.outlook.com Signed-off-by: Yueyi Li <liyueyi@live.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9def36e0fa |
mm/sparse: add common helper to mark all memblocks present
Presently the arches arm64, arm and sh have a function which loops through each memblock and calls memory present. riscv will require a similar function. Introduce a common memblocks_present() function that can be used by all the arches. Subsequent patches will cleanup the arches that make use of this. Link: http://lkml.kernel.org/r/20181107205433.3875-3-logang@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
880b9df1bf |
XArray updates for 4.20-rc7
Two bugfixes, each with test-suite updates, two improvements to the test-suite without associated bugs, and one patch adding a missing API. -----BEGIN PGP SIGNATURE----- iQFIBAABCgAyFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAlwS8ZUUHHdpbGx5QGlu ZnJhZGVhZC5vcmcACgkQDpNsjXcpgj5h0wf9Fmc3z3WjmX05he+XKhGq1jQuHYVi zt8Eggsc7ns1hX8xPdwSw240CDOCBcbXxCyNL9KFCqlIkfxTAe8pYgoTDKuXhVAK U7VTCHCxJpsYzfhkEke5DaASGb/YP1kmvoTJs7qCfhBuI9ERXLVK6cESJNDZhlMA /d7VfRwRiqSLnK13AXPZAA9Pnw2GtAolMDU9CC9nOtMRlRDVwsQiwYiQ/mBRYK00 u0LoruwBJ7XAoe7Bo1CFmkvJuIV794cmhqkEY2cY85e9aoj15+BDqOu1la8DTaOl e7+7PwK1I6Ed6DfPixGleUP7BYHHXCfb/RVEYn22qGC/YHUQRtpbwrY37Q== =b+pK -----END PGP SIGNATURE----- Merge tag 'xarray-4.20-rc7' of git://git.infradead.org/users/willy/linux-dax Pull XArray fixes from Matthew Wilcox: "Two bugfixes, each with test-suite updates, two improvements to the test-suite without associated bugs, and one patch adding a missing API" * tag 'xarray-4.20-rc7' of git://git.infradead.org/users/willy/linux-dax: XArray: Fix xa_alloc when id exceeds max XArray tests: Check iterating over multiorder entries XArray tests: Handle larger indices more elegantly XArray: Add xa_cmpxchg_irq and xa_cmpxchg_bh radix tree: Don't return retry entries from lookup |
||
|
f6795053da |
mm: mmap: Allow for "high" userspace addresses
This patch adds support for "high" userspace addresses that are optionally supported on the system and have to be requested via a hint mechanism ("high" addr parameter to mmap). Architectures such as powerpc and x86 achieve this by making changes to their architectural versions of arch_get_unmapped_* functions. However, on arm64 we use the generic versions of these functions. Rather than duplicate the generic arch_get_unmapped_* implementations for arm64, this patch instead introduces two architectural helper macros and applies them to arch_get_unmapped_*: arch_get_mmap_end(addr) - get mmap upper limit depending on addr hint arch_get_mmap_base(addr, base) - get mmap_base depending on addr hint If these macros are not defined in architectural code then they default to (TASK_SIZE) and (base) so should not introduce any behavioural changes to architectures that do not define them. Signed-off-by: Steve Capper <steve.capper@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Will Deacon <will.deacon@arm.com> |
||
|
96f774106e |
Linux 4.20-rc6
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwNpb0eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwGwH/00UHnXfxww3ixxz zwTVDzptA6SPm6s84yJOWatM5fXhPiAltZaHSYF9lzRzNU71NCq7Frhq3fQUIXKM OxqDn9nfSTWcjWTk2q5n2keyRV/KIn67YX7UgqFc1bO/mqtVjEgNWaMyblhI+e9E giu1ZXayHr43jK1cDOmGExZubXUq7Vsc9TOlrd+d2SwIqeEP7TCMrPhnHDwCNvX2 UU5dtANpVzGtHaBcr37wJj+L8kODCc0f+PQ3g2ar5jTHst5SLlHp2u0AMRnUmgdi VkGx+mu/uk8mtwUqMIMqhplklVoqK6LTeLqsY5Xt32SKruw9UqyJGdphLjW2QP/g MkmA1lI= =7kaD -----END PGP SIGNATURE----- Merge tag 'v4.20-rc6' into for-4.21/block Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the two corruption fixes. We're going to be overhauling the direct dispatch path, and we need to do that on top of the changes we made for that in mainline. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
fa82dcbf2a |
dax fixes 4.20-rc6
* Fix the Xarray conversion of fsdax to properly handle dax_lock_mapping_entry() in the presense of pmd entries. * Fix inode destruction racing a new lock request. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJcDLC5AAoJEB7SkWpmfYgC0yAP/2NULGPapIoR2aw7GKPOStHR tOMGE8+ZDtpd19oVoEiB3PuhMkiCj5Pkyt2ka9eUXSk3VR/N6EtGwBANFUO+0tEh yU8nca2T2G1f+MsZbraWA9zj9aPnfpLt46r7p/AjIB2j1/vyXkYMkXmkXtCvMEUi 3Q9dVsT53fWncJBlTe/WW84wWQp77VsVafN4/UP75dv8cN9F+u91P1xvUXu2AKus RBqjlp6G6xMZ9HcWCQmStdPCi+qbO4k3oTZcohd/DYLb70ZL1kLoEMOjK2/3GnaV YMHZGWRAK5jroDZbdXXnJHE+4/opZpSWhW/J1WFEJUSSr7YmZsWhwegmrBlQ5cp3 V0fZ7LjBWjgtUKsjlmfmu4ccLY87LzVlfIEtJH2+QNiLy/Pl/O0P/Jzg9kHVLEgo zph2yIQlx3yshej8EEBPylDJVtYVzOyjLNLx7r3bL6AzQv+6CKqjxtYOiIDwYQsA Db1InfeDzAAoH6NYSbqN+yG01Bz7zeCYT7Ch4rYSKcA1i8vUMR6iObI32fC3xJQm +TxYmqngrJaU8XOMikIem/qGn4fBpYFULrIdK7lj/WmIxQq8Hp0mqz1ikMIoErzN ZXV5RcbbvSSZi5EOrrVKbr2IfIcx+HYo/bIAkRCR69o8bcIZnLPHDky/7gl7GS/5 sc9A/kysseeF45L2+6QG =kUil -----END PGP SIGNATURE----- Merge tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax fixes from Dan Williams: "The last of the known regression fixes and fallout from the Xarray conversion of the filesystem-dax implementation. On the path to debugging why the dax memory-failure injection test started failing after the Xarray conversion a couple more fixes for the dax_lock_mapping_entry(), now called dax_lock_page(), surfaced. Those plus the bug that started the hunt are now addressed. These patches have appeared in a -next release with no issues reported. Note the touches to mm/memory-failure.c are just the conversion to the new function signature for dax_lock_page(). Summary: - Fix the Xarray conversion of fsdax to properly handle dax_lock_mapping_entry() in the presense of pmd entries - Fix inode destruction racing a new lock request" * tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Fix unlock mismatch with updated API dax: Don't access a freed inode dax: Check page->mapping isn't NULL |
||
|
356ff8a9a7 |
Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"
This reverts commit |
||
|
6a7f6d86a5 |
blkcg: associate a blkg for pages being evicted by swap
A prior patch in this series added blkg association to bios issued by cgroups. There are two other paths that we want to attribute work back to the appropriate cgroup: swap and writeback. Here we modify the way swap tags bios to include the blkg. Writeback will be tackle in the next patch. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
55f3f7eab7 |
XArray: Add xa_cmpxchg_irq and xa_cmpxchg_bh
These convenience wrappers match the other _irq and _bh wrappers we already have. It turns out I'd already open-coded xa_cmpxchg_irq() in the shmem code, so convert that. Signed-off-by: Matthew Wilcox <willy@infradead.org> |
||
|
2f0799a0ff |
mm, thp: restore node-local hugepage allocations
This is a full revert of |
||
|
27359fd6e5 |
dax: Fix unlock mismatch with updated API
Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.
In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry it is possible that the pfn
performed to do the lookup is misaligned to the value retrieved in the
Xarray.
Change the api to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:
WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
RIP: 0010:dax_insert_entry+0x2b2/0x2d0
[..]
Call Trace:
dax_iomap_pte_fault.isra.41+0x791/0xde0
ext4_dax_huge_fault+0x16f/0x1f0
? up_read+0x1c/0xa0
__do_fault+0x1f/0x160
__handle_mm_fault+0x1033/0x1490
handle_mm_fault+0x18b/0x3d0
Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes:
|
||
|
89d04ec349 |
Linux 4.20-rc5
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwEZdIeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAlQH/19oax2Za3IPqF4X DM3lal5M6zlUVkoYstqzpbR3MqUwgEnMfvoeMDC6mI9N4/+r2LkV7cRR8HzqQCCS jDfD69IzRGb52VSeJmbOrkxBWsR1Nn0t4Z3rEeLPxwaOoNpRc8H973MbAQ2FKMpY S4Y3jIK1dNiRRxdh52NupVkQF+djAUwkBuVk/rrvRJmTDij4la03cuCDAO+Di9lt GHlVvygKw2SJhDR+z3ArwZNmE0ceCcE6+W7zPHzj2KeWuKrZg22kfUD454f2YEIw FG0hu9qecgtpYCkLSm2vr4jQzmpsDoyq3ZfwhjGrP4qtvPC3Db3vL3dbQnkzUcJu JtwhVCE= =O1q1 -----END PGP SIGNATURE----- Merge tag 'v4.20-rc5' into for-4.21/block Pull in v4.20-rc5, solving a conflict we'll otherwise get in aio.c and also getting the merge fix that went into mainline that users are hitting testing for-4.21/block and/or for-next. * tag 'v4.20-rc5': (664 commits) Linux 4.20-rc5 PCI: Fix incorrect value returned from pcie_get_speed_cap() MAINTAINERS: Update linux-mips mailing list address ocfs2: fix potential use after free mm/khugepaged: fix the xas_create_range() error path mm/khugepaged: collapse_shmem() do not crash on Compound mm/khugepaged: collapse_shmem() without freezing new_page mm/khugepaged: minor reorderings in collapse_shmem() mm/khugepaged: collapse_shmem() remember to clear holes mm/khugepaged: fix crashes due to misaccounted holes mm/khugepaged: collapse_shmem() stop if punched or truncated mm/huge_memory: fix lockdep complaint on 32-bit i_size_read() mm/huge_memory: splitting set mapping+index before unfreeze mm/huge_memory: rename freeze_page() to unmap_page() initramfs: clean old path before creating a hardlink kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace psi: make disabling/enabling easier for vendor kernels proc: fixup map_files test on arm debugobjects: avoid recursive calls with kmemleak userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set ... |
||
|
4bbfd7467c |
Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney: - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar. - Replace calls of RCU-bh and RCU-sched update-side functions to their vanilla RCU counterparts. This series is a step towards complete removal of the RCU-bh and RCU-sched update-side functions. ( Note that some of these conversions are going upstream via their respective maintainers. ) - Documentation updates, including a number of flavor-consolidation updates from Joel Fernandes. - Miscellaneous fixes. - Automate generation of the initrd filesystem used for rcutorture testing. - Convert spin_is_locked() assertions to instead use lockdep. ( Note that some of these conversions are going upstream via their respective maintainers. ) - SRCU updates, especially including a fix from Dennis Krein for a bag-on-head-class bug. - RCU torture-test updates. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
eaaf055f27 |
Merge branches 'bug.2018.11.12a', 'consolidate.2018.12.01a', 'doc.2018.11.12a', 'fixes.2018.11.12a', 'initrd.2018.11.08b', 'sil.2018.11.12a' and 'srcu.2018.11.27a' into HEAD
bug.2018.11.12a: Get rid of BUG_ON() and friends consolidate.2018.12.01a: Continued RCU flavor-consolidation cleanup doc.2018.11.12a: Documentation updates fixes.2018.11.12a: Miscellaneous fixes initrd.2018.11.08b: Automate creation of rcutorture initrd sil.2018.11.12a: Remove more spin_unlock_wait() calls |
||
|
95feeabb77 |
mm/khugepaged: fix the xas_create_range() error path
collapse_shmem()'s xas_nomem() is very unlikely to fail, but it is
rightly given a failure path, so move the whole xas_create_range() block
up before __SetPageLocked(new_page): so that it does not need to
remember to unlock_page(new_page).
Add the missing mem_cgroup_cancel_charge(), and set (currently unused)
result to SCAN_FAIL rather than SCAN_SUCCEED.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261531200.2275@eggly.anvils
Fixes:
|
||
|
06a5e1268a |
mm/khugepaged: collapse_shmem() do not crash on Compound
collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
it holds page lock of the first page, racing truncation then extension
might conceivably have inserted a hugepage there already. Fail with the
SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
otherwise mishandling the unexpected hugepage - though later we might
code up a more constructive way of handling it, with SCAN_SUCCESS.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
Fixes:
|
||
|
87c460a0bd |
mm/khugepaged: collapse_shmem() without freezing new_page
khugepaged's collapse_shmem() does almost all of its work, to assemble
the huge new_page from 512 scattered old pages, with the new_page's
refcount frozen to 0 (and refcounts of all old pages so far also frozen
to 0). Including shmem_getpage() to read in any which were out on swap,
memory reclaim if necessary to allocate their intermediate pages, and
copying over all the data from old to new.
Imagine the frozen refcount as a spinlock held, but without any lock
debugging to highlight the abuse: it's not good, and under serious load
heads into lockups - speculative getters of the page are not expecting
to spin while khugepaged is rescheduled.
One can get a little further under load by hacking around elsewhere; but
fortunately, freezing the new_page turns out to have been entirely
unnecessary, with no hacks needed elsewhere.
The huge new_page lock is already held throughout, and guards all its
subpages as they are brought one by one into the page cache tree; and
anything reading the data in that page, without the lock, before it has
been marked PageUptodate, would already be in the wrong. So simply
eliminate the freezing of the new_page.
Each of the old pages remains frozen with refcount 0 after it has been
replaced by a new_page subpage in the page cache tree, until they are
all unfrozen on success or failure: just as before. They could be
unfrozen sooner, but cause no problem once no longer visible to
find_get_entry(), filemap_map_pages() and other speculative lookups.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
Fixes:
|
||
|
042a308248 |
mm/khugepaged: minor reorderings in collapse_shmem()
Several cleanups in collapse_shmem(): most of which probably do not
really matter, beyond doing things in a more familiar and reassuring
order. Simplify the failure gotos in the main loop, and on success
update stats while interrupts still disabled from the last iteration.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
Fixes:
|
||
|
2af8ff2918 |
mm/khugepaged: collapse_shmem() remember to clear holes
Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
flags khugepaged uses to allocate a huge page - in all common cases it
would just be a waste of effort - so collapse_shmem() must remember to
clear out any holes that it instantiates.
The obvious place to do so, where they are put into the page cache tree,
is not a good choice: because interrupts are disabled there. Leave it
until further down, once success is assured, where the other pages are
copied (before setting PageUptodate).
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
Fixes:
|
||
|
aaa52e3400 |
mm/khugepaged: fix crashes due to misaccounted holes
Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later closed and unlinked. khugepaged's collapse_shmem() was forgetting to update mapping->nrpages on the rollback path, after it had added but then needs to undo some holes. There is indeed an irritating asymmetry between shmem_charge(), whose callers want it to increment nrpages after successfully accounting blocks, and shmem_uncharge(), when __delete_from_page_cache() already decremented nrpages itself: oh well, just add a comment on that to them both. And shmem_recalc_inode() is supposed to be called when the accounting is expected to be in balance (so it can deduce from imbalance that reclaim discarded some pages): so change shmem_charge() to update nrpages earlier (though it's rare for the difference to matter at all). Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils Fixes: |
||
|
701270fa19 |
mm/khugepaged: collapse_shmem() stop if punched or truncated
Huge tmpfs testing showed that although collapse_shmem() recognizes a
concurrently truncated or hole-punched page correctly, its handling of
holes was liable to refill an emptied extent. Add check to stop that.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
Fixes:
|
||
|
006d3ff27e |
mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
__split_huge_page() was using i_size_read() while holding the irq-safe
lru_lock and page tree lock, but the 32-bit i_size_read() uses an
irq-unsafe seqlock which should not be nested inside them.
Instead, read the i_size earlier in split_huge_page_to_list(), and pass
the end offset down to __split_huge_page(): all while holding head page
lock, which is enough to prevent truncation of that extent before the
page tree lock has been taken.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
Fixes:
|
||
|
173d9d9fd3 |
mm/huge_memory: splitting set mapping+index before unfreeze
Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page). Move the setting of mapping and index up before the page_ref_unfreeze() in __split_huge_page_tail() to fix this: so that a page cache lookup cannot get a reference while the tail's mapping and index are unstable. In fact, might as well move them up before the smp_wmb(): I don't see an actual need for that, but if I'm missing something, this way round is safer than the other, and no less efficient. You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is misplaced, and should be left until after the trylock_page(); but left as is has not crashed since, and gives more stringent assurance. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils Fixes: |
||
|
906f9cdfc2 |
mm/huge_memory: rename freeze_page() to unmap_page()
The term "freeze" is used in several ways in the kernel, and in mm it
has the particular meaning of forcing page refcount temporarily to 0.
freeze_page() is just too confusing a name for a function that unmaps a
page: rename it unmap_page(), and rename unfreeze_page() remap_page().
Went to change the mention of freeze_page() added later in mm/rmap.c,
but found it to be incorrect: ordinary page reclaim reaches there too;
but the substance of the comment still seems correct, so edit it down.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
Fixes:
|
||
|
dcf7fe9d89 |
userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
Set the page dirty if VM_WRITE is not set because in such case the pte
won't be marked dirty and the page would be reclaimed without writepage
(i.e. swapout in the shmem case).
This was found by source review. Most apps (certainly including QEMU)
only use UFFDIO_COPY on PROT_READ|PROT_WRITE mappings or the app can't
modify the memory in the first place. This is for correctness and it
could help the non cooperative use case to avoid unexpected data loss.
Link: http://lkml.kernel.org/r/20181126173452.26955-6-aarcange@redhat.com
Reviewed-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
e2a50c1f64 |
userfaultfd: shmem: add i_size checks
With MAP_SHARED: recheck the i_size after taking the PT lock, to
serialize against truncate with the PT lock. Delete the page from the
pagecache if the i_size_read check fails.
With MAP_PRIVATE: check the i_size after the PT lock before mapping
anonymous memory or zeropages into the MAP_PRIVATE shmem mapping.
A mostly irrelevant cleanup: like we do the delete_from_page_cache()
pagecache removal after dropping the PT lock, the PT lock is a spinlock
so drop it before the sleepable page lock.
Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com
Fixes:
|
||
|
29ec90660d |
userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
After the VMA to register the uffd onto is found, check that it has VM_MAYWRITE set before allowing registration. This way we inherit all common code checks before allowing to fill file holes in shmem and hugetlbfs with UFFDIO_COPY. The userfaultfd memory model is not applicable for readonly files unless it's a MAP_PRIVATE. Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com Fixes: |
||
|
5b51072e97 |
userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem
Userfaultfd did not create private memory when UFFDIO_COPY was invoked
on a MAP_PRIVATE shmem mapping. Instead it wrote to the shmem file,
even when that had not been opened for writing. Though, fortunately,
that could only happen where there was a hole in the file.
Fix the shmem-backed implementation of UFFDIO_COPY to create private
memory for MAP_PRIVATE mappings. The hugetlbfs-backed implementation
was already correct.
This change is visible to userland, if userfaultfd has been used in
unintended ways: so it introduces a small risk of incompatibility, but
is necessary in order to respect file permissions.
An app that uses UFFDIO_COPY for anything like postcopy live migration
won't notice the difference, and in fact it'll run faster because there
will be no copy-on-write and memory waste in the tmpfs pagecache
anymore.
Userfaults on MAP_PRIVATE shmem keep triggering only on file holes like
before.
The real zeropage can also be built on a MAP_PRIVATE shmem mapping
through UFFDIO_ZEROPAGE and that's safe because the zeropage pte is
never dirty, in turn even an mprotect upgrading the vma permission from
PROT_READ to PROT_READ|PROT_WRITE won't make the zeropage pte writable.
Link: http://lkml.kernel.org/r/20181126173452.26955-3-aarcange@redhat.com
Fixes:
|
||
|
9e368259ad |
userfaultfd: use ENOENT instead of EFAULT if the atomic copy user fails
Patch series "userfaultfd shmem updates".
Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
lack of the VM_MAYWRITE check and the lack of i_size checks.
Then looking into the above we also fixed the MAP_PRIVATE case.
Hugh by source review also found a data loss source if UFFDIO_COPY is
used on shmem MAP_SHARED PROT_READ mappings (the production usages
incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
happen in those production usages like with QEMU).
The whole patchset is marked for stable.
We verified QEMU postcopy live migration with guest running on shmem
MAP_PRIVATE run as well as before after the fix of shmem MAP_PRIVATE.
Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU
unconditionally invokes a punch hole if the guest mapping is filebacked
and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and
for the anon backend).
This patch (of 5):
We internally used EFAULT to communicate with the caller, switch to
ENOENT, so EFAULT can be used as a non internal retval.
Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
Fixes:
|
||
|
8f416836c0 |
mm/page_alloc.c: fix calculation of pgdat->nr_zones
init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
'zone_idx(zone) + 1' unconditionally. This is correct in the normal
case, while not exact in hot-plug situation.
This function is used in two places:
* free_area_init_core()
* move_pfn_range_to_zone()
In the first case, we are sure zone index increase monotonically. While
in the second one, this is under users control.
One way to reproduce this is:
----------------------------
1. create a virtual machine with empty node1
-m 4G,slots=32,maxmem=32G \
-smp 4,maxcpus=8 \
-numa node,nodeid=0,mem=4G,cpus=0-3 \
-numa node,nodeid=1,mem=0G,cpus=4-7
2. hot-add cpu 3-7
cpu-add [3-7]
2. hot-add memory to nod1
object_add memory-backend-ram,id=ram0,size=1G
device_add pc-dimm,id=dimm0,memdev=ram0,node=1
3. online memory with following order
echo online_movable > memory47/state
echo online > memory40/state
After this, node1 will have its nr_zones equals to (ZONE_NORMAL + 1)
instead of (ZONE_MOVABLE + 1).
Michal said:
"Having an incorrect nr_zones might result in all sorts of problems
which would be quite hard to debug (e.g. reclaim not considering the
movable zone). I do not expect many users would suffer from this it
but still this is trivial and obviously right thing to do so
backporting to the stable tree shouldn't be harmful (last famous
words)"
Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
Fixes:
|
||
|
c1cb20d437 |
mm: use swp_offset as key in shmem_replace_page()
We changed the key of swap cache tree from swp_entry_t.val to
swp_offset. We need to do so in shmem_replace_page() as well.
Hugh said:
"shmem_replace_page() has been wrong since the day I wrote it: good
enough to work on swap "type" 0, which is all most people ever use
(especially those few who need shmem_replace_page() at all), but
broken once there are any non-0 swp_type bits set in the higher order
bits"
Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com
Fixes:
|
||
|
6ff38bd402 |
mm: cleancache: fix corruption on missed inode invalidation
If all pages are deleted from the mapping by memory reclaim and also
moved to the cleancache:
__delete_from_page_cache
(no shadow case)
unaccount_page_cache_page
cleancache_put_page
page_cache_delete
mapping->nrpages -= nr
(nrpages becomes 0)
We don't clean the cleancache for an inode after final file truncation
(removal).
truncate_inode_pages_final
check (nrpages || nrexceptional) is false
no truncate_inode_pages
no cleancache_invalidate_inode(mapping)
These way when reading the new file created with same inode we may get
these trash leftover pages from cleancache and see wrong data instead of
the contents of the new file.
Fix it by always doing truncate_inode_pages which is already ready for
nrpages == 0 && nrexceptional == 0 case and just invalidates inode.
[akpm@linux-foundation.org: add comment, per Jan]
Link: http://lkml.kernel.org/r/20181112095734.17979-1-ptikhomirov@virtuozzo.com
Fixes: commit
|
||
|
08be37b798 |
mm/gup: finish consolidating error handling
Commit |
||
|
b401ec1848 |
mm: Replace call_rcu_sched() with call_rcu()
Now that call_rcu()'s callback is not invoked until after all preempt-disable regions of code have completed (in addition to explicitly marked RCU read-side critical sections), call_rcu() can be used in place of call_rcu_sched(). This commit therefore makes that change. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> |
||
|
6564a25e6c |
slab: Replace synchronize_sched() with synchronize_rcu()
Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, synchronize_sched() can be replaced by synchronize_rcu(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> |
||
|
0a1b8b87d0 |
block: make blk_poll() take a parameter on whether to spin or not
blk_poll() has always kept spinning until it found an IO. This is fine for SYNC polling, since we need to find one request we have pending, but in preparation for ASYNC polling it can be beneficial to just check if we have any entries available or not. Existing callers are converted to pass in 'spin == true', to retain the old behavior. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
2ac5e38ea4 |
Merge drm/drm-next into drm-intel-next-queued
Pull in v4.20-rc3 via drm-next. Signed-off-by: Jani Nikula <jani.nikula@intel.com> |
||
|
849a370016 |
block: avoid ordered task state change for polled IO
For the core poll helper, the task state setting don't need to imply any atomics, as it's the current task itself that is being modified and we're not going to sleep. For IRQ driven, the wakeup path have the necessary barriers to not need us using the heavy handed version of the task state setting. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
a78b03bc73 |
Linux 4.20-rc3
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlvx2sAeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGycgIAIuxobwt0RRKa0zO ROS+34JGoC2yU2P9VdEGWdtxS6ANMVQgKPBhWL6s+xR89Kd+V4xSdJLD1pNTxxqP 0DCva0np1/Q4juH+JbU50v/lykoLgteZ0P0LBRGf1y8p3WiLPv45IbnNsMDNYhB2 7a8rOmZYakRY9CPznRDw3X8cJt3sddKgFJHIOGz1OQJVWtCD0KPGcJmQNsbDSagY Zx6Z5BKSIdjRqaAdN5gDa1Pft3WQo7TpaQGl80lSsgr5LcjmscXA3sClOCy+25Mo FZLx0PcwP+Efq8RTGzNK51WSOMa6d37hvjDqUAdQBOR0KbyjRyXQwyQVw/MGbPJs 7J3Pzm0= =56Mt -----END PGP SIGNATURE----- Merge tag 'v4.20-rc3' into for-4.21/block Merge in -rc3 to resolve a few conflicts, but also to get a few important fixes that have gone into mainline since the block 4.21 branch was forked off (most notably the SCSI queue issue, which is both a conflict AND needed fix). Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
45e79815b8 |
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
Link: http://lkml.kernel.org/r/20181107100247.13359-1-rainccrun@gmail.com Signed-off-by: Chen Chang <rainccrun@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c63ae43ba5 |
mm, page_alloc: check for max order in hot path
Konstantin has noticed that kvmalloc might trigger the following warning: WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60 [...] Call Trace: fragmentation_index+0x76/0x90 compaction_suitable+0x4f/0xf0 shrink_node+0x295/0x310 node_reclaim+0x205/0x250 get_page_from_freelist+0x649/0xad0 __alloc_pages_nodemask+0x12a/0x2a0 kmalloc_large_node+0x47/0x90 __kmalloc_node+0x22b/0x2e0 kvmalloc_node+0x3e/0x70 xt_alloc_table_info+0x3a/0x80 [x_tables] do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables] nf_setsockopt+0x44/0x60 SyS_setsockopt+0x6f/0xc0 do_syscall_64+0x67/0x120 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 the problem is that we only check for an out of bound order in the slow path and the node reclaim might happen from the fast path already. This is fixable by making sure that kvmalloc doesn't ever use kmalloc for requests that are larger than KMALLOC_MAX_SIZE but this also shows that the code is rather fragile. A recent UBSAN report just underlines that by the following report UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19 shift exponent 51 is too large for 32-bit type 'int' CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xd2/0x148 lib/dump_stack.c:113 ubsan_epilogue+0x12/0x94 lib/ubsan.c:159 __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425 __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117 zone_watermark_fast mm/page_alloc.c:3216 [inline] get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300 __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370 alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093 alloc_pages include/linux/gfp.h:509 [inline] __get_free_pages+0x12/0x60 mm/page_alloc.c:4414 dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156 raw_cmd_copyin drivers/block/floppy.c:3159 [inline] raw_cmd_ioctl drivers/block/floppy.c:3206 [inline] fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544 fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571 __blkdev_driver_ioctl block/ioctl.c:303 [inline] blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601 block_ioctl+0x105/0x150 fs/block_dev.c:1883 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687 ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702 __do_sys_ioctl fs/ioctl.c:709 [inline] __se_sys_ioctl fs/ioctl.c:707 [inline] __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707 do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Note that this is not a kvmalloc path. It is just that the fast path really depends on having sanitzed order as well. Therefore move the order check to the fast path. Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: Kyungtae Kim <kt0755@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Byoungyoung Lee <lifeasageek@gmail.com> Cc: "Dae R. Jeong" <threeearcat@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1a41364693 |
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
Other filesystems such as ext4, f2fs and ubifs all return ENXIO when lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset. man 2 lseek says : EINVAL whence is not valid. Or: the resulting file offset would be : negative, or beyond the end of a seekable device. : : ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond : the end of the file. Make tmpfs return ENXIO under these circumstances as well. After this, tmpfs also passes xfstests's generic/448. [akpm@linux-foundation.org: rewrite changelog] Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com Signed-off-by: Yufen Yu <yuyufen@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
13c9aaf7fa |
mm/vmstat.c: fix NUMA statistics updates
Scan through the whole array to see if an update is needed. While we're
at it, use sizeof() to be safe against any possible type changes in the
future.
The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus. Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway. So I wouldn't bother to mark
this for stable, yet something nice to fix.
[mhocko@suse.com: changelog enhancement]
Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
Fixes:
|
||
|
78179556e7 |
mm/gup.c: fix follow_page_mask() kerneldoc comment
Commit
|
||
|
9d7899999c |
mm, memory_hotplug: check zone_movable in has_unmovable_pages
Page state checks are racy. Under a heavy memory workload (e.g. stress -m 200 -t 2h) it is quite easy to hit a race window when the page is allocated but its state is not fully populated yet. A debugging patch to dump the struct page state shows has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0 page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1 flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked) Note that the state has been checked for both PageLRU and PageSwapBacked already. Closing this race completely would require some sort of retry logic. This can be tricky and error prone (think of potential endless or long taking loops). Workaround this problem for movable zones at least. Such a zone should only contain movable pages. Commit |
||
|
873d7bcfd0 |
mm/swapfile.c: use kvzalloc for swap_info_struct allocation
Commit |
||
|
5e41540c8a |
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
This bug has been experienced several times by the Oracle DB team. The
BUG is in remove_inode_hugepages() as follows:
/*
* If page is mapped, it was faulted in after being
* unmapped in caller. Unmap (again) now after taking
* the fault mutex. The mutex will prevent faults
* until we finish removing the page.
*
* This race can only happen in the hole punch case.
* Getting here in a truncate operation is a bug.
*/
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code. Consider the following:
- Process A maps a hugetlbfs file of sufficient size and alignment
(PUD_SIZE) that a pmd page could be shared.
- Process B maps the same hugetlbfs file with the same size and
alignment such that a pmd page is shared.
- Process B then calls mprotect() to change protections for the mapping
with the shared pmd. As a result, the pmd is 'unshared'.
- Process B then calls mprotect() again to chage protections for the
mapping back to their original value. pmd remains unshared.
- Process B then forks and process C is created. During the fork
process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
tables. Copying page tables for hugetlb mappings is done in the
routine copy_hugetlb_page_range.
In copy_hugetlb_page_range(), the destination pte is obtained by:
dst_pte = huge_pte_alloc(dst, addr, sz);
If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table. In the situation above, process C could share with
either process A or process B. Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.
However, the check for pmd sharing in copy_hugetlb_page_range is:
/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
continue;
Since process C is sharing with process A instead of process B, the
above test fails. The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page. This is how we end up with an elevated map count.
To solve, check the dst_pte entry for huge_pte_none. If !none, this
implies PMD sharing so do not copy.
Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes:
|
||
|
ca0246bb97 |
z3fold: fix possible reclaim races
Reclaim and free can race on an object which is basically fine but in order for reclaim to be able to map "freed" object we need to encode object length in the handle. handle_to_chunks() is then introduced to extract object length from a handle and use it during mapping. Moreover, to avoid racing on a z3fold "headless" page release, we should not try to free that page in z3fold_free() if the reclaim bit is set. Also, in the unlikely case of trying to reclaim a page being freed, we should not proceed with that page. While at it, fix the page accounting in reclaim function. This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups". Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com Signed-off-by: Vitaly Wool <vitaly.vul@sony.com> Signed-off-by: Jongseok Kim <ks77sj@gmail.com> Reported-by-by: Jongseok Kim <ks77sj@gmail.com> Reviewed-by: Snild Dolkow <snild@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
0619317ff8 |
block: add polled wakeup task helper
If we're polling for IO on a device that doesn't use interrupts, then IO completion loop (and wake of task) is done by submitting task itself. If that is the case, then we don't need to enter the wake_up_process() function, we can simply mark ourselves as TASK_RUNNING. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
35f3aa39f2 |
mm: Replace spin_is_locked() with lockdep
lockdep_assert_held() is better suited to checking locking requirements, since it only checks if the current thread holds the lock regardless of whether someone else does. This is also a step towards possibly removing spin_is_locked(). Signed-off-by: Lance Roy <ldr709@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Shakeel Butt <shakeelb@google.com> Cc: <linux-mm@kvack.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> |
||
|
64e3d12f76 |
mm, drm/i915: mark pinned shmemfs pages as unevictable
The i915 driver uses shmemfs to allocate backing storage for gem objects. These shmemfs pages can be pinned (increased ref count) by shmem_read_mapping_page_gfp(). When a lot of pages are pinned, vmscan wastes a lot of time scanning these pinned pages. In some extreme case, all pages in the inactive anon lru are pinned, and only the inactive anon lru is scanned due to inactive_ratio, the system cannot swap and invokes the oom-killer. Mark these pinned pages as unevictable to speed up vmscan. Export pagevec API check_move_unevictable_pages(). This patch was inspired by Chris Wilson's change [1]. [1]: https://patchwork.kernel.org/patch/9768741/ Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> # mm part Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Link: https://patchwork.freedesktop.org/patch/msgid/20181106132324.17390-1-chris@chris-wilson.co.uk Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> |
||
|
dd33ad7b25 |
memory_hotplug: cond_resched in __remove_pages
We have received a bug report that unbinding a large pmem (>1TB) can result in a soft lockup: NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365] [...] Supported: Yes CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4 Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018 task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000 RIP: 0010:__put_page+0x62/0x80 Call Trace: devm_memremap_pages_release+0x152/0x260 release_nodes+0x18d/0x1d0 device_release_driver_internal+0x160/0x210 unbind_store+0xb3/0xe0 kernfs_fop_write+0x102/0x180 __vfs_write+0x26/0x150 vfs_write+0xad/0x1a0 SyS_write+0x42/0x90 do_syscall_64+0x74/0x150 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7fd13166b3d0 It has been reported on an older (4.12) kernel but the current upstream code doesn't cond_resched in the hot remove code at all and the given range to remove might be really large. Fix the issue by calling cond_resched once per memory section. Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
89c83fb539 |
mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
THP allocation mode is quite complex and it depends on the defrag mode. This complexity is hidden in alloc_hugepage_direct_gfpmask from a large part currently. The NUMA special casing (namely __GFP_THISNODE) is however independent and placed in alloc_pages_vma currently. This both adds an unnecessary branch to all vma based page allocation requests and it makes the code more complex unnecessarily as well. Not to mention that e.g. shmem THP used to do the node reclaiming unconditionally regardless of the defrag mode until recently. This was not only unexpected behavior but it was also hardly a good default behavior and I strongly suspect it was just a side effect of the code sharing more than a deliberate decision which suggests that such a layering is wrong. Get rid of the thp special casing from alloc_pages_vma and move the logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the resulting gfp mask only when the direct reclaim is not requested and when there is no explicit numa binding to preserve the current logic. Please note that there's also a slight difference wrt MPOL_BIND now. The previous code would avoid using __GFP_THISNODE if the local node was outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided for all MPOL_BIND policies. So there's a difference that if local node is actually allowed by the bind policy's nodemask, previously __GFP_THISNODE would be added, but now it won't be. From the behavior POV this is still correct because the policy nodemask is used. Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ac5b2c1891 |
mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
THP allocation might be really disruptive when allocated on NUMA system with the local node full or hard to reclaim. Stefan has posted an allocation stall report on 4.12 based SLES kernel which suggests the same issue: kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null) kvm cpuset=/ mems_allowed=0-1 CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased) Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017 Call Trace: dump_stack+0x5c/0x84 warn_alloc+0xe0/0x180 __alloc_pages_slowpath+0x820/0xc90 __alloc_pages_nodemask+0x1cc/0x210 alloc_pages_vma+0x1e5/0x280 do_huge_pmd_wp_page+0x83f/0xf00 __handle_mm_fault+0x93d/0x1060 handle_mm_fault+0xc6/0x1b0 __do_page_fault+0x230/0x430 do_page_fault+0x2a/0x70 page_fault+0x7b/0x80 [...] Mem-Info: active_anon:126315487 inactive_anon:1612476 isolated_anon:5 active_file:60183 inactive_file:245285 isolated_file:0 unevictable:15657 dirty:286 writeback:1 unstable:0 slab_reclaimable:75543 slab_unreclaimable:2509111 mapped:81814 shmem:31764 pagetables:370616 bounce:0 free:32294031 free_pcp:6233 free_cma:0 Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no The defrag mode is "madvise" and from the above report it is clear that the THP has been allocated for MADV_HUGEPAGA vma. Andrea has identified that the main source of the problem is __GFP_THISNODE usage: : The problem is that direct compaction combined with the NUMA : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very : hard the local node, instead of failing the allocation if there's no : THP available in the local node. : : Such logic was ok until __GFP_THISNODE was added to the THP allocation : path even with MPOL_DEFAULT. : : The idea behind the __GFP_THISNODE addition, is that it is better to : provide local memory in PAGE_SIZE units than to use remote NUMA THP : backed memory. That largely depends on the remote latency though, on : threadrippers for example the overhead is relatively low in my : experience. : : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in : extremely slow qemu startup with vfio, if the VM is larger than the : size of one host NUMA node. This is because it will try very hard to : unsuccessfully swapout get_user_pages pinned pages as result of the : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE : allocations and instead of trying to allocate THP on other nodes (it : would be even worse without vfio type1 GUP pins of course, except it'd : be swapping heavily instead). Fix this by removing __GFP_THISNODE for THP requests which are requesting the direct reclaim. This effectivelly reverts |
||
|
e68599a3c3 |
mm: handle no memcg case in memcg_kmem_charge() properly
Mike Galbraith reported a regression caused by the commit |
||
|
5f21585384 |
for-linus-20181102
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvchGgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj/1D/4kEQx4ncnFoZk8QshHV1L++rH3BbcLjQDd Wbh9ZSIQdI/gHTzS6bE7x3YfcbpMWPMO3+jFawdfRiFTEjlF8vQ+mnJ+Btb3z4D6 mGEeFGVhHExlp2a0x/Ma8YWVNlMB7BE8Tq73bZEVMY+9lbpmDW/vp7Sfa87LBDKQ ZmY+My+VdHN7qLtQ7t3W/HtpbU+kcXMMd3ICjK4i+ofXy6mynk4+oQ2jwyXc5L86 UCJCsTsSRr3CgbnkW/uprHo0XHk8i7O/4C3oR+x4pAIxCCa9g+vmw0EO9fvi/2iQ qe8jKdm7Y09xu/TiPBa7iz45tdh0cNMJKo3OezmSF9Np+r69KL5C/U4GRPKN3Iwm keoqn14ScABkYMSe4ys1AdEgKD6bNUaW3r/lJxTH2oUR23mjnCLp7c4WD/G+MlbB CzoakQyCHTZmDFLr2Kc8bkjmpil2T2UFfmLIDAu30LWIYeSGpiIO/V+g1foJMF2f 06ERltNvgX1BJjoh4NSWySLEf1ZtkUU60NeATRol6gwhnIyLrHsgfm6OEhqlW/7x Xc1BWyzX7K6c3Dskk/u5aSRyXOyRC9KkMt3/2XexeDNHkte9yMH0IgSvopPBuER8 +iPvPjNp7ychTKZB3zpSnlqGgePTjbufIEBtO3OyUmDZKjUqxahtxkQfmPhoclu+ XdR4ArcqNg== =0zM4 -----END PGP SIGNATURE----- Merge tag 'for-linus-20181102' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "The biggest part of this pull request is the revert of the blkcg cleanup series. It had one fix earlier for a stacked device issue, but another one was reported. Rather than play whack-a-mole with this, revert the entire series and try again for the next kernel release. Apart from that, only small fixes/changes. Summary: - Indentation fixup for mtip32xx (Colin Ian King) - The blkcg cleanup series revert (Dennis Zhou) - Two NVMe fixes. One fixing a regression in the nvme request initialization in this merge window, causing nvme-fc to not work. The other is a suspend/resume p2p resource issue (James, Keith) - Fix sg discard merge, allowing us to merge in cases where we didn't before (Jianchao Wang) - Call rq_qos_exit() after the queue is frozen, preventing a hang (Ming) - Fix brd queue setup, fixing an oops if we fail setting up all devices (Ming)" * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block: nvme-pci: fix conflicting p2p resource adds nvme-fc: fix request private initialization blkcg: revert blkcg cleanups series block: brd: associate with queue until adding disk block: call rq_qos_exit() after queue is frozen mtip32xx: clean an indentation issue, remove extraneous tabs block: fix the DISCARD request merge |
||
|
c2aa1a444c |
vfs: rework data cloning infrastructure
Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use a common .remap_file_range method and supply generic bounds and sanity checking functions that are shared with the data write path. The current VFS infrastructure has problems with rlimit, LFS file sizes, file time stamps, maximum filesystem file sizes, stripping setuid bits, etc and so they are addressed in these commits. We also introduce the ability for the ->remap_file_range methods to return short clones so that clones for vfs_copy_file_range() don't get rejected if the entire range can't be cloned. It also allows filesystems to sliently skip deduplication of partial EOF blocks if they are not capable of doing so without requiring errors to be thrown to userspace. All existing filesystems are converted to user the new .remap_file_range method, and both XFS and ocfs2 are modified to make use of the new generic checking infrastructure. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJb29gEAAoJEK3oKUf0dfodpOAQAL2VbHjvKXEwNMDTKscSRMmZ Z0xXo3gamFKQ+VGOqy2g2lmAYQs9SAnTuCGTJ7zIAp7u+q8gzUy5FzKAwLS4Id6L 8siaY6nzlicfO04d0MdXnWz0f3xykChgzfdQfVUlUi7WrDioBUECLPmx4a+USsp1 DQGjLOZfoOAmn2rijdnH9RTEaHqg+8mcTaLN9TRav4gGqrWxldFKXw2y6ouFC7uo /hxTRNXR9VI+EdbDelwBNXl9nU9gQA0WLOvRKwgUrtv6bSJohTPsmXt7EbBtNcVR cl3zDNc1sLD1bLaRLEUAszI/33wXaaQgom1iB51obIcHHef+JxRNG/j6rUMfzxZI VaauGv5EIvtaKN0LTAqVVLQ8t2MQFYfOr8TykmO+1UFog204aKRANdVMHDSjxD/0 dTGKJGcq+HnKQ+JHDbTdvuXEL8sUUl1FiLjOQbZPw63XmuddLKFUA2TOjXn6htbU 1h1MG5d9KjGLpabp2BQheczD08NuSmcrOBNt7IoeI3+nxr3HpMwprfB9TyaERy9X iEgyVXmjjc9bLLRW7A2wm77aW64NvPs51wKMnvuNgNwnCewrGS6cB8WVj2zbQjH1 h3f3nku44s9ctNPSBzb/sJLnpqmZQ5t0oSmrMSN+5+En6rNTacoJCzxHRJBA7z/h Z+C6y1GTZw0euY6Zjiwu =CE/A -----END PGP SIGNATURE----- Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull vfs dedup fixes from Dave Chinner: "This reworks the vfs data cloning infrastructure. We discovered many issues with these interfaces late in the 4.19 cycle - the worst of them (data corruption, setuid stripping) were fixed for XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all the problems was needed. That rework is the contents of this pull request. Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use a common .remap_file_range method and supply generic bounds and sanity checking functions that are shared with the data write path. The current VFS infrastructure has problems with rlimit, LFS file sizes, file time stamps, maximum filesystem file sizes, stripping setuid bits, etc and so they are addressed in these commits. We also introduce the ability for the ->remap_file_range methods to return short clones so that clones for vfs_copy_file_range() don't get rejected if the entire range can't be cloned. It also allows filesystems to sliently skip deduplication of partial EOF blocks if they are not capable of doing so without requiring errors to be thrown to userspace. Existing filesystems are converted to user the new remap_file_range method, and both XFS and ocfs2 are modified to make use of the new generic checking infrastructure" * tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: remove [cm]time update from reflink calls xfs: remove xfs_reflink_remap_range xfs: remove redundant remap partial EOF block checks xfs: support returning partial reflink results xfs: clean up xfs_reflink_remap_blocks call site xfs: fix pagecache truncation prior to reflink ocfs2: remove ocfs2_reflink_remap_range ocfs2: support partial clone range and dedupe range ocfs2: fix pagecache truncation prior to reflink ocfs2: truncate page cache for clone destination file before remapping vfs: clean up generic_remap_file_range_prep return value vfs: hide file range comparison function vfs: enable remap callers that can handle short operations vfs: plumb remap flags through the vfs dedupe functions vfs: plumb remap flags through the vfs clone functions vfs: make remap_file_range functions take and return bytes completed vfs: remap helper should update destination inode metadata vfs: pass remap flags to generic_remap_checks vfs: pass remap flags to generic_remap_file_range_prep vfs: combine the clone and dedupe into a single remap_file_range ... |
||
|
9931a07d51 |
Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull AFS updates from Al Viro: "AFS series, with some iov_iter bits included" * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) missing bits of "iov_iter: Separate type from direction and use accessor functions" afs: Probe multiple fileservers simultaneously afs: Fix callback handling afs: Eliminate the address pointer from the address list cursor afs: Allow dumping of server cursor on operation failure afs: Implement YFS support in the fs client afs: Expand data structure fields to support YFS afs: Get the target vnode in afs_rmdir() and get a callback on it afs: Calc callback expiry in op reply delivery afs: Fix FS.FetchStatus delivery from updating wrong vnode afs: Implement the YFS cache manager service afs: Remove callback details from afs_callback_break struct afs: Commit the status on a new file/dir/symlink afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS afs: Don't invoke the server to read data beyond EOF afs: Add a couple of tracepoints to log I/O errors afs: Handle EIO from delivery function afs: Fix TTL on VL server and address lists afs: Implement VL server rotation afs: Improve FS server rotation error handling ... |
||
|
b5f2954d30 |
blkcg: revert blkcg cleanups series
This reverts a series committed earlier due to null pointer exception bug report in [1]. It seems there are edge case interactions that I did not consider and will need some time to understand what causes the adverse interactions. The original series can be found in [2] with a follow up series in [3]. [1] https://www.spinics.net/lists/cgroups/msg20719.html [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/ [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/ This reverts the following commits: |
||
|
b5b1de3537 |
virtio, vhost: fixes, tweaks
virtio balloon page hinting support vhost scsi control queue misc fixes. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJb222AAAoJECgfDbjSjVRpfHMH/1uh87pIj6Qbh9LRZm0dRVHs iEuIa0TECb+9AmqHRBliEcqWjihWnlQnSwE6d/ZTk9cH9zEEXsIu0B7nDzjGwJ5b m7wq679JqrtGpCdQhRH85sk8P3fs5ldatdbc4/0fsPwwkoXHcqTOpSZWyhtQHGc4 EkzzxHXVhVXgRBjzLXVCMoOAQ+8QfMZFrKIgKuOB0I4OVughFrAGf0Hemm18f4CL 5+YwsJQjleDkm+Udf+FTQS2oZ57DJsOLm2bwoKgqCkBaDfPlR92uWjgTa50WXggo RaokpFQkJKpz11yenezslzrVWUJApnWnUhfRd71t1ttvujrrcD9zUvWEVMURbuU= =2i/5 -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio/vhost updates from Michael Tsirkin: "Fixes and tweaks: - virtio balloon page hinting support - vhost scsi control queue - misc fixes" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: MAINTAINERS: remove reference to bogus vsock file vhost/scsi: Use common handling code in request queue handler vhost/scsi: Extract common handling code from control queue handler vhost/scsi: Respond to control queue operations vhost/scsi: truncate T10 PI iov_iter to prot_bytes virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON mm/page_poison: expose page_poisoning_enabled to kernel modules virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT kvm_config: add CONFIG_VIRTIO_MENU |
||
|
6444ccfd69 |
Merge branch 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu fixes from Dennis Zhou: "Two small things for v4.20. The first fixes a clang uninitialized variable warning for arm64 in the default path calls BUILD_BUG(). The second removes an unnecessary unlikely() in a WARN_ON() use" * 'for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: arm64: percpu: Initialize ret in the default case mm: percpu: remove unnecessary unlikely() |
||
|
2ebe82288b |
mm/gup.c: fix __get_user_pages_fast() comment
mmu_gather_tlb() no longer exists. Replace with mmu_table_batch(). Link: http://lkml.kernel.org/r/20180928053441.rpzwafzlsnp74mkl@wfg-t540p.sh.intel.com Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Kosina <trivial@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f2c57d91b0 |
mm: Fix warning in insert_pfn()
In DAX mode a write pagefault can race with write(2) in the following way: CPU0 CPU1 write fault for mapped zero page (hole) dax_iomap_rw() iomap_apply() xfs_file_iomap_begin() - allocates blocks dax_iomap_actor() invalidate_inode_pages2_range() - invalidates radix tree entries in given range dax_iomap_pte_fault() grab_mapping_entry() - no entry found, creates empty ... xfs_file_iomap_begin() - finds already allocated block ... vmf_insert_mixed_mkwrite() - WARNs and does nothing because there is still zero page mapped in PTE unmap_mapping_pages() This race results in WARN_ON from insert_pfn() and is occasionally triggered by fstest generic/344. Note that the race is otherwise harmless as before write(2) on CPU0 is finished, we will invalidate page tables properly and thus user of mmap will see modified data from write(2) from that point on. So just restrict the warning only to the case when the PFN in PTE is not zero page. Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
381eab4a6e |
mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
There seem to be some problems as result of |
||
|
8df1d0e4a2 |
mm/memory_hotplug: make add_memory() take the device_hotplug_lock
add_memory() currently does not take the device_hotplug_lock, however
is aleady called under the lock from
arch/powerpc/platforms/pseries/hotplug-memory.c
drivers/acpi/acpi_memhotplug.c
to synchronize against CPU hot-remove and similar.
In general, we should hold the device_hotplug_lock when adding memory to
synchronize against online/offline request (e.g. from user space) - which
already resulted in lock inversions due to device_lock() and
mem_hotplug_lock - see
|
||
|
d15e59260f |
mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3. Reading through the code and studying how mem_hotplug_lock is to be used, I noticed that there are two places where we can end up calling device_online()/device_offline() - online_pages()/offline_pages() without the mem_hotplug_lock. And there are other places where we call device_online()/device_offline() without the device_hotplug_lock. While e.g. echo "online" > /sys/devices/system/memory/memory9/state is fine, e.g. echo 1 > /sys/devices/system/memory/memory9/online Will not take the mem_hotplug_lock. However the device_lock() and device_hotplug_lock. E.g. via memory_probe_store(), we can end up calling add_memory()->online_pages() without the device_hotplug_lock. So we can have concurrent callers in online_pages(). We e.g. touch in online_pages() basically unprotected zone->present_pages then. Looks like there is a longer history to that (see Patch #2 for details), and fixing it to work the way it was intended is not really possible. We would e.g. have to take the mem_hotplug_lock in device/base/core.c, which sounds wrong. Summary: We had a lock inversion on mem_hotplug_lock and device_lock(). More details can be found in patch 3 and patch 6. I propose the general rules (documentation added in patch 6): 1. add_memory/add_memory_resource() must only be called with device_hotplug_lock. 2. remove_memory() must only be called with device_hotplug_lock. This is already documented and holds for all callers. 3. device_online()/device_offline() must only be called with device_hotplug_lock. This is already documented and true for now in core code. Other callers (related to memory hotplug) have to be fixed up. 4. mem_hotplug_lock is taken inside of add_memory/remove_memory/ online_pages/offline_pages. To me, this looks way cleaner than what we have right now (and easier to verify). And looking at the documentation of remove_memory, using lock_device_hotplug also for add_memory() feels natural. This patch (of 6): remove_memory() is exported right now but requires the device_hotplug_lock, which is not exported. So let's provide a variant that takes the lock and only export that one. The lock is already held in arch/powerpc/platforms/pseries/hotplug-memory.c drivers/acpi/acpi_memhotplug.c arch/powerpc/platforms/powernv/memtrace.c Apart from that, there are not other users in the tree. Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Rashmica Gupta <rashmica.g@gmail.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: John Allen <jallen@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com> Cc: Mathieu Malaterre <malat@debian.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juergen Gross <jgross@suse.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2f770806fd |
mm/memblock.c: warn if zero alignment was requested
After updating all memblock users to explicitly specify SMP_CACHE_BYTES alignment rather than use 0, it is still possible that uncovered users may sneak in. Add a WARN_ON_ONCE for such cases. [sfr@canb.auug.org.au: use dump_stack() instead of WARN_ON_ONCE for the alignment checks] Link: http://lkml.kernel.org/r/20181016131927.6ceba6ab@canb.auug.org.au [akpm@linux-foundation.org: add apologetic comment] Link: http://lkml.kernel.org/r/20181011060850.GA19822@rapoport-lnx Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7e1c4e2792 |
memblock: stop using implicit alignment to SMP_CACHE_BYTES
When a memblock allocation APIs are called with align = 0, the alignment is implicitly set to SMP_CACHE_BYTES. Implicit alignment is done deep in the memblock allocator and it can come as a surprise. Not that such an alignment would be wrong even when used incorrectly but it is better to be explicit for the sake of clarity and the prinicple of the least surprise. Replace all such uses of memblock APIs with the 'align' parameter explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment in the memblock internal allocation functions. For the case when memblock APIs are used via helper functions, e.g. like iommu_arena_new_node() in Alpha, the helper functions were detected with Coccinelle's help and then manually examined and updated where appropriate. The direct memblock APIs users were updated using the semantic patch below: @@ expression size, min_addr, max_addr, nid; @@ ( | - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc(size, 0) + memblock_alloc(size, SMP_CACHE_BYTES) | - memblock_alloc_raw(size, 0) + memblock_alloc_raw(size, SMP_CACHE_BYTES) | - memblock_alloc_from(size, 0, min_addr) + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_nopanic(size, 0) + memblock_alloc_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_low(size, 0) + memblock_alloc_low(size, SMP_CACHE_BYTES) | - memblock_alloc_low_nopanic(size, 0) + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_from_nopanic(size, 0, min_addr) + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_node(size, 0, nid) + memblock_alloc_node(size, SMP_CACHE_BYTES, nid) ) [mhocko@suse.com: changelog update] [akpm@linux-foundation.org: coding-style fixes] [rppt@linux.ibm.com: fix missed uses of implicit alignment] Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
57c8a661d9 |
mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
97ad1087ef |
memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants
Drop BOOTMEM_ALLOC_ACCESSIBLE and BOOTMEM_ALLOC_ANYWHERE in favor of identical MEMBLOCK definitions. Link: http://lkml.kernel.org/r/1536927045-23536-29-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
bda49a8116 |
mm: remove nobootmem
Move a few remaining functions from nobootmem.c to memblock.c and remove nobootmem Link: http://lkml.kernel.org/r/1536927045-23536-28-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7c2ee349cf |
memblock: rename __free_pages_bootmem to memblock_free_pages
The conversion is done using sed -i 's@__free_pages_bootmem@memblock_free_pages@' \ $(git grep -l __free_pages_bootmem) Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c6ffc5ca8f |
memblock: rename free_all_bootmem to memblock_free_all
The conversion is done using sed -i 's@free_all_bootmem@memblock_free_all@' \ $(git grep -l free_all_bootmem) Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
53ab85ebfd |
memblock: replace free_bootmem_late with memblock_free_late
The free_bootmem_late and memblock_free_late do exactly the same thing: they iterate over a range and give pages to the page allocator. Replace calls to free_bootmem_late with calls to memblock_free_late and remove the bootmem variant. Link: http://lkml.kernel.org/r/1536927045-23536-25-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2013288f72 |
memblock: replace free_bootmem{_node} with memblock_free
The free_bootmem and free_bootmem_node are merely wrappers for memblock_free. Replace their usage with a call to memblock_free using the following semantic patch: @@ expression e1, e2, e3; @@ ( - free_bootmem(e1, e2) + memblock_free(e1, e2) | - free_bootmem_node(e1, e2, e3) + memblock_free(e2, e3) ) Link: http://lkml.kernel.org/r/1536927045-23536-24-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
6c7835f8d0 |
mm: nobootmem: remove bootmem allocation APIs
The bootmem compatibility APIs are not used and can be removed. Link: http://lkml.kernel.org/r/1536927045-23536-23-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
3913c8f9f9 |
memblock: add align parameter to memblock_alloc_node()
With the align parameter memblock_alloc_node() can be used as drop in replacement for alloc_bootmem_pages_node() and __alloc_bootmem_node(), which is done in the following patches. Link: http://lkml.kernel.org/r/1536927045-23536-15-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
eb31d559f1 |
memblock: remove _virt from APIs returning virtual address
The conversion is done using sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc) Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9a8dd708d5 |
memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*
Make it explicit that the caller gets a physical address rather than a virtual one. This will also allow using meblock_alloc prefix for memblock allocations returning virtual address, which is done in the following patches. The conversion is done using the following semantic patch: @@ expression e1, e2, e3; @@ ( - memblock_alloc(e1, e2) + memblock_phys_alloc(e1, e2) | - memblock_alloc_nid(e1, e2, e3) + memblock_phys_alloc_nid(e1, e2, e3) | - memblock_alloc_try_nid(e1, e2, e3) + memblock_phys_alloc_try_nid(e1, e2, e3) ) Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b146ada221 |
mm: nobootmem: remove dead code
Several bootmem functions and macros are not used. Remove them. Link: http://lkml.kernel.org/r/1536927045-23536-6-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
355c45affc |
mm: remove bootmem allocator implementation.
All architectures have been converted to use MEMBLOCK + NO_BOOTMEM. The bootmem allocator implementation can be removed. Link: http://lkml.kernel.org/r/1536927045-23536-5-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
aca52c3983 |
mm: remove CONFIG_HAVE_MEMBLOCK
All architecures use memblock for early memory management. There is no need for the CONFIG_HAVE_MEMBLOCK configuration option. [rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs] Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx [rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal] Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx [rppt@linux.vnet.ibm.com: remove stale #else and the code it protects] Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b4a991ec58 |
mm: remove CONFIG_NO_BOOTMEM
All achitectures select NO_BOOTMEM which essentially becomes 'Y' for any kernel configuration and therefore it can be removed. [alexander.h.duyck@linux.intel.com: remove now defunct NO_BOOTMEM from depends list for deferred init] Link: http://lkml.kernel.org/r/20180925201814.3576.15105.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/1536927045-23536-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4b408c74ee |
mm/gup_benchmark.c: prevent integer overflow in ioctl
The concern here is that "gup->size" is a u64 and "nr_pages" is unsigned
long. On 32 bit systems we could trick the kernel into allocating fewer
pages than expected.
Link: http://lkml.kernel.org/r/20181025061546.hnhkv33diogf2uis@kili.mountain
Fixes:
|
||
|
ec131b2d7f |
mm/hmm: invalidate device page table at start of invalidation
Invalidate device page table at start of invalidation and invalidate in progress CPU page table snapshooting at both start and end of any invalidation. This is helpful when device need to dirty page because the device page table report the page as dirty. Dirtying page must happen in the start mmu notifier callback and not in the end one. Link: http://lkml.kernel.org/r/20181019160442.18723-7-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
44532d4c59 |
mm/hmm: use a structure for update callback parameters
Use a structure to gather all the parameters for the update callback. This make it easier when adding new parameters by avoiding having to update all callback function signature. The hmm_update structure is always associated with a mmu_notifier callbacks so we are not planing on grouping multiple updates together. Nor do we care about page size for the range as range will over fully cover the page being invalidated (this is a mmu_notifier property). Link: http://lkml.kernel.org/r/20181019160442.18723-6-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d08faca018 |
mm/hmm: properly handle migration pmd
Before this patch migration pmd entry (!pmd_present()) would have been treated as a bad entry (pmd_bad() returns true on migration pmd entry). The outcome was that device driver would believe that the range covered by the pmd was bad and would either SIGBUS or simply kill all the device's threads (each device driver decide how to react when the device tries to access poisonnous or invalid range of memory). This patch explicitly handle the case of migration pmd entry which are non present pmd entry and either wait for the migration to finish or report empty range (when device is just trying to pre- fill a range of virtual address and thus do not want to wait or trigger page fault). Link: http://lkml.kernel.org/r/20181019160442.18723-5-jglisse@redhat.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
86a2d59841 |
mm/hmm: fix race between hmm_mirror_unregister() and mmu_notifier callback
In hmm_mirror_unregister(), mm->hmm is set to NULL and then mmu_notifier_unregister_no_release() is called. That creates a small window where mmu_notifier can call mmu_notifier_ops with mm->hmm equal to NULL. Fix this by first unregistering mmu notifier callbacks and then setting mm->hmm to NULL. Similarly in hmm_register(), set mm->hmm before registering mmu_notifier callbacks so callback functions always see mm->hmm set. Link: http://lkml.kernel.org/r/20181019160442.18723-4-jglisse@redhat.com Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
aab8d0520e |
mm/rmap: map_pte() was not handling private ZONE_DEVICE page properly
Private ZONE_DEVICE pages use a special pte entry and thus are not present. Properly handle this case in map_pte(), it is already handled in check_pte(), the map_pte() part was lost in some rebase most probably. Without this patch the slow migration path can not migrate back to any private ZONE_DEVICE memory to regular memory. This was found after stress testing migration back to system memory. This ultimatly can lead to the CPU constantly page fault looping on the special swap entry. Link: http://lkml.kernel.org/r/20181019160442.18723-3-jglisse@redhat.com Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f813f21971 |
mm/hmm: fix utf8 ...
Patch series "HMM updates, improvements and fixes", v2 Few fixes that only affect HMM users. Improve the synchronization call back so that we match was other mmu_notifier listener do and add proper support to the new blockable flags in the process. For curious folks here are branches to leverage HMM in various existing device drivers: https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00 More to come (amd gpu, Mellanox, ...) I expect more of the preparatory work for nouveau will be merge in 4.20 (like we have been doing since 4.16) and i will wait until this patchset is upstream before pushing the patches that actualy make use of HMM (to avoid complex tree inter-dependency). This patch (of 6): Somehow utf=8 must have been broken. Link: http://lkml.kernel.org/r/20181019160442.18723-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
eca3654e3c |
vfs: enable remap callers that can handle short operations
Plumb in a remap flag that enables the filesystem remap handler to shorten remapping requests for callers that can handle it. Now copy_file_range can report partial success (in case we run up against alignment problems, resource limits, etc.). We also enable CAN_SHORTEN for fideduperange to maintain existing userspace-visible behavior where xfs/btrfs shorten the dedupe range to avoid stale post-eof data exposure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
|
42ec3d4c02 |
vfs: make remap_file_range functions take and return bytes completed
Change the remap_file_range functions to take a number of bytes to operate upon and return the number of bytes they operated on. This is a requirement for allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. A subsequent patch will enable copy_file_range to signal to the ->clone_file_range implementation that it can handle a short length, which will be returned in the function's return value. For now the short return is not implemented anywhere so the behavior won't change -- either copy_file_range manages to clone the entire range or it tries an alternative. Neither clone ioctl can take advantage of this, alas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
|
3d28193e1d |
vfs: pass remap flags to generic_remap_checks
Pass the same remap flags to generic_remap_checks for consistency. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
|
9fd91a90cb |
vfs: strengthen checking of file range inputs to generic_remap_checks
File range remapping, if allowed to run past the destination file's EOF, is an optimization on a regular file write. Regular file writes that extend the file length are subject to various constraints which are not checked by range cloning. This is a correctness problem because we're never allowed to touch ranges that the page cache can't support (s_maxbytes); we're not supposed to deal with large offsets (MAX_NON_LFS) if O_LARGEFILE isn't set; and we must obey resource limits (RLIMIT_FSIZE). Therefore, add these checks to the new generic_remap_checks function so that we curtail unexpected behavior. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
|
1383a7ed67 |
vfs: check file ranges before cloning files
Move the file range checks from vfs_clone_file_prep into a separate generic_remap_checks function so that all the checks are collected in a central location. This forms the basis for adding more checks from generic_write_checks that will make cloning's input checking more consistent with write input checking. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
|
dad4f140ed |
Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax
Pull XArray conversion from Matthew Wilcox: "The XArray provides an improved interface to the radix tree data structure, providing locking as part of the API, specifying GFP flags at allocation time, eliminating preloading, less re-walking the tree, more efficient iterations and not exposing RCU-protected pointers to its users. This patch set 1. Introduces the XArray implementation 2. Converts the pagecache to use it 3. Converts memremap to use it The page cache is the most complex and important user of the radix tree, so converting it was most important. Converting the memremap code removes the only other user of the multiorder code, which allows us to remove the radix tree code that supported it. I have 40+ followup patches to convert many other users of the radix tree over to the XArray, but I'd like to get this part in first. The other conversions haven't been in linux-next and aren't suitable for applying yet, but you can see them in the xarray-conv branch if you're interested" * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits) radix tree: Remove multiorder support radix tree test: Convert multiorder tests to XArray radix tree tests: Convert item_delete_rcu to XArray radix tree tests: Convert item_kill_tree to XArray radix tree tests: Move item_insert_order radix tree test suite: Remove multiorder benchmarking radix tree test suite: Remove __item_insert memremap: Convert to XArray xarray: Add range store functionality xarray: Move multiorder_check to in-kernel tests xarray: Move multiorder_shrink to kernel tests xarray: Move multiorder account test in-kernel radix tree test suite: Convert iteration test to XArray radix tree test suite: Convert tag_tagged_items to XArray radix tree: Remove radix_tree_clear_tags radix tree: Remove radix_tree_maybe_preload_order radix tree: Remove split/join code radix tree: Remove radix_tree_update_node_t page cache: Finish XArray conversion dax: Convert page fault handlers to XArray ... |
||
|
22146c3ce9 |
hugetlbfs: dirty pages as they are added to pagecache
Some test systems were experiencing negative huge page reserve counts and
incorrect file block counts. This was traced to /proc/sys/vm/drop_caches
removing clean pages from hugetlbfs file pagecaches. When non-hugetlbfs
explicit code removes the pages, the appropriate accounting is not
performed.
This can be recreated as follows:
fallocate -l 2M /dev/hugepages/foo
echo 1 > /proc/sys/vm/drop_caches
fallocate -l 2M /dev/hugepages/foo
grep -i huge /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
HugePages_Total: 2048
HugePages_Free: 2047
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 4194304 kB
ls -lsh /dev/hugepages/foo
4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
To address this issue, dirty pages as they are added to pagecache. This
can easily be reproduced with fallocate as shown above. Read faulted
pages will eventually end up being marked dirty. But there is a window
where they are clean and could be impacted by code such as drop_caches.
So, just dirty them all as they are added to the pagecache.
Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
Fixes:
|
||
|
aa8aa8a331 |
mm: export add_swap_extent()
Btrfs currently does not support swap files because swap's use of bmap does not work with copy-on-write and multiple devices. See |
||
|
bc4ae27d81 |
mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go through the filesystem, and to make swapoff() call ->swap_deactivate(). For Btrfs, we want the latter but not the former, so split this flag into two. This makes us always call ->swap_deactivate() if ->swap_activate() succeeded, not just if it didn't add any swap extents itself. This also resolves the issue of the very misleading name of SWP_FILE, which is only used for swap files over NFS. Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7eef5f97c1 |
mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
There should be no cache left by the time we overwrite the old transhuge pmd with the new one. It's already too late to flush through the virtual address because we already copied the page data to the new physical address. So flush the cache before the data copy. Also delete the "end" variable to shutoff a "unused variable" warning on x86 where flush_cache_range() is a noop. Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7066f0f933 |
mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling migrate_misplaced_transhuge_page(). By the time do_huge_pmd_numa_page() runs some CPU could still access the page through the TLB. change_huge_pmd() before arming the numa/protnone transhuge pmd calls mmu_notifier_invalidate_range_start(). So there's no need of mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end() sequence in migrate_misplaced_transhuge_page() too, because by the time migrate_misplaced_transhuge_page() runs, the pmd mapping has already been invalidated in the secondary MMUs. It has to or if a secondary MMU can still write to the page, the migrate_page_copy() would lose data. However an explicit mmu_notifier_invalidate_range() is needed before migrate_misplaced_transhuge_page() starts copying the data of the transhuge page or the below can happen for MMU notifier users sharing the primary MMU pagetables and only implementing ->invalidate_range: CPU0 CPU1 GPU sharing linux pagetables using only ->invalidate_range ----------- ------------ --------- GPU secondary MMU writes to the page mapped by the transhuge pmd change_pmd_range() mmu..._range_start() ->invalidate_range_start() noop change_huge_pmd() set_pmd_at(numa/protnone) pmd_unlock() do_huge_pmd_numa_page() CPU TLB flush globally (1) CPU cannot write to page migrate_misplaced_transhuge_page() GPU writes to the page... migrate_page_copy() ...GPU stops writing to the page CPU TLB flush (2) mmu..._range_end() (3) ->invalidate_range_stop() noop ->invalidate_range() GPU secondary MMU is invalidated and cannot write to the page anymore (too late) Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives too late, we also need a mmu_notifier_invalidate_range() before calling migrate_misplaced_transhuge_page(), because the ->invalidate_range() in (3) also arrives too late. This requirement is the result of the lazy optimization in change_huge_pmd() that releases the pmd_lock without first flushing the TLB and without first calling mmu_notifier_invalidate_range(). Even converting the removed mmu_notifier_invalidate_range_only_end() into a mmu_notifier_invalidate_range_end() would not have been enough to fix this, because it run after migrate_page_copy(). After the hugepage data copy is done migrate_misplaced_transhuge_page() can proceed and call set_pmd_at without having to flush the TLB nor any secondary MMUs because the secondary MMU invalidate, just like the CPU TLB flush, has to happen before the migrate_page_copy() is called or it would be a bug in the first place (and it was for drivers using ->invalidate_range()). KVM is unaffected because it doesn't implement ->invalidate_range(). The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and uses the generic migrate_pages which transitions the pte from numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs and all mmu notifiers there before copying the page. Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d7c3393413 |
mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
Patch series "migrate_misplaced_transhuge_page race conditions". Aaron found a new instance of the THP MADV_DONTNEED race against pmdp_clear_flush* variants, that was apparently left unfixed. While looking into the race found by Aaron, I may have found two more issues in migrate_misplaced_transhuge_page. These race conditions would not cause kernel instability, but they'd corrupt userland data or leave data non zero after MADV_DONTNEED. I did only minor testing, and I don't expect to be able to reproduce this (especially the lack of ->invalidate_range before migrate_page_copy, requires the latest iommu hardware or infiniband to reproduce). The last patch is noop for x86 and it needs further review from maintainers of archs that implement flush_cache_range() (not in CC yet). To avoid confusion, it's not the first patch that introduces the bug fixed in the second patch, even before removing the pmdp_huge_clear_flush_notify, that _notify suffix was called after migrate_page_copy already run. This patch (of 3): This is a corollary of |
||
|
026d1eaf5e |
mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
The static lock quarantine_lock is used in quarantine.c to protect the quarantine queue datastructures. It is taken inside quarantine queue manipulation routines (quarantine_put(), quarantine_reduce() and quarantine_remove_cache()), with IRQs disabled. This is not a problem on a stock kernel but is problematic on an RT kernel where spin locks are sleeping spinlocks, which can sleep and can not be acquired with disabled interrupts. Convert the quarantine_lock to a raw spinlock_t. The usage of quarantine_lock is confined to quarantine.c and the work performed while the lock is held is used for debug purpose. [bigeasy@linutronix.de: slightly altered the commit message] Link: http://lkml.kernel.org/r/20181010214945.5owshc3mlrh74z4b@linutronix.de Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
df06b37ffe |
mm/gup: cache dev_pagemap while pinning pages
Getting pages from ZONE_DEVICE memory needs to check the backing device's live-ness, which is tracked in the device's dev_pagemap metadata. This metadata is stored in a radix tree and looking it up adds measurable software overhead. This patch avoids repeating this relatively costly operation when dev_pagemap is used by caching the last dev_pagemap while getting user pages. The gup_benchmark kernel self test reports this reduces time to get user pages to as low as 1/3 of the previous time. Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ec393a0f01 |
mm: return zero_resv_unavail optimization
When checking for valid pfns in zero_resv_unavail(), it is not necessary to verify that pfns within pageblock_nr_pages ranges are valid, only the first one needs to be checked. This is because memory for pages are allocated in contiguous chunks that contain pageblock_nr_pages struct pages. Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
907ec5fca3 |
mm: zero remaining unavailable struct pages
Patch series "mm: Fix for movable_node boot option", v3. This patch series contains a fix for the movable_node boot option issue which was introduced by commit |
||
|
714a3a1eba |
mm/gup_benchmark.c: add additional pinning methods
Provide new gup benchmark ioctl commands to run different user page pinning methods, get_user_pages_longterm() and get_user_pages(), in addition to the existing get_user_pages_fast(). Link: http://lkml.kernel.org/r/20181010195605.10689-2-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
26db3d09d9 |
mm/gup_benchmark.c: time put_page()
We'd like to measure time to unpin user pages, so this adds a second benchmark timer on put_page, separate from get_page. Adding the field breaks this ioctl ABI, but should be okay since this an in-tree kernel selftest. [akpm@linux-foundation.org: add expansion to struct gup_benchmark for future use] Link: http://lkml.kernel.org/r/20181010195605.10689-1-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7a1adfddaf |
mm: don't raise MEMCG_OOM event due to failed high-order allocation
It was reported that on some of our machines containers were restarted with OOM symptoms without an obvious reason. Despite there were almost no memory pressure and plenty of page cache, MEMCG_OOM event was raised occasionally, causing the container management software to think, that OOM has happened. However, no tasks have been killed. The following investigation showed that the problem is caused by a failing attempt to charge a high-order page. In such case, the OOM killer is never invoked. As shown below, it can happen under conditions, which are very far from a real OOM: e.g. there is plenty of clean page cache and no memory pressure. There is no sense in raising an OOM event in this case, as it might confuse a user and lead to wrong and excessive actions (e.g. restart the workload, as in my case). Let's look at the charging path in try_charge(). If the memory usage is about memory.max, which is absolutely natural for most memory cgroups, we try to reclaim some pages. Even if we were able to reclaim enough memory for the allocation, the following check can fail due to a race with another concurrent allocation: if (mem_cgroup_margin(mem_over_limit) >= nr_pages) goto retry; For regular pages the following condition will save us from triggering the OOM: if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER)) goto retry; But for high-order allocation this condition will intentionally fail. The reason behind is that we'll likely fall to regular pages anyway, so it's ok and even preferred to return ENOMEM. In this case the idea of raising MEMCG_OOM looks dubious. Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation order check, so that the event won't be raised for high order allocations. This change doesn't affect regular pages allocation and charging. Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
64081362e8 |
mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
We've recently seen a workload on XFS filesystems with a repeatable deadlock between background writeback and a multi-process application doing concurrent writes and fsyncs to a small range of a file. range_cyclic writeback Process 1 Process 2 xfs_vm_writepages write_cache_pages writeback_index = 2 cycled = 0 .... find page 2 dirty lock Page 2 ->writepage page 2 writeback page 2 clean page 2 added to bio no more pages write() locks page 1 dirties page 1 locks page 2 dirties page 1 fsync() .... xfs_vm_writepages write_cache_pages start index 0 find page 1 towrite lock Page 1 ->writepage page 1 writeback page 1 clean page 1 added to bio find page 2 towrite lock Page 2 page 2 is writeback <blocks> write() locks page 1 dirties page 1 fsync() .... xfs_vm_writepages write_cache_pages start index 0 !done && !cycled sets index to 0, restarts lookup find page 1 dirty find page 1 towrite lock Page 1 page 1 is writeback <blocks> lock Page 1 <blocks> DEADLOCK because: - process 1 needs page 2 writeback to complete to make enough progress to issue IO pending for page 1 - writeback needs page 1 writeback to complete so process 2 can progress and unlock the page it is blocked on, then it can issue the IO pending for page 2 - process 2 can't make progress until process 1 issues IO for page 1 The underlying cause of the problem here is that range_cyclic writeback is processing pages in descending index order as we hold higher index pages in a structure controlled from above write_cache_pages(). The write_cache_pages() caller needs to be able to submit these pages for IO before write_cache_pages restarts writeback at mapping index 0 to avoid wcp inverting the page lock/writeback wait order. generic_writepages() is not susceptible to this bug as it has no private context held across write_cache_pages() - filesystems using this infrastructure always submit pages in ->writepage immediately and so there is no problem with range_cyclic going back to mapping index 0. However: mpage_writepages() has a private bio context, exofs_writepages() has page_collect fuse_writepages() has fuse_fill_wb_data nfs_writepages() has nfs_pageio_descriptor xfs_vm_writepages() has xfs_writepage_ctx All of these ->writepages implementations can hold pages under writeback in their private structures until write_cache_pages() returns, and hence they are all susceptible to this deadlock. Also worth noting is that ext4 has it's own bastardised version of write_cache_pages() and so it /may/ have an equivalent deadlock. I looked at the code long enough to understand that it has a similar retry loop for range_cyclic writeback reaching the end of the file and then promptly ran away before my eyes bled too much. I'll leave it for the ext4 developers to determine if their code is actually has this deadlock and how to fix it if it has. There's a few ways I can see avoid this deadlock. There's probably more, but these are the first I've though of: 1. get rid of range_cyclic altogether 2. range_cyclic always stops at EOF, and we start again from writeback index 0 on the next call into write_cache_pages() 2a. wcp also returns EAGAIN to ->writepages implementations to indicate range cyclic has hit EOF. writepages implementations can then flush the current context and call wpc again to continue. i.e. lift the retry into the ->writepages implementation 3. range_cyclic uses trylock_page() rather than lock_page(), and it skips pages it can't lock without blocking. It will already do this for pages under writeback, so this seems like a no-brainer 3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking as per pages under writeback. I don't think #1 is an option - range_cyclic prevents frequently dirtied lower file offset from starving background writeback of rarely touched higher file offsets. #2 is simple, and I don't think it will have any impact on performance as going back to the start of the file implies an immediate seek. We'll have exactly the same number of seeks if we switch writeback to another inode, and then come back to this one later and restart from index 0. #2a is pretty much "status quo without the deadlock". Moving the retry loop up into the wcp caller means we can issue IO on the pending pages before calling wcp again, and so avoid locking or waiting on pages in the wrong order. I'm not convinced we need to do this given that we get the same thing from #2 on the next writeback call from the writeback infrastructure. #3 is really just a band-aid - it doesn't fix the access/wait inversion problem, just prevents it from becoming a deadlock situation. I'd prefer we fix the inversion, not sweep it under the carpet like this. #3a is really an optimisation that just so happens to include the band-aid fix of #3. So it seems that the simplest way to fix this issue is to implement solution #2 Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.de> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a9a9e77fbf |
mm: move mirrored memory specific code outside of memmap_init_zone
memmap_init_zone, is getting complex, because it is called from different contexts: hotplug, and during boot, and also because it must handle some architecture quirks. One of them is mirrored memory. Move the code that decides whether to skip mirrored memory outside of memmap_init_zone, into a separate function. [pasha.tatashin@oracle.com: uninline overlap_memmap_init()] Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d3035be4ce |
mm: calculate deferred pages after skipping mirrored memory
update_defer_init() should be called only when struct page is about to be initialized. Because it counts number of initialized struct pages, but there we may skip struct pages if there is some mirrored memory. So move, update_defer_init() after checking for mirrored memory. Also, rename update_defer_init() to defer_init() and reverse the return boolean to emphasize that this is a boolean function, that tells that the reset of memmap initialization should be deferred. Make this function self-contained: do not pass number of already initialized pages in this zone by using static counters. I found this bug by reading the code. The effect is that fewer than expected struct pages are initialized early in boot, and it is possible that in some corner cases we may fail to boot when mirrored pages are used. The deferred on demand code should somewhat mitigate this. But this still brings some inconsistencies compared to when booting without mirrored pages, so it is better to fix. [pasha.tatashin@oracle.com: add comment about defer_init's lack of locking] Link: http://lkml.kernel.org/r/20180726193509.3326-3-pasha.tatashin@oracle.com [akpm@linux-foundation.org: make defer_init non-inline, __meminit] Link: http://lkml.kernel.org/r/20180724235520.10200-3-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
dfb3ccd00a |
mm: make memmap_init a proper function
memmap_init is sometimes a macro sometimes a function based on __HAVE_ARCH_MEMMAP_INIT. It is only a function on ia64. Make memmap_init a weak function instead, and let ia64 redefine it. Link: http://lkml.kernel.org/r/20180724235520.10200-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1c2d479a11 |
mm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t type
This will allow to use generic refcount_t interfaces to check counters overflow instead of currently existing VM_BUG_ON(). The only difference after the patch is VM_BUG_ON() may cause BUG(), while refcount_t fires with WARN(). But this seems not to be significant here, since such the problems are usually caught by syzbot with panic-on-warn enabled. Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4a222127f3 |
mm/page_alloc.c: initialize num_movable in move_freepages()
If move_freepages_block() returns 0 because !zone_spans_pfn(), *num_movable can hold the value from the stack because it does not get initialized in move_freepages(). Move the initialization to move_freepages_block() to guarantee the value actually makes sense. This currently doesn't affect its only caller where num_movable != NULL, so no bug fix, but just more robust. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1810051355490.212229@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
61855f021c |
mm/zsmalloc.c: fix fall-through annotation
Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough Link: http://lkml.kernel.org/r/20181003105114.GA24423@embeddedor.com Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f0ecf25a09 |
mm/vmstat.c: assert that vmstat_text is in sync with stat_items_size
Having two gigantic arrays that must manually be kept in sync, including ifdefs, isn't exactly robust. To make it easier to catch such issues in the future, add a BUILD_BUG_ON(). Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
ff09d7ec97 |
mm/memory.c: recheck page table entry with page table lock held
We clear the pte temporarily during read/modify/write update of the pte. If we take a page fault while the pte is cleared, the application can get SIGBUS. One such case is with remap_pfn_range without a backing vm_ops->fault callback. do_fault will return SIGBUS in that case. cpu 0 cpu1 mprotect() ptep_modify_prot_start()/pte cleared. . . page fault. . . prep_modify_prot_commit() Fix this by taking page table lock and rechecking for pte_none. [aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run] Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ido Schimmel <idosch@idosch.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9bc8039e71 |
mm: brk: downgrade mmap_sem to read when shrinking
brk might be used to shrink memory mapping too other than munmap(). So, it may hold write mmap_sem for long time when shrinking large mapping, as what commit ("mm: mmap: zap pages with read mmap_sem in munmap") described. The brk() will not manipulate vmas anymore after __do_munmap() call for the mapping shrink use case. But, it may set mm->brk after __do_munmap(), which needs hold write mmap_sem. However, a simple trick can workaround this by setting mm->brk before __do_munmap(). Then restore the original value if __do_munmap() fails. With this trick, it is safe to downgrade to read mmap_sem. So, the same optimization, which downgrades mmap_sem to read for zapping pages, is also feasible and reasonable to this case. The period of holding exclusive mmap_sem for shrinking large mapping would be reduced significantly with this optimization. [akpm@linux-foundation.org: tweak comment] [yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue] Link: http://lkml.kernel.org/r/1538687672-17795-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1538067582-60038-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
85a06835f6 |
mm: mremap: downgrade mmap_sem to read when shrinking
Other than munmap, mremap might be used to shrink memory mapping too. So, it may hold write mmap_sem for long time when shrinking large mapping, as what commit ("mm: mmap: zap pages with read mmap_sem in munmap") described. The mremap() will not manipulate vmas anymore after __do_munmap() call for the mapping shrink use case, so it is safe to downgrade to read mmap_sem. So, the same optimization, which downgrades mmap_sem to read for zapping pages, is also feasible and reasonable to this case. The period of holding exclusive mmap_sem for shrinking large mapping would be reduced significantly with this optimization. MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adopt this optimization since they need manipulate vmas after do_munmap(), downgrading mmap_sem may create race window. Simple mapping shrink is the low hanging fruit, and it may cover the most cases of unmap with munmap together. [akpm@linux-foundation.org: tweak comment] [yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue] Link: http://lkml.kernel.org/r/1538687672-17795-2-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1538067582-60038-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
3c0513243a |
mm/filemap.c: use vmf_error()
These codes can be replaced with new inline vmf_error(). Link: http://lkml.kernel.org/r/20180927171411.GA23331@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d4faa40259 |
mm: remove unnecessary local variable addr in __get_user_pages_fast()
The local variable `addr' in __get_user_pages_fast() is just a shadow of `start'. Since `start' never changes after assignment to `addr', it is fine to replace `start' with it. Also the meaning of [start, end] is more obvious than [addr, end] when passed to gup_pgd_range(). Link: http://lkml.kernel.org/r/20180925021448.20265-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
966cf44f63 |
mm: defer ZONE_DEVICE page initialization to the point where we init pgmap
The ZONE_DEVICE pages were being initialized in two locations. One was with the memory_hotplug lock held and another was outside of that lock. The problem with this is that it was nearly doubling the memory initialization time. Instead of doing this twice, once while holding a global lock and once without, I am opting to defer the initialization to the one outside of the lock. This allows us to avoid serializing the overhead for memory init and we can instead focus on per-node init times. One issue I encountered is that devm_memremap_pages and hmm_devmmem_pages_create were initializing only the pgmap field the same way. One wasn't initializing hmm_data, and the other was initializing it to a poison value. Since this is something that is exposed to the driver in the case of hmm I am opting for a third option and just initializing hmm_data to 0 since this is going to be exposed to unknown third party drivers. [alexander.h.duyck@linux.intel.com: fix reference count for pgmap in devm_memremap_pages] Link: http://lkml.kernel.org/r/20181008233404.1909.37302.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/20180925202053.3576.66039.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
d483da5bc7 |
mm: create non-atomic version of SetPageReserved for init use
It doesn't make much sense to use the atomic SetPageReserved at init time
when we are using memset to clear the memory and manipulating the page
flags via simple "&=" and "|=" operations in __init_single_page.
This patch adds a non-atomic version __SetPageReserved that can be used
during page init and shows about a 10% improvement in initialization times
on the systems I have available for testing. On those systems I saw
initialization times drop from around 35 seconds to around 32 seconds to
initialize a 3TB block of persistent memory. I believe the main advantage
of this is that it allows for more compiler optimization as the __set_bit
operation can be reordered whereas the atomic version cannot.
I tried adding a bit of documentation based on
|
||
|
f682a97a00 |
mm: provide kernel parameter to allow disabling page init poisoning
Patch series "Address issues slowing persistent memory initialization", v5. The main thing this patch set achieves is that it allows us to initialize each node worth of persistent memory independently. As a result we reduce page init time by about 2 minutes because instead of taking 30 to 40 seconds per node and going through each node one at a time, we process all 4 nodes in parallel in the case of a 12TB persistent memory setup spread evenly over 4 nodes. This patch (of 3): On systems with a large amount of memory it can take a significant amount of time to initialize all of the page structs with the PAGE_POISON_PATTERN value. I have seen it take over 2 minutes to initialize a system with over 12TB of RAM. In order to work around the issue I had to disable CONFIG_DEBUG_VM and then the boot time returned to something much more reasonable as the arch_add_memory call completed in milliseconds versus seconds. However in doing that I had to disable all of the other VM debugging on the system. In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on a system that has a large amount of memory I have added a new kernel parameter named "vm_debug" that can be set to "-" in order to disable it. Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
85cfb24506 |
memcg: remove memcg_kmem_skip_account
The flag memcg_kmem_skip_account was added during the era of opt-out kmem accounting. There is no need for such flag in the opt-in world as there aren't any __GFP_ACCOUNT allocations within memcg_create_cache_enqueue(). Link: http://lkml.kernel.org/r/20180919004501.178023-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
86b27beae5 |
mm/memory_hotplug.c: clean up node_states_check_changes_offline()
This patch, as the previous one, gets rid of the wrong if statements. While at it, I realized that the comments are sometimes very confusing, to say the least, and wrong. For example: ___ zone_last = ZONE_MOVABLE; /* * check whether node_states[N_HIGH_MEMORY] will be changed * If we try to offline the last present @nr_pages from the node, * we can determind we will need to clear the node from * node_states[N_HIGH_MEMORY]. */ for (; zt <= zone_last; zt++) present_pages += pgdat->node_zones[zt].present_pages; if (nr_pages >= present_pages) arg->status_change_nid = zone_to_nid(zone); else arg->status_change_nid = -1; ___ In case the node gets empry, it must be removed from N_MEMORY. We already check N_HIGH_MEMORY a bit above within the CONFIG_HIGHMEM ifdef code. Not to say that status_change_nid is for N_MEMORY, and not for N_HIGH_MEMORY. So I re-wrote some of the comments to what I think is better. [osalvador@suse.de: address feedback from Pavel] Link: http://lkml.kernel.org/r/20180921132634.10103-5-osalvador@techadventures.net Link: http://lkml.kernel.org/r/20180919100819.25518-6-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: <yasu.isimatu@gmail.com> Cc: Mathieu Malaterre <malat@debian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
8efe33f40f |
mm/memory_hotplug.c: simplify node_states_check_changes_online
While looking at node_states_check_changes_online, I stumbled upon some confusing things. Right after entering the function, we find this: if (N_MEMORY == N_NORMAL_MEMORY) zone_last = ZONE_MOVABLE; This is wrong. N_MEMORY cannot really be equal to N_NORMAL_MEMORY. My guess is that this wanted to be something like: if (N_NORMAL_MEMORY == N_HIGH_MEMORY) to check if we have CONFIG_HIGHMEM. Later on, in the CONFIG_HIGHMEM block, we have: if (N_MEMORY == N_HIGH_MEMORY) zone_last = ZONE_MOVABLE; Again, this is wrong, and will never be evaluated to true. Besides removing these wrong if statements, I simplified the function a bit. [osalvador@suse.de: address feedback from Pavel] Link: http://lkml.kernel.org/r/20180921132634.10103-4-osalvador@techadventures.net Link: http://lkml.kernel.org/r/20180919100819.25518-5-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Mathieu Malaterre <malat@debian.org> Cc: Michal Hocko <mhocko@suse.com> Cc: <yasu.isimatu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
cf01f6f5e3 |
mm/memory_hotplug.c: tidy up node_states_clear_node()
node_states_clear has the following if statements: if ((N_MEMORY != N_NORMAL_MEMORY) && (arg->status_change_nid_high >= 0)) ... if ((N_MEMORY != N_HIGH_MEMORY) && (arg->status_change_nid >= 0)) ... N_MEMORY can never be equal to neither N_NORMAL_MEMORY nor N_HIGH_MEMORY. Similar problem was found in [1]. Since this is wrong, let us get rid of it. [1] https://patchwork.kernel.org/patch/10579155/ Link: http://lkml.kernel.org/r/20180919100819.25518-4-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Mathieu Malaterre <malat@debian.org> Cc: Michal Hocko <mhocko@suse.com> Cc: <yasu.isimatu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
83d83612d7 |
mm/memory_hotplug.c: spare unnecessary calls to node_set_state
In node_states_check_changes_online, we check if the node will have to be set for any of the N_*_MEMORY states after the pages have been onlined. Later on, we perform the activation in node_states_set_node. Currently, in node_states_set_node we set the node to N_MEMORY unconditionally. This means that we call node_set_state for N_MEMORY every time pages go online, but we only need to do it if the node has not yet been set for N_MEMORY. Fix this by checking status_change_nid. Link: http://lkml.kernel.org/r/20180919100819.25518-2-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: <yasu.isimatu@gmail.com> Cc: Mathieu Malaterre <malat@debian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
3cb7b121ff |
mm/filemap.c: Use existing variable
Use the variable write_len instead of ov_iter_count(from). Link: http://lkml.kernel.org/r/1537375855-2088-1-git-send-email-leviathan0992@gmail.com Signed-off-by: haiqing.shq <leviathan0992@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
cb4922496a |
mm: unmap VM_PFNMAP mappings with optimized path
When unmapping VM_PFNMAP mappings, vm flags need to be updated. Since the vmas have been detached, so it sounds safe to update vm flags with read mmap_sem. Link: http://lkml.kernel.org/r/1537376621-51150-4-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b4cefb3605 |
mm: unmap VM_HUGETLB mappings with optimized path
When unmapping VM_HUGETLB mappings, vm flags need to be updated. Since the vmas have been detached, so it sounds safe to update vm flags with read mmap_sem. Link: http://lkml.kernel.org/r/1537376621-51150-3-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
dd2283f260 |
mm: mmap: zap pages with read mmap_sem in munmap
Patch series "mm: zap pages with read mmap_sem in munmap for large mapping", v11. Background: Recently, when we ran some vm scalability tests on machines with large memory, we ran into a couple of mmap_sem scalability issues when unmapping large memory space, please refer to https://lkml.org/lkml/2017/12/14/733 and https://lkml.org/lkml/2018/2/20/576. History: Then akpm suggested to unmap large mapping section by section and drop mmap_sem at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784). V1 patch series was submitted to the mailing list per Andrew's suggestion (see https://lkml.org/lkml/2018/3/20/786). Then I received a lot great feedback and suggestions. Then this topic was discussed on LSFMM summit 2018. In the summit, Michal Hocko suggested (also in the v1 patches review) to try "two phases" approach. Zapping pages with read mmap_sem, then doing via cleanup with write mmap_sem (for discussion detail, see https://lwn.net/Articles/753269/) Approach: Zapping pages is the most time consuming part, according to the suggestion from Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas. But, we can't call MADV_DONTNEED directly, since there are two major drawbacks: * The unexpected state from PF if it wins the race in the middle of munmap. It may return zero page, instead of the content or SIGSEGV. * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which is a showstopper from akpm But, some part may need write mmap_sem, for example, vma splitting. So, the design is as follows: acquire write mmap_sem lookup vmas (find and split vmas) deal with special mappings detach vmas downgrade_write zap pages free page tables release mmap_sem The vm events with read mmap_sem may come in during page zapping, but since vmas have been detached before, they, i.e. page fault, gup, etc, will not be able to find valid vma, then just return SIGSEGV or -EFAULT as expected. If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special mappings. They will be handled by falling back to regular do_munmap() with exclusive mmap_sem held in this patch since they may update vm flags. But, with the "detach vmas first" approach, the vmas have been detached when vm flags are updated, so it sounds safe to update vm flags with read mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be handled by using the optimized path in the following separate patches for bisectable sake. Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES). However it is fine to have false-positive MMF_RECALC_UPROBES according to uprobes developer. So, uprobe unmap will not be handled by the regular path. With the "detach vmas first" approach we don't have to re-acquire mmap_sem again to clean up vmas to avoid race window which might get the address space changed since downgrade_write() doesn't release the lock to lead regression, which simply downgrades to read lock. And, since the lock acquire/release cost is managed to the minimum and almost as same as before, the optimization could be extended to any size of mapping without incurring significant penalty to small mappings. For the time being, just do this in munmap syscall path. Other vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain intact due to some implementation difficulties since they acquire write mmap_sem from very beginning and hold it until the end, do_munmap() might be called in the middle. But, the optimized do_munmap would like to be called without mmap_sem held so that we can do the optimization. So, if we want to do the similar optimization for mmap/mremap path, I'm afraid we would have to redesign them. mremap might be called on very large area depending on the usecases, the optimization to it will be considered in the future. This patch (of 3): When running some mmap/munmap scalability tests with large memory (i.e. > 300GB), the below hung task issue may happen occasionally. INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 Call Trace: [<ffffffff817154d0>] ? __schedule+0x250/0x730 [<ffffffff817159e6>] schedule+0x36/0x80 [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150 [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30 [<ffffffff81717db0>] down_read+0x20/0x40 [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0 [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100 [<ffffffff81241d87>] __vfs_read+0x37/0x150 [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0 [<ffffffff81242266>] vfs_read+0x96/0x130 [<ffffffff812437b5>] SyS_read+0x55/0xc0 [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5 It is because munmap holds mmap_sem exclusively from very beginning to all the way down to the end, and doesn't release it in the middle. When unmapping large mapping, it may take long time (take ~18 seconds to unmap 320GB mapping with every single page mapped on an idle machine). Zapping pages is the most time consuming part, according to the suggestion from Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas. But, some part may need write mmap_sem, for example, vma splitting. So, the design is as follows: acquire write mmap_sem lookup vmas (find and split vmas) deal with special mappings detach vmas downgrade_write zap pages free page tables release mmap_sem The vm events with read mmap_sem may come in during page zapping, but since vmas have been detached before, they, i.e. page fault, gup, etc, will not be able to find valid vma, then just return SIGSEGV or -EFAULT as expected. If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special mappings. They will be handled by without downgrading mmap_sem in this patch since they may update vm flags. But, with the "detach vmas first" approach, the vmas have been detached when vm flags are updated, so it sounds safe to update vm flags with read mmap_sem for this specific case. So, VM_HUGETLB and VM_PFNMAP will be handled by using the optimized path in the following separate patches for bisectable sake. Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES). However it is fine to have false-positive MMF_RECALC_UPROBES according to uprobes developer. With the "detach vmas first" approach we don't have to re-acquire mmap_sem again to clean up vmas to avoid race window which might get the address space changed since downgrade_write() doesn't release the lock to lead regression, which simply downgrades to read lock. And, since the lock acquire/release cost is managed to the minimum and almost as same as before, the optimization could be extended to any size of mapping without incurring significant penalty to small mappings. For the time being, just do this in munmap syscall path. Other vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain intact due to some implementation difficulties since they acquire write mmap_sem from very beginning and hold it until the end, do_munmap() might be called in the middle. But, the optimized do_munmap would like to be called without mmap_sem held so that we can do the optimization. So, if we want to do the similar optimization for mmap/mremap path, I'm afraid we would have to redesign them. mremap might be called on very large area depending on the usecases, the optimization to it will be considered in the future. With the patches, exclusive mmap_sem hold time when munmap a 80GB address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level from second. munmap_test-15002 [008] 594.380138: funcgraph_entry: | __vm_munmap() { munmap_test-15002 [008] 594.380146: funcgraph_entry: !2485684 us | unmap_region(); munmap_test-15002 [008] 596.865836: funcgraph_exit: !2485692 us | } Here the execution time of unmap_region() is used to evaluate the time of holding read mmap_sem, then the remaining time is used with holding exclusive lock. [1] https://lwn.net/Articles/753269/ Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>Suggested-by: Michal Hocko <mhocko@kernel.org> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
a8dda165ec |
vfree: add debug might_sleep()
Add might_sleep() call to vfree() to catch potential sleep-in-atomic bugs earlier. [aryabinin@virtuozzo.com: drop might_sleep_if() from kvfree()] Link: http://lkml.kernel.org/r/7e19e4df-b1a6-29bd-9ae7-0266d50bef1d@virtuozzo.com Link: http://lkml.kernel.org/r/20180914130512.10394-3-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
3ca4ea3a7a |
mm/vmalloc.c: improve vfree() kerneldoc
vfree() might sleep if called not in interrupt context. Explain that in the comment. Link: http://lkml.kernel.org/r/20180914130512.10394-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
52414d3302 |
kvfree(): fix misleading comment
vfree() might sleep if called not in interrupt context. So does kvfree()
too. Fix misleading kvfree()'s comment about allowed context.
Link: http://lkml.kernel.org/r/20180914130512.10394-1-aryabinin@virtuozzo.com
Fixes:
|
||
|
dedf2c73b8 |
mm/mempolicy.c: use match_string() helper to simplify the code
match_string() returns the index of an array for a matching string, which can be used intead of open coded implementation. Link: http://lkml.kernel.org/r/1536988365-50310-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
c3df29d130 |
mm/swap.c: remove duplicated include
Remove duplicated include linux/memremap.h Link: http://lkml.kernel.org/r/20180917131308.16420-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
2c029a1ea3 |
mm, page_alloc: drop should_suppress_show_mem
should_suppress_show_mem() was introduced to reduce the overhead of
show_mem on large NUMA systems. Things have changed since then though.
Namely
|
||
|
e9b257ed15 |
mm/memcontrol.c: fix memory.stat item ordering
The refault stats go better with the page fault stats, and are of higher interest than the stats on LRU operations. In fact they used to be grouped together; when the LRU operation stats were added later on, they were wedged in between. Move them back together. Documentation/admin-guide/cgroup-v2.rst already lists them in the right order. Link: http://lkml.kernel.org/r/20181010140239.GA2527@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
4b85afbdac |
mm: zero-seek shrinkers
The page cache and most shrinkable slab caches hold data that has been read from disk, but there are some caches that only cache CPU work, such as the dentry and inode caches of procfs and sysfs, as well as the subset of radix tree nodes that track non-resident page cache. Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for the shrinker's seeks setting tells the reclaim algorithm that for every two page cache pages scanned it should scan one slab object. This is a bogus setting. A virtual inode that required no IO to create is not twice as valuable as a page cache page; shadow cache entries with eviction distances beyond the size of memory aren't either. In most cases, the behavior in practice is still fine. Such virtual caches don't tend to grow and assert themselves aggressively, and usually get picked up before they cause problems. But there are scenarios where that's not true. Our database workloads suffer from two of those. For one, their file workingset is several times bigger than available memory, which has the kernel aggressively create shadow page cache entries for the non-resident parts of it. The workingset code does tell the VM that most of these are expendable, but the VM ends up balancing them 2:1 to cache pages as per the seeks setting. This is a huge waste of memory. These workloads also deal with tens of thousands of open files and use /proc for introspection, which ends up growing the proc_inode_cache to absurdly large sizes - again at the cost of valuable cache space, which isn't a reasonable trade-off, given that proc inodes can be re-created without involving the disk. This patch implements a "zero-seek" setting for shrinkers that results in a target ratio of 0:1 between their objects and IO-backed caches. This allows such virtual caches to grow when memory is available (they do cache/avoid CPU work after all), but effectively disables them as soon as IO-backed objects are under pressure. It then switches the shrinkers for procfs and sysfs metadata, as well as excess page cache shadow nodes, to the new zero-seek setting. Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Domas Mituzas <dmituzas@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
68d48e6a2d |
mm: workingset: add vmstat counter for shadow nodes
Make it easier to catch bugs in the shadow node shrinker by adding a counter for the shadow nodes in circulation. [akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()] [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes] Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
505802a535 |
mm: workingset: use cheaper __inc_lruvec_state in irqsafe node reclaim
No need to use the preemption-safe lruvec state function inside the reclaim region that has irqs disabled. Link: http://lkml.kernel.org/r/20181009184732.762-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
eb414681d5 |
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays: cpu: some tasks are runnable but not executing on a CPU memory: tasks are reclaiming, or waiting for swapin or thrashing cache io: tasks are waiting for io completions These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage). [hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b1d29ba82c |
delayacct: track delays from thrashing cache pages
Delay accounting already measures the time a task spends in direct reclaim and waiting for swapin, but in low memory situations tasks spend can spend a significant amount of their time waiting on thrashing page cache. This isn't tracked right now. To know the full impact of memory contention on an individual task, measure the delay when waiting for a recently evicted active cache page to read back into memory. Also update tools/accounting/getdelays.c: [hannes@computer accounting]$ sudo ./getdelays -d -p 1 print delayacct stats ON PID 1 CPU count real total virtual total delay total delay average 50318 745000000 847346785 400533713 0.008ms IO count delay total delay average 435 122601218 0ms SWAP count delay total delay average 0 0 0ms RECLAIM count delay total delay average 0 0 0ms THRASHING count delay total delay average 19 12621439 0ms Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
1899ad18c6 |
mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset. During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing. Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing. How many page->flags does this leave us with on 32-bit? 20 bits are always page flags 21 if you have an MMU 23 with the zone bits for DMA, Normal, HighMem, Movable 29 with the sparsemem section bits 30 if PAE is enabled 31 with this patch. So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits. Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
95f9ab2d59 |
mm: workingset: don't drop refault information prematurely
Patch series "psi: pressure stall information for CPU, memory, and IO", v4. Overview PSI reports the overall wallclock time in which the tasks in a system (or cgroup) wait for (contended) hardware resources. This helps users understand the resource pressure their workloads are under, which allows them to rootcause and fix throughput and latency problems caused by overcommitting, underprovisioning, suboptimal job placement in a grid; as well as anticipate major disruptions like OOM. Real-world applications We're using the data collected by PSI (and its previous incarnation, memdelay) quite extensively at Facebook, and with several success stories. One usecase is avoiding OOM hangs/livelocks. The reason these happen is because the OOM killer is triggered by reclaim not being able to free pages, but with fast flash devices there is *always* some clean and uptodate cache to reclaim; the OOM killer never kicks in, even as tasks spend 90% of the time thrashing the cache pages of their own executables. There is no situation where this ever makes sense in practice. We wrote a <100 line POC python script to monitor memory pressure and kill stuff way before such pathological thrashing leads to full system losses that would require forcible hard resets. We've since extended and deployed this code into other places to guarantee latency and throughput SLAs, since they're usually violated way before the kernel OOM killer would ever kick in. It is available here: https://github.com/facebookincubator/oomd Eventually we probably want to trigger the in-kernel OOM killer based on extreme sustained pressure as well, so that Linux can avoid memory livelocks - which technically aren't deadlocks, but to the user indistinguishable from them - out of the box. We'd continue using OOMD as the first line of defense to ensure workload health and implement complex kill policies that are beyond the scope of the kernel. We also use PSI memory pressure for loadshedding. Our batch job infrastructure used to use heuristics based on various VM stats to anticipate OOM situations, with lackluster success. We switched it to PSI and managed to anticipate and avoid OOM kills and lockups fairly reliably. The reduction of OOM outages in the worker pool raised the pool's aggregate productivity, and we were able to switch that service to smaller machines. Lastly, we use cgroups to isolate a machine's main workload from maintenance crap like package upgrades, logging, configuration, as well as to prevent multiple workloads on a machine from stepping on each others' toes. We were not able to configure this properly without the pressure metrics; we would see latency or bandwidth drops, but it would often be hard to impossible to rootcause it post-mortem. We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling. PSI has also received testing, feedback, and feature requests from Android and EndlessOS for the purpose of low-latency OOM killing, to intervene in pressure situations before the UI starts hanging. How do you use this feature? A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have cpu.pressure, memory.pressure and io.pressure files, which simply aggregate task stalls at the cgroup level instead of system-wide. The cpu file contains one line: some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722 The averages give the percentage of walltime in which one or more tasks are delayed on the runqueue while another task has the CPU. They're recent averages over 10s, 1m, 5m windows, so you can tell short term trends from long term ones, similarly to the load average. The total= value gives the absolute stall time in microseconds. This allows detecting latency spikes that might be too short to sway the running averages. It also allows custom time averaging in case the 10s/1m/5m windows aren't adequate for the usecase (or are too coarse with future hardware). What to make of this "some" metric? If CPU utilization is at 100% and CPU pressure is 0, it means the system is perfectly utilized, with one runnable thread per CPU and nobody waiting. At two or more runnable tasks per CPU, the system is 100% overcommitted and the pressure average will indicate as much. From a utilization perspective this is a great state of course: no CPU cycles are being wasted, even when 50% of the threads were to go idle (as most workloads do vary). From the perspective of the individual job it's not great, however, and they would do better with more resources. Depending on what your priority and options are, raised "some" numbers may or may not require action. The memory file contains two lines: some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828 full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258 The some line is the same as for cpu, the time in which at least one task is stalled on the resource. In the case of memory, this includes waiting on swap-in, page cache refaults and page reclaim. The full line, however, indicates time in which *nobody* is using the CPU productively due to pressure: all non-idle tasks are waiting for memory in one form or another. Significant time spent in there is a good trigger for killing things, moving jobs to other machines, or dropping incoming requests, since neither the jobs nor the machine overall are making too much headway. The io file is similar to memory. Because the block layer doesn't have a concept of hardware contention right now (how much longer is my IO request taking due to other tasks?), it reports CPU potential lost on all IO delays, not just the potential lost due to competition. FAQ Q: How is PSI's CPU component different from the load average? A: There are several quirks in the load average that make it hard to impossible to tell how overcommitted the CPU really is. 1. The load average is reported as a raw number of active tasks. You need to know how many CPUs there are in the system, how many CPUs the workload is allowed to use, then think about what the proportion between load and the number of CPUs mean for the tasks trying to run. PSI reports the percentage of wallclock time in which tasks are waiting for a CPU to run on. It doesn't matter how many CPUs are present or usable. The number always tells the quality of life of tasks in the system or in a particular cgroup. 2. The shortest averaging window is 1m, which is extremely coarse, and it's sampled in 5s intervals. A *lot* can happen on a CPU in 5 seconds. This *may* be able to identify persistent long-term trends and very clear and obvious overloads, but it's unusable for latency spikes and more subtle overutilization. PSI's shortest window is 10s. It also exports the cumulative stall times (in microseconds) of synchronously recorded events. 3. On Linux, the load average for historical reasons includes all TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how busy the system is, but on the flipside it doesn't distinguish whether tasks are likely to contend over the CPU or IO - which obviously requires very different interventions from a sys admin or a job scheduler. PSI reports independent metrics for CPU and IO. You can tell which resource is making the tasks wait, but in conjunction still see how overloaded the system is overall. Q: What's the cost / performance impact of this feature? A: PSI's primary cost is in the scheduler, in particular task wakeups and sleeps. I benchmarked this code using Facebook's two most scheduling sensitive workloads: memcache and webserver. They handle a ton of small requests - lots of wakeups and sleeps with little actual work in between - so they tend to be canaries for scheduler regressions. In the tests, the boxes were handling live traffic over the course of several hours. Half the machines, the control, ran with CONFIG_PSI=n. For memcache I used eight machines total. They're 2-socket, 14 core, 56 thread boxes. The test runs for half the test period, flips the test and control kernels on the hardware to rule out HW factors, DC location etc., then runs the other half of the test. For the webservers, I used 32 machines total. They're single socket, 16 core, 32 thread machines. During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the first half and nopsi=77.52% psi=78.25%, so PSI added between 0.7 and 0.9 percentage points to the CPU load, a difference of about 1%. UPDATE: I re-ran this test with the v3 version of this patch set and the CPU utilization was equivalent between test and control. UPDATE: v4 is on par with v3. As far as end-to-end request latency from the client perspective goes, we don't sample those finely enough to capture the requests going to those particular machines during the test, but we know the p50 turnaround time in this workload is 54us, and perf bench sched pipe on those machines show nopsi=5.232666 us/op and psi=5.587347 us/op, so this doesn't add much here either. The profile for the pipe benchmark shows: 0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change 0.83% perf.real [kernel.vmlinux] [k] psi_group_change 0.82% perf.real [kernel.vmlinux] [k] psi_task_change 0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change The webserver load is running inside 4 nested cgroup levels. The CPU load with both nopsi and psi kernels was indistinguishable at 81%. For comparison, we had to disable the cgroup cpu controller on the webservers because it added 4 percentage points to the CPU% during this same exact test. Versions of this accounting code now run on 80% of our fleet. None of our workloads have reported regressions during the rollout. Daniel Drake said: : I just retested the latest version at : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results : are great. : : Test setup: : Endless OS : GeminiLake N4200 low end laptop : 2GB RAM : swap (and zram swap) disabled : : Baseline test: open a handful of large-ish apps and several website : tabs in Google Chrome. : : Results: after a couple of minutes, system is excessively thrashing, mouse : cursor can barely be moved, UI is not responding to mouse clicks, so it's : impractical to recover from this situation as an ordinary user : : Add my simple killer: : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd : : Results: when the thrashing causes the UI to become sluggish, the killer : steps in and kills something (usually a chrome tab), and the system : remains usable. I repeatedly opened more apps and more websites over a 15 : minute period but I wasn't able to get the system to a point of UI : unresponsiveness. Suren said: : Backported to 4.9 and retested on ARMv8 8 code system running Android. : Signals behave as expected reacting to memory pressure, no jumps in : "total" counters that would indicate an overflow/underflow issues. Nicely : done! This patch (of 9): If we keep just enough refault information to match the *current* page cache during reclaim time, we could lose a lot of events when there is only a temporary spike in non-cache memory consumption that pushes out all the cache. Once cache comes back, we won't see those refaults. They might not be actionable for LRU aging, but we want to know about them for measuring memory pressure. [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters] Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <jweiner@fb.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Christopher Lameter <cl@linux.com> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
f0d7787414 |
mm, slab: shorten kmalloc cache names for large sizes
Kmalloc cache names can get quite long for large object sizes, when the sizes are expressed in bytes. Use 'k' and 'M' prefixes to make the names as short as possible e.g. in /proc/slabinfo. This works, as we mostly use power-of-two sizes, with exceptions only below 1k. Example: 'kmalloc-4194304' becomes 'kmalloc-4M' Link: http://lkml.kernel.org/r/20180731090649.16028-7-vbabka@suse.cz Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b29940c1ab |
mm: rename and change semantics of nr_indirectly_reclaimable_bytes
The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
commit
|
||
|
1291523f2c |
mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is reflected also the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A
prominent example are dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit
|
||
|
cc252eae85 |
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4. As discussed at LSF/MM [1] here's a patchset that introduces kmalloc-reclaimable caches (more details in the second patch) and uses them for dcache external names. That allows us to repurpose the NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series. With patch 3/6, dcache external names are allocated from kmalloc-rcl-* caches, eliminating the need for manual accounting. More importantly, it also ensures the reclaimable kmalloc allocations are grouped in pages separate from the regular kmalloc allocations. The need for proper accounting of dcache external names has shown it's easy for misbehaving process to allocate lots of them, causing premature OOMs. Without the added grouping, it's likely that a similar workload can interleave the dcache external names allocations with regular kmalloc allocations (note: I haven't searched myself for an example of such regular kmalloc allocation, but I would be very surprised if there wasn't some). A pathological case would be e.g. one 64byte regular allocations with 63 external dcache names in a page (64x64=4096), which means the page is not freed even after reclaiming after all dcache names, and the process can thus "steal" the whole page with single 64byte allocation. If other kmalloc users similar to dcache external names become identified, they can also benefit from the new functionality simply by adding __GFP_RECLAIMABLE to the kmalloc calls. Side benefits of the patchset (that could be also merged separately) include removed branch for detecting __GFP_DMA kmalloc(), and shortening kmalloc cache names in /proc/slabinfo output. The latter is potentially an ABI break in case there are tools parsing the names and expecting the values to be in bytes. This is how /proc/slabinfo looks like after booting in virtme: ... kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 ... kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0 kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0 kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0 kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0 kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0 ... /proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter: ... nr_slab_reclaimable 2817 nr_slab_unreclaimable 1781 ... nr_kernel_misc_reclaimable 0 ... /proc/meminfo with new KReclaimable counter: ... Shmem: 564 kB KReclaimable: 11260 kB Slab: 18368 kB SReclaimable: 11260 kB SUnreclaim: 7108 kB KernelStack: 1248 kB ... This patch (of 6): The kmalloc caches currently mainain separate (optional) array kmalloc_dma_caches for __GFP_DMA allocations. There are tests for __GFP_DMA in the allocation hotpaths. We can avoid the branches by combining kmalloc_caches and kmalloc_dma_caches into a single two-dimensional array where the outer dimension is cache "type". This will also allow to add kmalloc-reclaimable caches as a third type. Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
3b9aadf727 |
userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults
get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would not be waiting for userfaults before failing and it would hit on a SIGBUS instead. Using get_user_pages_locked/unlocked instead will allow get_mempolicy to allow userfaults to resolve the fault and fill the hole, before grabbing the node id of the page. If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an address inside an area managed by uffd and there is no page at that address, the page allocation from within get_mempolicy() will fail because get_user_pages() does not allow for page fault retry required for uffd; the user will get SIGBUS. With this patch, the page fault will be resolved by the uffd and the get_mempolicy() will continue normally. Background: Via code review, previously the syscall would have returned -EFAULT (vm_fault_to_errno), now it will block and wait for an userfault (if it's waken before the fault is resolved it'll still -EFAULT). This way get_mempolicy will give a chance to an "unaware" app to be compliant with userfaults. The reason this visible change is that becoming "userfault compliant" cannot regress anything: all other syscalls including read(2)/write(2) had to become "userfault compliant" long time ago (that's one of the things userfaultfd can do that PROT_NONE and trapping segfaults can't). So this is just one more syscall that become "userfault compliant" like all other major ones already were. This has been happening on virtio-bridge dpdk process which just called get_mempolicy on the guest space post live migration, but before the memory had a chance to be migrated to destination. I didn't run an strace to be able to show the -EFAULT going away, but I've the confirmation of the below debug aid information (only visible with CONFIG_DEBUG_VM=y) going away with the patch: [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0 [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1 [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017 [20116.371466] Call Trace: [20116.371473] dump_stack+0x5c/0x80 [20116.371476] handle_userfault.cold.37+0x1b/0x22 [20116.371479] ? remove_wait_queue+0x20/0x60 [20116.371481] ? poll_freewait+0x45/0xa0 [20116.371483] ? do_sys_poll+0x31c/0x520 [20116.371485] ? radix_tree_lookup_slot+0x1e/0x50 [20116.371488] shmem_getpage_gfp+0xce7/0xe50 [20116.371491] ? page_add_file_rmap+0x1a/0x2c0 [20116.371493] shmem_fault+0x78/0x1e0 [20116.371495] ? filemap_map_pages+0x3a1/0x450 [20116.371498] __do_fault+0x1f/0xc0 [20116.371500] __handle_mm_fault+0xe2e/0x12f0 [20116.371502] handle_mm_fault+0xda/0x200 [20116.371504] __get_user_pages+0x238/0x790 [20116.371506] get_user_pages+0x3e/0x50 [20116.371510] kernel_get_mempolicy+0x40b/0x700 [20116.371512] ? vfs_write+0x170/0x1a0 [20116.371515] __x64_sys_get_mempolicy+0x21/0x30 [20116.371517] do_syscall_64+0x5b/0x160 [20116.371520] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The above harmless debug message (not a kernel crash, just a dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify and improve kernel spots that may have to become "userfaultfd compliant" like this one (without having to run an strace and search for syscall misbehavior). Spots like the above are more closer to a kernel bug for the non-cooperative usages that Mike focuses on, than for for dpdk qemu-cooperative usages that reproduced it, but it's still nicer to get this fixed for dpdk too. The part of the patch that caused me to think is only the implementation issue of mpol_get, but it looks like it should work safe no matter the kind of mempolicy structure that is (the default static policy also starts at 1 so it'll go to 2 and back to 1 without crashing everything at 0). [rppt@linux.vnet.ibm.com: changelog addition] http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
9b5a8e00d4 |
mm: convert insert_pfn() to vm_fault_t
All callers convert its errno into a vm_fault_t, so convert it to return a vm_fault_t directly. Link: http://lkml.kernel.org/r/20180828145728.11873-11-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
79f3aa5ba9 |
mm: convert __vm_insert_mixed() to vm_fault_t
Both of its callers currently convert its errno return into a vm_fault_t, so move the conversion into __vm_insert_mixed(). Link: http://lkml.kernel.org/r/20180828145728.11873-10-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |