Commit Graph

856835 Commits

Author SHA1 Message Date
Christoph Hellwig
a520110e4a mm: split out a new pagewalk.h header from mm.h
Add a new header for the two handful of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.

Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:04 -03:00
Daniel Vetter
810e24e009 mm/mmu_notifiers: annotate with might_sleep()
Since mmu notifiers don't exist for many processes, but could block in
interesting places, add some annotations. This should help make sure the
core mm keeps up its end of the mmu notifier contract.

The checks here are outside of all notifier checks because of that.
They compile away without CONFIG_DEBUG_ATOMIC_SLEEP.

Link: https://lore.kernel.org/r/20190826201425.17547-6-daniel.vetter@ffwll.ch
Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:04 -03:00
Daniel Vetter
66204f1d2d mm/mmu_notifiers: prime lockdep
We want to teach lockdep that mmu notifiers can be called from direct
reclaim paths, since on many CI systems load might never reach that
level (e.g. when just running fuzzer or small functional tests).

I've put the annotation into mmu_notifier_register since only when we have
mmu notifiers registered is there any point in teaching lockdep about
them. Also, we already have a kmalloc(, GFP_KERNEL), so this is safe.

Link: https://lore.kernel.org/r/20190826201425.17547-3-daniel.vetter@ffwll.ch
Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:04 -03:00
Daniel Vetter
23b68395c7 mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly
easy to provoke a specific notifier to be run on a specific range: Just
prep it, and then munmap() it.

A bit harder, but still doable, is to provoke the mmu notifiers for all
the various callchains that might lead to them. But both at the same time
is really hard to reliably hit, especially when you want to exercise paths
like direct reclaim or compaction, where it's not easy to control what
exactly will be unmapped.

By introducing a lockdep map to tie them all together we allow lockdep to
see a lot more dependencies, without having to actually hit them in a
single challchain while testing.

On Jason's suggestion this is is rolled out for both
invalidate_range_start and invalidate_range_end. They both have the same
calling context, hence we can share the same lockdep map. Note that the
annotation for invalidate_ranage_start is outside of the
mm_has_notifiers(), to make sure lockdep is informed about all paths
leading to this context irrespective of whether mmu notifiers are present
for a given context. We don't do that on the invalidate_range_end side to
avoid paying the overhead twice, there the lockdep annotation is pushed
down behind the mm_has_notifiers() check.

Link: https://lore.kernel.org/r/20190826201425.17547-2-daniel.vetter@ffwll.ch
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:27:42 -03:00
Christoph Hellwig
f0ade90a8a mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
No modular code uses these, which makes a lot of sense given the wrappers
around them are only called by core mm code.

Link: https://lore.kernel.org/r/20190828142109.29012-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-28 11:52:35 -03:00
Ralph Campbell
c18ce674d5 mm/hmm: hmm_range_fault() infinite loop
Normally, callers to handle_mm_fault() are supposed to check the
vma->vm_flags first. hmm_range_fault() checks for VM_READ but doesn't
check for VM_WRITE if the caller requests a page to be faulted in with
write permission (via the hmm_range.pfns[] value).  If the vma is write
protected, this can result in an infinite loop:

  hmm_range_fault()
    walk_page_range()
      ...
      hmm_vma_walk_hole()
        hmm_vma_walk_hole_()
          hmm_vma_do_fault()
            handle_mm_fault(FAULT_FLAG_WRITE)
            /* returns VM_FAULT_WRITE */
          /* returns -EBUSY */
        /* returns -EBUSY */
      /* returns -EBUSY */
    /* loops on -EBUSY and range->valid */

Prevent this by checking for vma->vm_flags & VM_WRITE before calling
handle_mm_fault().

Link: https://lore.kernel.org/r/20190823221753.2514-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-27 19:27:07 -03:00
Ralph Campbell
6c64f2bbe7 mm/hmm: hmm_range_fault() NULL pointer bug
Although hmm_range_fault() calls find_vma() to make sure that a vma exists
before calling walk_page_range(), hmm_vma_walk_hole() can still be called
with walk->vma == NULL if the start and end address are not contained
within the vma range.

 hmm_range_fault() /* calls find_vma() but no range check */
  walk_page_range() /* calls find_vma(), sets walk->vma = NULL */
   __walk_page_range()
    walk_pgd_range()
     walk_p4d_range()
      walk_pud_range()
       hmm_vma_walk_hole()
        hmm_vma_walk_hole_()
         hmm_vma_do_fault()
          handle_mm_fault(vma=0)

Link: https://lore.kernel.org/r/20190823221753.2514-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-27 19:27:07 -03:00
Yang, Philip
e3fe8e555d mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
hmm_range_fault() may return NULL pages because some of the pfns are equal
to HMM_PFN_NONE. This happens randomly under memory pressure. The reason
is during the swapped out page pte path, hmm_vma_handle_pte() doesn't
update the fault variable from cpu_flags, so it failed to call
hmm_vam_do_fault() to swap the page in.

The fix is to call hmm_pte_need_fault() to update fault variable.

Fixes: 74eee180b9 ("mm/hmm/mirror: device page fault handler")
Link: https://lore.kernel.org/r/20190815205227.7949-1-Philip.Yang@amd.com
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-23 10:22:20 -03:00
Jason Gunthorpe
c96245148c mm/mmu_notifiers: remove unregister_no_release
mmu_notifier_unregister_no_release() and mmu_notifier_call_srcu() no
longer have any users, they have all been converted to use
mmu_notifier_put().

So delete this difficult to use interface.

Link: https://lore.kernel.org/r/20190806231548.25242-12-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 20:58:19 -03:00
Jason Gunthorpe
47f725ee7b RDMA/odp: remove ib_ucontext from ib_umem
At this point the ucontext is only being stored to access the ib_device,
so just store the ib_device directly instead. This is more natural and
logical as the umem has nothing to do with the ucontext.

Link: https://lore.kernel.org/r/20190806231548.25242-8-jgg@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 20:58:19 -03:00
Jason Gunthorpe
c571feca2d RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
This is a significant simplification, no extra list is kept per FD, and
the interval tree is now shared between all the ucontexts, reducing
overhead if there are multiple ucontexts active.

Link: https://lore.kernel.org/r/20190806231548.25242-7-jgg@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 20:58:18 -03:00
Jason Gunthorpe
daa138a58c Merge branch 'odp_fixes' into hmm.git
From rdma.git

Jason Gunthorpe says:

====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================

The branch is based on v5.3-rc5 due to dependencies, and is being taken
into hmm.git due to dependencies in the next patches.

* odp_fixes:
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  RDMA/core: Make invalidate_range a device operation
  RDMA/odp: Use kvcalloc for the dma_list and page_list
  RDMA/odp: Check for overflow when computing the umem_odp end
  RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
  RDMA/odp: Split creating a umem_odp from ib_umem_get
  RDMA/odp: Make the three ways to create a umem_odp clear
  RMDA/odp: Consolidate umem_odp initialization
  RDMA/odp: Make it clearer when a umem is an implicit ODP umem
  RDMA/odp: Iterate over the whole rbtree directly
  RDMA/odp: Use the common interval tree library instead of generic
  RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 20:58:18 -03:00
Jason Gunthorpe
fba0e448a2 RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
These are the same thing since mr always comes from odp->private. It is
confusing to reference the same memory via two names.

Link: https://lore.kernel.org/r/20190819111710.18440-13-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:43 -03:00
Jason Gunthorpe
a705f3e3a1 RDMA/mlx5: Use ib_umem_start instead of umem.address
These are subtly different, the address is the original VA requested
during umem_get, while ib_umem_start() is the version that is rounded to
the proper page size, ie is the true start of the umem's dma map.

Link: https://lore.kernel.org/r/20190819111710.18440-12-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:43 -03:00
Moni Shoua
ce51346fee RDMA/core: Make invalidate_range a device operation
The callback function 'invalidate_range' is implemented in a driver so the
place for it is in the ib_device_ops structure and not in ib_ucontext.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Link: https://lore.kernel.org/r/20190819111710.18440-11-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:43 -03:00
Jason Gunthorpe
37824952dc RDMA/odp: Use kvcalloc for the dma_list and page_list
There is no specific need for these to be in the valloc space, let the
system decide automatically how to do the allocation.

Link: https://lore.kernel.org/r/20190819111710.18440-10-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:43 -03:00
Jason Gunthorpe
204e3e5630 RDMA/odp: Check for overflow when computing the umem_odp end
Since the page size can be extended in the ODP case by IB_ACCESS_HUGETLB
the existing overflow checks done by ib_umem_get() are not
sufficient. Check for overflow again.

Further, remove the unchecked math from the inlines and just use the
precomputed value stored in the interval_tree_node.

Link: https://lore.kernel.org/r/20190819111710.18440-9-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:43 -03:00
Jason Gunthorpe
0446cad9ca RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
Now that there are allocator APIs that return the ib_umem_odp directly
it should be freed through a umem_odp free'er as well.

Link: https://lore.kernel.org/r/20190819111710.18440-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:42 -03:00
Jason Gunthorpe
261dc53f8e RDMA/odp: Split creating a umem_odp from ib_umem_get
This is the last creation API that is overloaded for both, there is very
little code sharing and a driver has to be specifically ready for a
umem_odp to be created to use the odp version.

Link: https://lore.kernel.org/r/20190819111710.18440-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:42 -03:00
Jason Gunthorpe
f20bef6a95 RDMA/odp: Make the three ways to create a umem_odp clear
The three paths to build the umem_odps are kind of muddled, they are:
- As a normal ib_mr umem
- As a child in an implicit ODP umem tree
- As the root of an implicit ODP umem tree

Only the first two are actually umem's, the last is an abuse.

The implicit case can only be triggered by explicit driver request, it
should never be co-mingled with the normal case. While we are here, make
sensible function names and add some comments to make this clearer.

Link: https://lore.kernel.org/r/20190819111710.18440-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:42 -03:00
Jason Gunthorpe
22d79c9a91 RMDA/odp: Consolidate umem_odp initialization
This is done in two different places, consolidate all the post-allocation
initialization into a single function.

Link: https://lore.kernel.org/r/20190819111710.18440-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:42 -03:00
Jason Gunthorpe
fd7dbf035e RDMA/odp: Make it clearer when a umem is an implicit ODP umem
Implicit ODP umems are special, they don't have any page lists, they don't
exist in the interval tree and they are never DMA mapped.

Instead of trying to guess this based on a zero length use an explicit
flag.

Further, do not allow non-implicit umems to be 0 size.

Link: https://lore.kernel.org/r/20190819111710.18440-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:42 -03:00
Jason Gunthorpe
f993de88a5 RDMA/odp: Iterate over the whole rbtree directly
Instead of intersecting a full interval, just iterate over every element
directly. This is faster and clearer.

Link: https://lore.kernel.org/r/20190819111710.18440-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 14:08:24 -03:00
Jason Gunthorpe
7cc2e18f21 RDMA/odp: Use the common interval tree library instead of generic
ODP is working with userspace VA's in the interval tree which always fit
into an unsigned long, so we can use the common code.

This comes at a cost of a 16 byte increase in ib_umem_odp struct size due
to storing the interval tree start/last in addition to the umem
addr/length. However these values were computed and are performance
critical for the interval lookup, so this seems like a worthwhile trade
off.

Removes 2k of .text from the kernel.

Link: https://lore.kernel.org/r/20190819111710.18440-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 13:34:09 -03:00
Jason Gunthorpe
27b7fb1ab7 RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB
When ODP is enabled with IB_ACCESS_HUGETLB then the required pages
should be calculated based on the extent of the MR, which is rounded
to the nearest huge page alignment.

Fixes: d2183c6f19 ("RDMA/umem: Move page_shift from ib_umem to ib_odp_umem")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-5-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-08-20 13:44:43 -04:00
Christoph Hellwig
6869b7b206 memremap: provide a not device managed memremap_pages
The kvmppc ultravisor code wants a device private memory pool that is
system wide and not attached to a device.  Instead of faking up one
provide a low-level memremap_pages for it.

Link: https://lore.kernel.org/r/20190818090557.17853-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
6f42193fd8 memremap: don't use a separate devm action for devmap_managed_enable_get
Just clean up for early failures and then piggy back on
devm_memremap_pages_release.  This helps with a pending not device managed
version of devm_memremap_pages.

Link: https://lore.kernel.org/r/20190818090557.17853-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
fdc029b19d memremap: remove the dev field in struct dev_pagemap
The dev field in struct dev_pagemap is only used to print dev_name in two
places, which are at best nice to have.  Just remove the field and thus
the name in those two messages.

Link: https://lore.kernel.org/r/20190818090557.17853-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
0c38519039 resource: add a not device managed request_free_mem_region variant
Factor out the guts of devm_request_free_mem_region so that we can
implement both a device managed and a manually release version as tiny
wrappers around it.

Link: https://lore.kernel.org/r/20190818090557.17853-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:39:41 -03:00
Daniel Vetter
8402ce61be mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail
Just a bit of paranoia, since if we start pushing this deep into
callchains it's hard to spot all places where an mmu notifier
implementation might fail when it's not allowed to.

Inspired by some confusion we had discussing i915 mmu notifiers and
whether we could use the newly-introduced return value to handle some
corner cases. Until we realized that these are only for when a task has
been killed by the oom reaper.

An alternative approach would be to split the callback into two versions,
one with the int return value, and the other with void return value like
in older kernels. But that's a lot more churn for fairly little gain I
think.

Summary from the m-l discussion on why we want something at warning level:
This allows automated tooling in CI to catch bugs without humans having to
look at everything. If we just upgrade the existing pr_info to a pr_warn,
then we'll have false positives. And as-is, no one will ever spot the
problem since it's lost in the massive amounts of overall dmesg noise.

Link: https://lore.kernel.org/r/20190814202027.18735-2-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
9b2ed9cb97 mm: remove CONFIG_MIGRATE_VMA_HELPER
CONFIG_MIGRATE_VMA_HELPER guards helpers that are required for proper
devic private memory support.  Remove the option and just check for
CONFIG_DEVICE_PRIVATE instead.

Link: https://lore.kernel.org/r/20190814075928.23766-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
06d462beb4 mm: remove the unused MIGRATE_PFN_DEVICE flag
No one ever checks this flag, and we could easily get that information
from the page if needed.

Link: https://lore.kernel.org/r/20190814075928.23766-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
2a915acf88 mm: remove the unused MIGRATE_PFN_ERROR flag
Now that we can rely errors in the normal control flow there is no
need for this flag, remove it.

Link: https://lore.kernel.org/r/20190814075928.23766-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
f268307ec7 nouveau: simplify nouveau_dmem_migrate_vma
Factor the main copy page to vram routine out into a helper that acts
on a single page and which doesn't require the nouveau_dmem_migrate
structure for argument passing.  As an added benefit the new version
only allocates the dma address array once and reuses it for each
subsequent chunk of work.

Link: https://lore.kernel.org/r/20190814075928.23766-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
bfe69ef94a nouveau: simplify nouveau_dmem_migrate_to_ram
Factor the main copy page to ram routine out into a helper that acts on
a single page and which doesn't require the nouveau_dmem_fault
structure for argument passing.  Also remove the loop over multiple
pages as we only handle one at the moment, although the structure of
the main worker function makes it relatively easy to add multi page
support back if needed in the future.  But at least for now this avoid
the needed to dynamically allocate memory for the dma addresses in
what is essentially the page fault path.

Link: https://lore.kernel.org/r/20190814075928.23766-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
107ba59fc6 nouveau: remove a few function stubs
nouveau_dmem_migrate_vma and nouveau_dmem_convert_pfn are only called
when CONFIG_DRM_NOUVEAU_SVM is enabled, so there is no need to provide
!CONFIG_DRM_NOUVEAU_SVM stubs for them.

Link: https://lore.kernel.org/r/20190814075928.23766-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
2ab2bda53c nouveau: factor out dmem fence completion
Factor out the end of fencing logic from the two migration routines.

Link: https://lore.kernel.org/r/20190814075928.23766-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
64de8b8d65 nouveau: factor out device memory address calculation
Factor out the repeated device memory address calculation into
a helper.

Link: https://lore.kernel.org/r/20190814075928.23766-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
dea027f282 nouveau: reset dma_nr in nouveau_dmem_migrate_alloc_and_copy
When we start a new batch of dma_map operations we need to reset dma_nr,
as we start filling a newly allocated array.

Link: https://lore.kernel.org/r/20190814075928.23766-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:03 -03:00
Christoph Hellwig
a7d1f22bb7 mm: turn migrate_vma upside down
There isn't any good reason to pass callbacks to migrate_vma.  Instead
we can just export the three steps done by this function to drivers and
let them sequence the operation without callbacks.  This removes a lot
of boilerplate code as-is, and will allow the drivers to drastically
improve code flow and error handling further on.

Link: https://lore.kernel.org/r/20190814075928.23766-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:02 -03:00
Jason Gunthorpe
f4fb3b9c19 Merge 'notifier_get_put' into hmm.git
Jason Gunthorpe says:

====================
Add mmu_notifier_get/put for managing mmu notifier registrations

This series introduces a new registration flow for mmu_notifiers based on
the idea that the user would like to get a single refcounted piece of
memory for a mm, keyed to its use.

For instance many users of mmu_notifiers use an interval tree or similar
to dispatch notifications to some object. There are many objects but only
one notifier subscription per mm holding the tree.

Of the 12 places that call mmu_notifier_register:
 - 7 are maintaining some kind of obvious mapping of mm_struct to
   mmu_notifier registration, ie in some linked list or hash table. Of
   the 7 this series converts 4 (gru, hmm, RDMA, radeon)

 - 3 (hfi1, gntdev, vhost) are registering multiple notifiers, but each
   one immediately does some VA range filtering, ie with an interval tree.
   These would be better with a global subsystem-wide range filter and
   could convert to this API.

 - 2 (kvm, amd_iommu) are deliberately using a single mm at a time, and
   really can't use this API. One of the intel-svm's modes is also in this
   list

The 3/7 unconverted drivers are:
 - intel-svm
   This driver tracks mm's in a global linked list 'global_svm_list'
   and would benefit from this API.

   Its flow is a bit complex, since it also wants a set of non-shared
   notifiers.

 - i915_gem_usrptr
   This driver tracks mm's in a per-device hash
   table (dev_priv->mm_structs), but only has an optional use of
   mmu_notifiers.  Since it still seems to need the hash table it is
   difficult to convert.

 - amdkfd/kfd_process
   This driver is using a global SRCU hash table to track mm's

   The control flow here is very complicated and the driver is relying on
   this hash table to be fast on the ioctl syscall path.

   It would definitely benefit, but only if the ioctl path didn't need to
   do the search so often.
====================

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:02 -03:00
Jason Gunthorpe
471f390205 drm/amdkfd: use mmu_notifier_put
The sequence of mmu_notifier_unregister_no_release(),
mmu_notifier_call_srcu() is identical to mmu_notifier_put() with the
free_notifier callback.

As this is the last user of those APIs, converting it means we can drop
them.

Link: https://lore.kernel.org/r/20190806231548.25242-11-jgg@ziepe.ca
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:02 -03:00
Jason Gunthorpe
0029cab314 drm/amdkfd: fix a use after free race with mmu_notifer unregister
When using mmu_notifer_unregister_no_release() the caller must ensure
there is a SRCU synchronize before the mn memory is freed, otherwise use
after free races are possible, for instance:

     CPU0                                      CPU1
                                      invalidate_range_start
                                         hlist_for_each_entry_rcu(..)
 mmu_notifier_unregister_no_release(&p->mn)
 kfree(mn)
                                      if (mn->ops->invalidate_range_end)

The error unwind in amdkfd misses the SRCU synchronization.

amdkfd keeps the kfd_process around until the mm is released, so split the
flow to fully initialize the kfd_process and register it for find_process,
and with the notifier. Past this point the kfd_process does not need to be
cleaned up as it is fully ready.

The final failable step does a vm_mmap() and does not seem to impact the
kfd_process global state. Since it also cannot be undone (and already has
problems with undo if it internally fails), it has to be last.

This way we don't have to try to unwind the mmu_notifier_register() and
avoid the problem with the SRCU.

Along the way this also fixes various other error unwind bugs in the flow.

Fixes: 45102048f7 ("amdkfd: Add process queue manager module")
Link: https://lore.kernel.org/r/20190806231548.25242-10-jgg@ziepe.ca
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:02 -03:00
Jason Gunthorpe
534e5f84b7 drm/radeon: use mmu_notifier_get/put for struct radeon_mn
radeon is using a device global hash table to track what mmu_notifiers
have been registered on struct mm. This is better served with the new
get/put scheme instead.

radeon has a bug where it was not blocking notifier release() until all
the BO's had been invalidated. This could result in a use after free of
pages the BOs. This is tied into a second bug where radeon left the
notifiers running endlessly even once the interval tree became
empty. This could result in a use after free with module unload.

Both are fixed by changing the lifetime model, the BOs exist in the
interval tree with their natural lifetimes independent of the mm_struct
lifetime using the get/put scheme. The release runs synchronously and just
does invalidate_start across the entire interval tree to create the
required DMA fence.

Additions to the interval tree after release are already impossible as
only current->mm is used during the add.

Link: https://lore.kernel.org/r/20190806231548.25242-9-jgg@ziepe.ca
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:02 -03:00
Jason Gunthorpe
c7d8b7824f hmm: use mmu_notifier_get/put for 'struct hmm'
This is a significant simplification, it eliminates all the remaining
'hmm' stuff in mm_struct, eliminates krefing along the critical notifier
paths, and takes away all the ugly locking and abuse of page_table_lock.

mmu_notifier_get() provides the single struct hmm per struct mm which
eliminates mm->hmm.

It also directly guarantees that no mmu_notifier op callback is callable
while concurrent free is possible, this eliminates all the krefs inside
the mmu_notifier callbacks.

The remaining krefs in the range code were overly cautious, drivers are
already not permitted to free the mirror while a range exists.

Link: https://lore.kernel.org/r/20190806231548.25242-6-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:02 -03:00
Linus Torvalds
d1abaeb3be Linux 5.3-rc5 2019-08-18 14:31:08 -07:00
Linus Torvalds
6825e5a6c4 This pull request contains a single fix for MTD:
- spi-nor: Fix to correctly set the WP pin
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl1ZqZMWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wZ21EADh+84WfLoWiAaD8wPByib7E6s2
 Lu8yfOjEpAskts2/v7H5wihYtgnKcyOgqVIjSJr9uQkbRebi98LNBPr9mhyW+TB0
 KzrDXb3l5gJkfHPHK9WaZFkH7c6hXCgnpD3g6YVZF4eU52znhc0Ci2UIEjHtoy/7
 ULrz6IilIH20gDC8GA2vJIMEwTBjYf3hMFmxPq080bjO3KzSv8yGX45IoYpARk2C
 DZxHkFM8wj25B59ChmtQ6kKtkf7MBC0WwEeaeUoLT1lh59wY5cU+nBECl/4ujwh7
 WRb7ZH8ml5t+vyMebg/pxHJiQNUQBoKv2TMPKhaRvnlu44WWHEClvOfWI29AZDGN
 pm1zgkCw/R3+6qpdBgYqKv6GVuiUqcgciMz0jG4RdmhYqSe9RFoPoF0LAfFoZ22i
 jtNfbseS130n7jgvW7VAV9OFU6U5PAp/WM0UdQcjfuwJKEY/spkmDOqtKF73YSrM
 N9NHD4JUoc7mXKQWkO0y/hNE6g6MK2GInkJgXgwpl0DuP4mQs8PJTXEqVd6YCpnF
 LykPtLO8D9AFg8Cms/adxdrBDIAL4ZI9klh0agWPGpDBadHdndqVaWH52RDosxBw
 L6qsAsB2hyQm0bt5TCFMsYMUY5dCqb4Q4wUexXPT1wy1/uAlwxagwgRT159SaRVs
 a3/94O0KnhRii4Qt9Q==
 =AzKU
 -----END PGP SIGNATURE-----

Merge tag 'fixes-for-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux

Pull MTD fix from Richard Weinberger:
 "A single fix for MTD to correctly set the spi-nor WP pin"

* tag 'fixes-for-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
  mtd: spi-nor: Fix the disabling of write protection at init
2019-08-18 12:56:42 -07:00
Linus Torvalds
3039fadf2b for-5.3-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1ZOygACgkQxWXV+ddt
 WDvokw/8CRjbN5Bjlk/JzmikH+mU28Cd7qQgahWw7Afkyh5Gzb4IJJbNXzapy988
 dMMKYF2H6lxY46EZG8cF4MFjCv8L4L1eZ9CqwfZyf7MfzPL2pnLSN77QJYYebWp3
 y9rc8xv+qUdIcumQP25yxXmtN0YxT5bIJiuCmpHJFNfyijHVRnXV2CxQ8nwe63/1
 G25a03x6BNKtSrU3ZP0fW2VjARKzIF7i+bEy4Ew3n3dNKqAiROYecswcBedhCBkF
 D9hL62uHcxQSeHi/6lAZYKpsp4g4pKEO4c92MxsI8PJtd4zgHQG7MttXsHC1F1FQ
 lHcP3vxKICOOMa13W6QdytSX6uSpjeLyMTDfmvaahQmwG6I5dBN56pkGlWdDMKNn
 NUCNBmV063D7Shed4W2uIafLfo3BLEwKr2pd6pOfOZHwOKPWblJ0v4KzVDLoKC7v
 CA5HB9BBz4KfSyp3fkm9B5VGJW938vRHVfx55IoakL+tfs67cKDYo8QEFHDuz0P5
 DrYNwKvp4a5llGQ6vdRKfeiN7vBea10795MI5vFJiGUHfY3pN99R4UUed7kaul58
 n6gqYukcPtIBqm8auxK037nngA+V0N2y0ceM1/aKUGaZVlCtSmEKXseYjbaiH6fP
 MEZSgZn4W5qnu2oQuKohfQsNHtY5WP9489GpByy+DS5QsbDaueg=
 =ovM7
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two fixes that popped up during testing:

   - fix for sysfs-related code that adds/removes block groups, warnings
     appear during several fstests in connection with sysfs updates in
     5.3, the fix essentially replaces a workaround with scope NOFS and
     applies to 5.2-based branch too

   - add sanity check of trim range"

* tag 'for-5.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: trim: Check the range passed into to prevent overflow
  Btrfs: fix sysfs warning and missing raid sysfs directories
2019-08-18 09:51:48 -07:00
Linus Torvalds
c332f3a70e Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A set of fixes for x86:

   - Fix the inconsistent error handling in the umwait init code

   - Rework the boot param zeroing so gcc9 stops complaining about out
     of bound memset. The resulting source code is actually more sane to
     read than the smart solution we had

   - Maintainers update so Tony gets involved when Intel models are
     added

   - Some more fallthrough fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Save fields explicitly, zero out everything else
  MAINTAINERS, x86/CPU: Tony Luck will maintain asm/intel-family.h
  x86/fpu/math-emu: Address fallthrough warnings
  x86/apic/32: Fix yet another implicit fallthrough warning
  x86/umwait: Fix error handling in umwait_init()
2019-08-18 09:45:42 -07:00
Linus Torvalds
645c03aaca Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fix from Thomas Gleixner:
 "A single fix for a EFI mixed mode regression caused by recent rework
  which did not take the firmware bitwidth into account"

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi-stub: Fix get_efi_config_table on mixed-mode setups
2019-08-18 09:36:51 -07:00