By moving this into get_domain_for_dev() we can make dmar_insert_dev_info()
suitable for use with "special" domains such as the si_domain, which
currently use domain_add_dev_info().
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
It's not only for PCI devices any more, and the scope information for an
ACPI device provides the bus and devfn so that has to be stored here too.
It is the device pointer itself which needs to be protected with RCU,
so the __rcu annotation follows it into the definition of struct
dmar_dev_scope, since we're no longer just passing arrays of device
pointers around.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
In commit 2e12bc29 ("intel-iommu: Default to non-coherent for domains
unattached to iommus") we decided to err on the side of caution and
always assume that it's possible that a device will be attached which is
behind a non-coherent IOMMU.
In some cases, however, that just *cannot* happen. If there *are* no
IOMMUs in the system which are non-coherent, then we don't need to do
it. And flushing the dcache is a *significant* performance hit.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
There is a race condition between the existing clear/free code and the
hardware. The IOMMU is actually permitted to cache the intermediate
levels of the page tables, and doesn't need to walk the table from the
very top of the PGD each time. So the existing back-to-back calls to
dma_pte_clear_range() and dma_pte_free_pagetable() can lead to a
use-after-free where the IOMMU reads from a freed page table.
When freeing page tables we actually need to do the IOTLB flush, with
the 'invalidation hint' bit clear to indicate that it's not just a
leaf-node flush, after unlinking each page table page from the next level
up but before actually freeing it.
So in the rewritten domain_unmap() we just return a list of pages (using
pg->freelist to make a list of them), and then the caller is expected to
do the appropriate IOTLB flush (or tear down the domain completely,
whatever), before finally calling dma_free_pagelist() to free the pages.
As an added bonus, we no longer need to flush the CPU's data cache for
pages which are about to be *removed* from the page table hierarchy anyway,
in the non-cache-coherent case. This drastically improves the performance
of large unmaps.
As a side-effect of all these changes, this also fixes the fact that
intel_iommu_unmap() was neglecting to free the page tables for the range
in question after clearing them.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We have this horrid API where iommu_unmap() can unmap more than it's asked
to, if the IOVA in question happens to be mapped with a large page.
Instead of propagating this nonsense to the point where we end up returning
the page order from dma_pte_clear_range(), let's just do it once and adjust
the 'size' parameter accordingly.
Augment pfn_to_dma_pte() to return the level at which the PTE was found,
which will also be useful later if we end up changing the API for
iommu_iova_to_phys() to behave the same way as is being discussed upstream.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Now we have a PCI bus notification based mechanism to update DMAR
device scope array, we could extend the mechanism to support boot
time initialization too, which will help to unify and simplify
the implementation.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Current Intel DMAR/IOMMU driver assumes that all PCI devices associated
with DMAR/RMRR/ATSR device scope arrays are created at boot time and
won't change at runtime, so it caches pointers of associated PCI device
object. That assumption may be wrong now due to:
1) introduction of PCI host bridge hotplug
2) PCI device hotplug through sysfs interfaces.
Wang Yijing has tried to solve this issue by caching <bus, dev, func>
tupple instead of the PCI device object pointer, but that's still
unreliable because PCI bus number may change in case of hotplug.
Please refer to http://lkml.org/lkml/2013/11/5/64
Message from Yingjing's mail:
after remove and rescan a pci device
[ 611.857095] dmar: DRHD: handling fault status reg 2
[ 611.857109] dmar: DMAR:[DMA Read] Request device [86:00.3] fault addr ffff7000
[ 611.857109] DMAR:[fault reason 02] Present bit in context entry is clear
[ 611.857524] dmar: DRHD: handling fault status reg 102
[ 611.857534] dmar: DMAR:[DMA Read] Request device [86:00.3] fault addr ffff6000
[ 611.857534] DMAR:[fault reason 02] Present bit in context entry is clear
[ 611.857936] dmar: DRHD: handling fault status reg 202
[ 611.857947] dmar: DMAR:[DMA Read] Request device [86:00.3] fault addr ffff5000
[ 611.857947] DMAR:[fault reason 02] Present bit in context entry is clear
[ 611.858351] dmar: DRHD: handling fault status reg 302
[ 611.858362] dmar: DMAR:[DMA Read] Request device [86:00.3] fault addr ffff4000
[ 611.858362] DMAR:[fault reason 02] Present bit in context entry is clear
[ 611.860819] IPv6: ADDRCONF(NETDEV_UP): eth3: link is not ready
[ 611.860983] dmar: DRHD: handling fault status reg 402
[ 611.860995] dmar: INTR-REMAP: Request device [[86:00.3] fault index a4
[ 611.860995] INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
This patch introduces a new mechanism to update the DRHD/RMRR/ATSR device scope
caches by hooking PCI bus notification.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Global DMA and interrupt remapping resources may be accessed in
interrupt context, so use RCU instead of rwsem to protect them
in such cases.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Introduce a global rwsem dmar_global_lock, which will be used to
protect DMAR related global data structures from DMAR/PCI/memory
device hotplug operations in process context.
DMA and interrupt remapping related data structures are read most,
and only change when memory/PCI/DMAR hotplug event happens.
So a global rwsem solution is adopted for balance between simplicity
and performance.
For interrupt remapping driver, function intel_irq_remapping_supported(),
dmar_table_init(), intel_enable_irq_remapping(), disable_irq_remapping(),
reenable_irq_remapping() and enable_drhd_fault_handling() etc
are called during booting, suspending and resuming with interrupt
disabled, so no need to take the global lock.
For interrupt remapping entry allocation, the locking model is:
down_read(&dmar_global_lock);
/* Find corresponding iommu */
iommu = map_hpet_to_ir(id);
if (iommu)
/*
* Allocate remapping entry and mark entry busy,
* the IOMMU won't be hot-removed until the
* allocated entry has been released.
*/
index = alloc_irte(iommu, irq, 1);
up_read(&dmar_global_lock);
For DMA remmaping driver, we only uses the dmar_global_lock rwsem to
protect functions which are only called in process context. For any
function which may be called in interrupt context, we will use RCU
to protect them in following patches.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Introduce for_each_dev_scope()/for_each_active_dev_scope() to walk
{active} device scope entries. This will help following RCU lock
related patches.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Reduce duplicated code to handle virtual machine domains, there's no
functionality changes. It also improves code readability.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Enhance function get_domain_for_dev() to release allocated resources
if failed to create domain for PCIe endpoint, otherwise the allocated
resources will get lost.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Function get_domain_for_dev() is a little complex, simplify it
by factoring out dmar_search_domain_by_dev_info() and
dmar_insert_dev_info().
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Move private structures and variables into intel-iommu.c, which will
help to simplify locking policy for hotplug. Also delete redundant
declarations.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Function device_notifier() in intel-iommu.c only remove domain_device_info
data structure associated with a PCI device when handling PCI device
driver unbinding events. If a PCI device has never been bound to a PCI
device driver, there won't be BUS_NOTIFY_UNBOUND_DRIVER event when
hot-removing the PCI device. So associated domain_device_info data
structure may get lost.
On the other hand, if iommu_pass_through is enabled, function
iommu_prepare_static_indentify_mapping() will create domain_device_info
data structure for each PCIe to PCIe bridge and PCIe endpoint,
no matter whether there are drivers associated with those PCIe devices
or not. So those domain_device_info data structures will get lost when
hot-removing the assocated PCIe devices if they have never bound to
any PCI device driver.
To be even worse, it's not only an memory leak issue, but also an
caching of stale information bug because the memory are kept in
device_domain_list and domain->devices lists.
Fix the bug by trying to remove domain_device_info data structure when
handling BUS_NOTIFY_DEL_DEVICE event.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Function device_notifier() in intel-iommu.c fails to remove
device_domain_info data structures for PCI devices if they are
associated with si_domain because iommu_no_mapping() returns true
for those PCI devices. This will cause memory leak and caching of
stale information in domain->devices list.
So fix the issue by not calling iommu_no_mapping() and skipping check
of iommu_pass_through.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
A few patches have been queued up for this merge window:
* Improvements for the ARM-SMMU driver
(IOMMU_EXEC support, IOMMU group support)
* Updates and fixes for the shmobile IOMMU driver
* Various fixes to generic IOMMU code and the
Intel IOMMU driver
* Some cleanups in IOMMU drivers (dev_is_pci() usage)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJS6XrCAAoJECvwRC2XARrjgy4P/itemtg2U+603Ldje8WcPo0E
OCO/0VVSmCTYKUJDZY0hiVwqmhe5gFL3Hm/gGwkS0+UJenFXMmi+aVaPp4pCpgH+
dL2HD3dIEvi14bisrdxG/8MdR6mIx0qzKtnZLkKSR4LXwucyLHvC/DaCoOytb7Yk
7s+eEuo0hj0jAkiqSG/zLEtKElTEnoAAkLOjMy46orecJ5q4HusPZekLtWZs2ETe
x3NS63Unb9g1iSQJWIA7HnQlxWIr2+iynoamHHJRiVFzqRF0W0sGvQY3auG0DSCn
70WRNE1rKfEkfXMJxosRQ4394YUQdAkt8MBENNcJcC6E1n5PBi0cEZXH6mCnEIlG
jXzIKUY9fz68ZboaqIxXv4Hb+JLlPXCvPBvQzIQiKRgVxd8nncEjn5I9MHdf+je5
BmJlzJLJvP4cFvW8Hc8k2Oq101b1kEcSCLARWWvE9/bk9xIUyrqBkR4XjC0vb6qq
1HbKVdZ7KFKCkBHy9xMpr7CUjKiDiiLeUmqlhyjcK9spicuNIZQnC11HemL6/USP
oR6Ext9RGhvz+ch656+5+L6f6FURVP8/ywKiJ3RjmvXV5/fCYo3WMitOB2qzlWCy
SYXAczAOMOdOo+1Dxbghrr+7HzUWPqgfPmntZEPGMZhfuZ6xXr+7pGLjAhHb4vcR
SZxqkDo1cprqrR9KFAWC
=YKLk
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU Updates from Joerg Roedel:
"A few patches have been queued up for this merge window:
- improvements for the ARM-SMMU driver (IOMMU_EXEC support, IOMMU
group support)
- updates and fixes for the shmobile IOMMU driver
- various fixes to generic IOMMU code and the Intel IOMMU driver
- some cleanups in IOMMU drivers (dev_is_pci() usage)"
* tag 'iommu-updates-v3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (36 commits)
iommu/vt-d: Fix signedness bug in alloc_irte()
iommu/vt-d: free all resources if failed to initialize DMARs
iommu/vt-d, trivial: clean sparse warnings
iommu/vt-d: fix wrong return value of dmar_table_init()
iommu/vt-d: release invalidation queue when destroying IOMMU unit
iommu/vt-d: fix access after free issue in function free_dmar_iommu()
iommu/vt-d: keep shared resources when failed to initialize iommu devices
iommu/vt-d: fix invalid memory access when freeing DMAR irq
iommu/vt-d, trivial: simplify code with existing macros
iommu/vt-d, trivial: use defined macro instead of hardcoding
iommu/vt-d: mark internal functions as static
iommu/vt-d, trivial: clean up unused code
iommu/vt-d, trivial: check suitable flag in function detect_intel_iommu()
iommu/vt-d, trivial: print correct domain id of static identity domain
iommu/vt-d, trivial: refine support of 64bit guest address
iommu/vt-d: fix resource leakage on error recovery path in iommu_init_domains()
iommu/vt-d: fix a race window in allocating domain ID for virtual machines
iommu/vt-d: fix PCI device reference leakage on error recovery path
drm/msm: Fix link error with !MSM_IOMMU
iommu/vt-d: use dedicated bitmap to track remapping entry allocation status
...
dma_pte_free_level() has an off-by-one error when checking whether a pte
is completely covered by a range. Take for example the case of
attempting to free pfn 0x0 - 0x1ff, ie. 512 entries covering the first
2M superpage.
The level_size() is 0x200 and we test:
static void dma_pte_free_level(...
...
if (!(0 > 0 || 0x1ff < 0 + 0x200)) {
...
}
Clearly the 2nd test is true, which means we fail to take the branch to
clear and free the pagetable entry. As a result, we're leaking
pagetables and failing to install new pages over the range.
This was found with a PCI device assigned to a QEMU guest using vfio-pci
without a VGA device present. The first 1M of guest address space is
mapped with various combinations of 4K pages, but eventually the range
is entirely freed and replaced with a 2M contiguous mapping.
intel-iommu errors out with something like:
ERROR: DMA PTE for vPFN 0x0 already set (to 5c2b8003 not 849c00083)
In this case 5c2b8003 is the pointer to the previous leaf page that was
neither freed nor cleared and 849c00083 is the superpage entry that
we're trying to replace it with.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enhance intel_iommu_init() to free all resources if failed to
initialize DMAR hardware.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Clean up most sparse warnings in Intel DMA and interrupt remapping
drivers.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Simplify vt-d related code with existing macros and introduce a new
macro for_each_active_drhd_unit() to enumerate all active DRHD unit.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Remove dead code from VT-d related files.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Conflicts:
drivers/iommu/dmar.c
Field si_domain->id is set by iommu_attach_domain(), so we should only
print domain id for static identity domain after calling
iommu_attach_domain(si_domain, iommu), otherwise it's always zero.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
In Intel IOMMU driver, it calculate page table level from adjusted guest
address width as 'level = (agaw - 30) / 9', which assumes (agaw -30)
could be divided by 9. On the other hand, 64bit is a valid agaw and
(64 - 30) can't be divided by 9, so it needs special handling.
This patch enhances Intel IOMMU driver to correctly handle 64bit agaw.
It's mainly for code readability because there's no hardware supporting
64bit agaw yet.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Release allocated resources on error recovery path in function
iommu_init_domains().
Also improve printk messages in iommu_init_domains().
Acked-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Function intel_iommu_domain_init() may be concurrently called by upper
layer without serialization, so use atomic_t to protect domain id
allocation.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Use PCI standard marco dev_is_pci() instead of directly compare
pci_bus_type to check whether it is pci device.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
The BUG_ON in drivers/iommu/intel-iommu.c:785 can be triggered from userspace via
VFIO by calling the VFIO_IOMMU_MAP_DMA ioctl on a vfio device with any address
beyond the addressing capabilities of the IOMMU. The problem is that the ioctl code
calls iommu_iova_to_phys before it calls iommu_map. iommu_map handles the case that
it gets addresses beyond the addressing capabilities of its IOMMU.
intel_iommu_iova_to_phys does not.
This patch fixes iommu_iova_to_phys to return NULL for addresses beyond what the
IOMMU can handle. This in turn causes the ioctl call to fail in iommu_map and
(correctly) return EFAULT to the user with a helpful warning message in the kernel
log.
Signed-off-by: Julian Stecklina <jsteckli@os.inf.tu-dresden.de>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
At best the current code only seems to free the leaf pagetables and
the root. If you're unlucky enough to have a large gap (like any
QEMU guest with more than 3G of memory), only the first chunk of leaf
pagetables are freed (plus the root). This is a massive memory leak.
This patch re-writes the pagetable freeing function to use a
recursive algorithm and manages to not only free all the pagetables,
but does it without any apparent performance loss versus the current
broken version.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
If a device is multifunction and does not have ACS enabled then we
assume that the entire package lacks ACS and use function 0 as the
base of the group. The PCIe spec however states that components are
permitted to implement ACS on some, none, or all of their applicable
functions. It's therefore conceivable that function 0 may be fully
independent and support ACS while other functions do not. Instead
use the lowest function of the slot that does not have ACS enabled
as the base of the group. This may be the current device, which is
intentional. So long as we use a consistent algorithm, all the
non-ACS functions will be grouped together and ACS functions will
get separate groups.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
The swap_pci_ref function is used by the IOMMU API code for
swapping pci device pointers, while determining the iommu
group for the device.
Currently this function was being implemented for different
IOMMU drivers. This patch moves the function to a new file,
drivers/iommu/pci.h so that the implementation can be
shared across various IOMMU drivers.
Signed-off-by: Varun Sethi <Varun.Sethi@freescale.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>