linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-26 06:02:05 +00:00

Author	SHA1	Message	Date
Linus Torvalds	463f46e114	iommufd for 6.7 This branch has three new iommufd capabilities: - Dirty tracking for DMA. AMD/ARM/Intel CPUs can now record if a DMA writes to a page in the IOPTEs within the IO page table. This can be used to generate a record of what memory is being dirtied by DMA activities during a VM migration process. A VMM like qemu will combine the IOMMU dirty bits with the CPU's dirty log to determine what memory to transfer. VFIO already has a DMA dirty tracking framework that requires PCI devices to implement tracking HW internally. The iommufd version provides an alternative that the VMM can select, if available. The two are designed to have very similar APIs. - Userspace controlled attributes for hardware page tables (HWPT/iommu_domain). There are currently a few generic attributes for HWPTs (support dirty tracking, and parent of a nest). This is an entry point for the userspace iommu driver to control the HW in detail. - Nested translation support for HWPTs. This is a 2D translation scheme similar to the CPU where a DMA goes through a first stage to determine an intermediate address which is then translated trough a second stage to a physical address. Like for CPU translation the first stage table would exist in VM controlled memory and the second stage is in the kernel and matches the VM's guest to physical map. As every IOMMU has a unique set of parameter to describe the S1 IO page table and its associated parameters the userspace IOMMU driver has to marshal the information into the correct format. This is 1/3 of the feature, it allows creating the nested translation and binding it to VFIO devices, however the API to support IOTLB and ATC invalidation of the stage 1 io page table, and forwarding of IO faults are still in progress. The series includes AMD and Intel support for dirty tracking. Intel support for nested translation. Along the way are a number of internal items: - New iommu core items: ops->domain_alloc_user(), ops->set_dirty_tracking, ops->read_and_clear_dirty(), IOMMU_DOMAIN_NESTED, and iommu_copy_struct_from_user - UAF fix in iopt_area_split() - Spelling fixes and some test suite improvement -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZUDu2wAKCRCFwuHvBreF YcdeAQDaBmjyGLrRIlzPyohF6FrombyWo2512n51Hs8IHR4IvQEA3oRNgQ2tsJRr 1UPuOqnOD5T/oVX6AkUPRBwanCUQwwM= =nyJ3 -----END PGP SIGNATURE----- Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "This brings three new iommufd capabilities: - Dirty tracking for DMA. AMD/ARM/Intel CPUs can now record if a DMA writes to a page in the IOPTEs within the IO page table. This can be used to generate a record of what memory is being dirtied by DMA activities during a VM migration process. A VMM like qemu will combine the IOMMU dirty bits with the CPU's dirty log to determine what memory to transfer. VFIO already has a DMA dirty tracking framework that requires PCI devices to implement tracking HW internally. The iommufd version provides an alternative that the VMM can select, if available. The two are designed to have very similar APIs. - Userspace controlled attributes for hardware page tables (HWPT/iommu_domain). There are currently a few generic attributes for HWPTs (support dirty tracking, and parent of a nest). This is an entry point for the userspace iommu driver to control the HW in detail. - Nested translation support for HWPTs. This is a 2D translation scheme similar to the CPU where a DMA goes through a first stage to determine an intermediate address which is then translated trough a second stage to a physical address. Like for CPU translation the first stage table would exist in VM controlled memory and the second stage is in the kernel and matches the VM's guest to physical map. As every IOMMU has a unique set of parameter to describe the S1 IO page table and its associated parameters the userspace IOMMU driver has to marshal the information into the correct format. This is 1/3 of the feature, it allows creating the nested translation and binding it to VFIO devices, however the API to support IOTLB and ATC invalidation of the stage 1 io page table, and forwarding of IO faults are still in progress. The series includes AMD and Intel support for dirty tracking. Intel support for nested translation. Along the way are a number of internal items: - New iommu core items: ops->domain_alloc_user(), ops->set_dirty_tracking, ops->read_and_clear_dirty(), IOMMU_DOMAIN_NESTED, and iommu_copy_struct_from_user - UAF fix in iopt_area_split() - Spelling fixes and some test suite improvement" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (52 commits) iommufd: Organize the mock domain alloc functions closer to Joerg's tree iommufd/selftest: Fix page-size check in iommufd_test_dirty() iommufd: Add iopt_area_alloc() iommufd: Fix missing update of domains_itree after splitting iopt_area iommu/vt-d: Disallow read-only mappings to nest parent domain iommu/vt-d: Add nested domain allocation iommu/vt-d: Set the nested domain to a device iommu/vt-d: Make domain attach helpers to be extern iommu/vt-d: Add helper to setup pasid nested translation iommu/vt-d: Add helper for nested domain allocation iommu/vt-d: Extend dmar_domain to support nested domain iommufd: Add data structure for Intel VT-d stage-1 domain allocation iommu/vt-d: Enhance capability check for nested parent domain allocation iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs iommufd/selftest: Add nested domain allocation for mock domain iommu: Add iommu_copy_struct_from_user helper iommufd: Add a nested HW pagetable object iommu: Pass in parent domain with user_data to domain_alloc_user op iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable ...	2023-11-01 16:44:56 -10:00
Linus Torvalds	1e0c505e13	asm-generic updates for v6.7 The ia64 architecture gets its well-earned retirement as planned, now that there is one last (mostly) working release that will be maintained as an LTS kernel. The architecture specific system call tables are updated for the added map_shadow_stack() syscall and to remove references to the long-gone sys_lookup_dcookie() syscall. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmVC40IACgkQYKtH/8kJ Uidhmw/9EX+aWSXGoObJ3fngaNSMw+PmrEuP8qEKBHxfKHcCdX3hc451Oh4GlhaQ tru91pPwgNvN2/rfoKusxT+V4PemGIzfNni/04rp+P0kvmdw5otQ2yNhsQNsfVmq XGWvkxF4P2GO6bkjjfR/1dDq7GtlyXtwwPDKeLbYb6TnJOZjtx+EAN27kkfSn1Ms R4Sa3zJ+DfHUmHL5S9g+7UD/CZ5GfKNmIskI4Mz5GsfoUz/0iiU+Bge/9sdcdSJQ kmbLy5YnVzfooLZ3TQmBFsO3iAMWb0s/mDdtyhqhTVmTUshLolkPYyKnPFvdupyv shXcpEST2XJNeaDRnL2K4zSCdxdbnCZHDpjfl9wfioBg7I8NfhXKpf1jYZHH1de4 LXq8ndEFEOVQw/zSpYWfQq1sux8Jiqr+UK/ukbVeFWiGGIUs91gEWtPAf8T0AZo9 ujkJvaWGl98O1g5wmBu0/dAR6QcFJMDfVwbmlIFpU8O+MEaz6X8mM+O5/T0IyTcD eMbAUjj4uYcU7ihKzHEv/0SS9Of38kzff67CLN5k8wOP/9NlaGZ78o1bVle9b52A BdhrsAefFiWHp1jT6Y9Rg4HOO/TguQ9e6EWSKOYFulsiLH9LEFaB9RwZLeLytV0W vlAgY9rUW77g1OJcb7DoNv33nRFuxsKqsnz3DEIXtgozo9CzbYI= =H1vH -----END PGP SIGNATURE----- Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull ia64 removal and asm-generic updates from Arnd Bergmann: - The ia64 architecture gets its well-earned retirement as planned, now that there is one last (mostly) working release that will be maintained as an LTS kernel. - The architecture specific system call tables are updated for the added map_shadow_stack() syscall and to remove references to the long-gone sys_lookup_dcookie() syscall. * tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: hexagon: Remove unusable symbols from the ptrace.h uapi asm-generic: Fix spelling of architecture arch: Reserve map_shadow_stack() syscall number for all architectures syscalls: Cleanup references to sys_lookup_dcookie() Documentation: Drop or replace remaining mentions of IA64 lib/raid6: Drop IA64 support Documentation: Drop IA64 from feature descriptions kernel: Drop IA64 support from sig_fault handlers arch: Remove Itanium (IA-64) architecture	2023-11-01 15:28:33 -10:00
Jason Gunthorpe	b2b67c997b	iommufd: Organize the mock domain alloc functions closer to Joerg's tree Patches in Joerg's iommu tree to convert the mock driver to use domain_alloc_paging() that clash badly with the way the selftest changes for nesting were structured. Massage the selftest so that it looks closer the code after the domain_alloc_paging() conversion to ease the merge. Change __mock_domain_alloc_paging() into mock_domain_alloc_paging() in the same way as the iommu tree. The merge resolution then trivially takes both and deletes mock_domain_alloc(). Link: https://lore.kernel.org/r/0-v1-90a855762c96+19de-mock_merge_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-30 18:01:56 -03:00
Joao Martins	2e22aac3ea	iommufd/selftest: Fix page-size check in iommufd_test_dirty() iommufd_test_dirty()/IOMMU_TEST_OP_DIRTY sets the dirty bits in the mock domain implementation that the userspace side validates against what it obtains via the UAPI. However in introducing iommufd_test_dirty() it forgot to validate page_size being 0 leading to two possible divide-by-zero problems: one at the beginning when calculating @max and while calculating the IOVA in the XArray PFN tracking list. While at it, validate the length to require non-zero value as well, as we can't be allocating a 0-sized bitmap. Link: https://lore.kernel.org/r/20231030113446.7056-1-joao.m.martins@oracle.com Reported-by: syzbot+25dc7383c30ecdc83c38@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-iommu/00000000000005f6aa0608b9220f@google.com/ Fixes: `a9af47e382` ("iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_BITMAP") Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-30 12:03:20 -03:00
Jason Gunthorpe	361d744ddd	iommufd: Add iopt_area_alloc() We never initialize the two interval tree nodes, and zero fill is not the same as RB_CLEAR_NODE. This can hide issues where we missed adding the area to the trees. Factor out the allocation and clear the two nodes. Fixes: `51fe6141f0` ("iommufd: Data structure to provide IOVA to PFN mapping") Link: https://lore.kernel.org/r/20231030145035.GG691768@ziepe.ca Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-30 11:36:04 -03:00
Koichiro Den	e7250ab7ca	iommufd: Fix missing update of domains_itree after splitting iopt_area In iopt_area_split(), if the original iopt_area has filled a domain and is linked to domains_itree, pages_nodes have to be properly reinserted. Otherwise the domains_itree becomes corrupted and we will UAF. Fixes: `51fe6141f0` ("iommufd: Data structure to provide IOVA to PFN mapping") Link: https://lore.kernel.org/r/20231027162941.2864615-2-den@valinux.co.jp Cc: stable@vger.kernel.org Signed-off-by: Koichiro Den <den@valinux.co.jp> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-30 11:36:04 -03:00
Lu Baolu	6e6c6d6bc6	iommu: Avoid unnecessary cache invalidations The iommu_create_device_direct_mappings() only needs to flush the caches when the mappings are changed in the affected domain. This is not true for non-DMA domains, or for devices attached to the domain that have no reserved regions. To avoid unnecessary cache invalidations, add a check before iommu_flush_iotlb_all(). Fixes: `a48ce36e27` ("iommu: Prevent RESV_DIRECT devices from blocking domains") Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Henry Willard <henry.willard@oracle.com> Link: https://lore.kernel.org/r/20231026084942.17387-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-10-27 08:57:09 +02:00
Lu Baolu	03476e687e	iommu/vt-d: Disallow read-only mappings to nest parent domain When remapping hardware is configured by system software in scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled) in first-stage page-table entries even when second-stage mappings indicate that corresponding first-stage page-table is Read-Only. As the result, contents of pages designated by VMM as Read-Only can be modified by IOMMU via PML5E (PML4E for 4-level tables) access as part of address translation process due to DMAs issued by Guest. This disallows read-only mappings in the domain that is supposed to be used as nested parent. Reference from Sapphire Rapids Specification Update [1], errata details, SPR17. Userspace should know this limitation by checking the IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 flag reported in the IOMMU_GET_HW_INFO ioctl. [1] https://www.intel.com/content/www/us/en/content-details/772415/content-details.html Link: https://lore.kernel.org/r/20231026044216.64964-9-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:16:34 -03:00
Lu Baolu	b41e38e225	iommu/vt-d: Add nested domain allocation This adds the support for IOMMU_HWPT_DATA_VTD_S1 type. And 'nested_parent' is added to mark the nested parent domain to sanitize the input parent domain. Link: https://lore.kernel.org/r/20231026044216.64964-8-yi.l.liu@intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:16:34 -03:00
Yi Liu	9838f2bb6b	iommu/vt-d: Set the nested domain to a device This adds the helper for setting the nested domain to a device hence enable nested domain usage on Intel VT-d. Link: https://lore.kernel.org/r/20231026044216.64964-7-yi.l.liu@intel.com Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:16:34 -03:00
Yi Liu	d86724d4dc	iommu/vt-d: Make domain attach helpers to be extern This makes the helpers visible to nested.c. Link: https://lore.kernel.org/r/20231026044216.64964-6-yi.l.liu@intel.com Suggested-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:16:34 -03:00
Lu Baolu	111bf85c68	iommu/vt-d: Add helper to setup pasid nested translation The configurations are passed in from the user when the user domain is allocated. This helper interprets these configurations according to the data structure defined in uapi/linux/iommufd.h. The EINVAL error will be returned if any of configurations are not compatible with the hardware capabilities. The caller can retry with another compatible user domain. The encoding of fields of each pasid entry is defined in section 9.6 of the VT-d spec. Link: https://lore.kernel.org/r/20231026044216.64964-5-yi.l.liu@intel.com Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:16:33 -03:00
Lu Baolu	79ae1eccd3	iommu/vt-d: Add helper for nested domain allocation This adds helper for accepting user parameters and allocate a nested domain. Link: https://lore.kernel.org/r/20231026044216.64964-4-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:16:33 -03:00
Lu Baolu	04f261ac23	iommu/vt-d: Extend dmar_domain to support nested domain The nested domain fields are exclusive to those that used for a DMA remapping domain. Use union to avoid memory waste. Link: https://lore.kernel.org/r/20231026044216.64964-3-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:16:33 -03:00
Yi Liu	a2cdecdf9d	iommu/vt-d: Enhance capability check for nested parent domain allocation This adds the scalable mode check before allocating the nested parent domain as checking nested capability is not enough. User may turn off scalable mode which also means no nested support even if the hardware supports it. Fixes: `c97d1b20d3` ("iommu/vt-d: Add domain_alloc_user op") Link: https://lore.kernel.org/r/20231024150011.44642-1-yi.l.liu@intel.com Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:16:11 -03:00
Nicolin Chen	65fe32f7a4	iommufd/selftest: Add nested domain allocation for mock domain Add nested domain support in the ->domain_alloc_user op with some proper sanity checks. Then, add a domain_nested_ops for all nested domains and split the get_md_pagetable helper into paging and nested helpers. Also, add an iotlb as a testing property of a nested domain. Link: https://lore.kernel.org/r/20231026043938.63898-10-yi.l.liu@intel.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:15:57 -03:00
Nicolin Chen	bd529dbb66	iommufd: Add a nested HW pagetable object IOMMU_HWPT_ALLOC already supports iommu_domain allocation for usersapce. But it can only allocate a hw_pagetable that associates to a given IOAS, i.e. only a kernel-managed hw_pagetable of IOMMUFD_OBJ_HWPT_PAGING type. IOMMU drivers can now support user-managed hw_pagetables, for two-stage translation use cases that require user data input from the user space. Add a new IOMMUFD_OBJ_HWPT_NESTED type with its abort/destroy(). Pair it with a new iommufd_hwpt_nested structure and its to_hwpt_nested() helper. Update the to_hwpt_paging() helper, so a NESTED-type hw_pagetable can be handled in the callers, for example iommufd_hw_pagetable_enforce_rr(). Screen the inputs including the parent PAGING-type hw_pagetable that has a need of a new nest_parent flag in the iommufd_hwpt_paging structure. Extend the IOMMU_HWPT_ALLOC ioctl to accept an IOMMU driver specific data input which is tagged by the enum iommu_hwpt_data_type. Also, update the @pt_id to accept hwpt_id too besides an ioas_id. Then, use them to allocate a hw_pagetable of IOMMUFD_OBJ_HWPT_NESTED type using the iommufd_hw_pagetable_alloc_nested() allocator. Link: https://lore.kernel.org/r/20231026043938.63898-8-yi.l.liu@intel.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Co-developed-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:15:57 -03:00
Yi Liu	2bdabb8e82	iommu: Pass in parent domain with user_data to domain_alloc_user op domain_alloc_user op already accepts user flags for domain allocation, add a parent domain pointer and a driver specific user data support as well. The user data would be tagged with a type for iommu drivers to add their own driver specific user data per hw_pagetable. Add a struct iommu_user_data as a bundle of data_ptr/data_len/type from an iommufd core uAPI structure. Make the user data opaque to the core, since a userspace driver must match the kernel driver. In the future, if drivers share some common parameter, there would be a generic parameter as well. Link: https://lore.kernel.org/r/20231026043938.63898-7-yi.l.liu@intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:15:57 -03:00
Nicolin Chen	b5021cb264	iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED Allow iommufd_hwpt_alloc() to have a common routine but jump to different allocators corresponding to different user input pt_obj types, either an IOMMUFD_OBJ_IOAS for a PAGING hwpt or an IOMMUFD_OBJ_HWPT_PAGING as the parent for a NESTED hwpt. Also, move the "flags" validation to the hwpt allocator (paging), so that later the hwpt_nested allocator can do its own separate flags validation. Link: https://lore.kernel.org/r/20231026043938.63898-6-yi.l.liu@intel.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:15:57 -03:00
Nicolin Chen	89db31635c	iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable To prepare for IOMMUFD_OBJ_HWPT_NESTED, derive struct iommufd_hwpt_paging from struct iommufd_hw_pagetable, by leaving the common members in struct iommufd_hw_pagetable. Add a __iommufd_object_alloc and to_hwpt_paging() helpers for the new structure. Then, update "hwpt" to "hwpt_paging" throughout the files, accordingly. Link: https://lore.kernel.org/r/20231026043938.63898-5-yi.l.liu@intel.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:15:56 -03:00
Jason Gunthorpe	58d84f430d	iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations Some of the configurations during the attach/replace() should only apply to IOMMUFD_OBJ_HWPT_PAGING. Once IOMMUFD_OBJ_HWPT_NESTED gets introduced in a following patch, keeping them unconditionally in the common routine will not work. Wrap all of those PAGING-only configurations together into helpers. Do a hwpt_is_paging check whenever calling them or their fallback routines. Link: https://lore.kernel.org/r/20231026043938.63898-4-yi.l.liu@intel.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:15:56 -03:00
Jason Gunthorpe	9744a7ab62	iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING To add a new IOMMUFD_OBJ_HWPT_NESTED, rename the HWPT object to confine it to PAGING hwpts/domains. The following patch will separate the hwpt structure as well. Link: https://lore.kernel.org/r/20231026043938.63898-3-yi.l.liu@intel.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-26 11:15:56 -03:00
Nicolin Chen	2ccabf81dd	iommufd: Only enforce cache coherency in iommufd_hw_pagetable_alloc According to the conversation in the following link: https://lore.kernel.org/linux-iommu/20231020135501.GG3952@nvidia.com/ The enforce_cache_coherency should be set/enforced in the hwpt allocation routine. The iommu driver in its attach_dev() op should decide whether to reject or not a device that doesn't match with the configuration of cache coherency. Drop the enforce_cache_coherency piece in the attach/replace() and move the remaining "num_devices" piece closer to the refcount that is using it. Accordingly drop its function prototype in the header and mark it static. Also add some extra comments to clarify the expected behaviors. Link: https://lore.kernel.org/r/20231024012958.30842-1-nicolinc@nvidia.com Suggested-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 12:56:37 -03:00
Joao Martins	0795b305da	iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag Change test_mock_dirty_bitmaps() to pass a flag where it specifies the flag under test. The test does the same thing as the GET_DIRTY_BITMAP regular test. Except that it tests whether the dirtied bits are fetched all the same a second time, as opposed to observing them cleared. Link: https://lore.kernel.org/r/20231024135109.73787-19-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:44 -03:00
Joao Martins	ae36fe70ce	iommufd/selftest: Test out_capabilities in IOMMU_GET_HW_INFO Enumerate the capabilities from the mock device and test whether it advertises as expected. Include it as part of the iommufd_dirty_tracking fixture. Link: https://lore.kernel.org/r/20231024135109.73787-18-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:44 -03:00
Joao Martins	a9af47e382	iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_BITMAP Add a new test ioctl for simulating the dirty IOVAs in the mock domain, and implement the mock iommu domain ops that get the dirty tracking supported. The selftest exercises the usual main workflow of: 1) Setting dirty tracking from the iommu domain 2) Read and clear dirty IOPTEs Different fixtures will test different IOVA range sizes, that exercise corner cases of the bitmaps. Link: https://lore.kernel.org/r/20231024135109.73787-17-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:44 -03:00
Joao Martins	7adf267d66	iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY_TRACKING Change mock_domain to supporting dirty tracking and add tests to exercise the new SET_DIRTY_TRACKING API in the iommufd_dirty_tracking selftest fixture. Link: https://lore.kernel.org/r/20231024135109.73787-16-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:44 -03:00
Joao Martins	266ce58989	iommufd/selftest: Test IOMMU_HWPT_ALLOC_DIRTY_TRACKING In order to selftest the iommu domain dirty enforcing implement the mock_domain necessary support and add a new dev_flags to test that the hwpt_alloc/attach_device fails as expected. Expand the existing mock_domain fixture with a enforce_dirty test that exercises the hwpt_alloc and device attachment. Link: https://lore.kernel.org/r/20231024135109.73787-15-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:44 -03:00
Joao Martins	e04b23c8d4	iommufd/selftest: Expand mock_domain with dev_flags Expand mock_domain test to be able to manipulate the device capabilities. This allows testing with mockdev without dirty tracking support advertised and thus make sure enforce_dirty test does the expected. To avoid breaking IOMMUFD_TEST UABI replicate the mock_domain struct and thus add an input dev_flags at the end. Link: https://lore.kernel.org/r/20231024135109.73787-14-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:44 -03:00
Joao Martins	f35f22cc76	iommu/vt-d: Access/Dirty bit support for SS domains IOMMU advertises Access/Dirty bits for second-stage page table if the extended capability DMAR register reports it (ECAP, mnemonic ECAP.SSADS). The first stage table is compatible with CPU page table thus A/D bits are implicitly supported. Relevant Intel IOMMU SDM ref for first stage table "3.6.2 Accessed, Extended Accessed, and Dirty Flags" and second stage table "3.7.2 Accessed and Dirty Flags". First stage page table is enabled by default so it's allowed to set dirty tracking and no control bits needed, it just returns 0. To use SSADS, set bit 9 (SSADE) in the scalable-mode PASID table entry and flush the IOTLB via pasid_flush_caches() following the manual. Relevant SDM refs: "3.7.2 Accessed and Dirty Flags" "6.5.3.3 Guidance to Software for Invalidations, Table 23. Guidance to Software for Invalidations" PTE dirty bit is located in bit 9 and it's cached in the IOTLB so flush IOTLB to make sure IOMMU attempts to set the dirty bit again. Note that iommu_dirty_bitmap_record() will add the IOVA to iotlb_gather and thus the caller of the iommu op will flush the IOTLB. Relevant manuals over the hardware translation is chapter 6 with some special mention to: "6.2.3.1 Scalable-Mode PASID-Table Entry Programming Considerations" "6.2.4 IOTLB" Select IOMMUFD_DRIVER only if IOMMUFD is enabled, given that IOMMU dirty tracking requires IOMMUFD. Link: https://lore.kernel.org/r/20231024135109.73787-13-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:43 -03:00
Joao Martins	421a511a29	iommu/amd: Access/Dirty bit support in IOPTEs IOMMU advertises Access/Dirty bits if the extended feature register reports it. Relevant AMD IOMMU SDM ref[0] "1.3.8 Enhanced Support for Access and Dirty Bits" To enable it set the DTE flag in bits 7 and 8 to enable access, or access+dirty. With that, the IOMMU starts marking the D and A flags on every Memory Request or ATS translation request. It is on the VMM side to steer whether to enable dirty tracking or not, rather than wrongly doing in IOMMU. Relevant AMD IOMMU SDM ref [0], "Table 7. Device Table Entry (DTE) Field Definitions" particularly the entry "HAD". To actually toggle on and off it's relatively simple as it's setting 2 bits on DTE and flush the device DTE cache. To get what's dirtied use existing AMD io-pgtable support, by walking the pagetables over each IOVA, with fetch_pte(). The IOTLB flushing is left to the caller (much like unmap), and iommu_dirty_bitmap_record() is the one adding page-ranges to invalidate. This allows caller to batch the flush over a big span of IOVA space, without the iommu wondering about when to flush. Worthwhile sections from AMD IOMMU SDM: "2.2.3.1 Host Access Support" "2.2.3.2 Host Dirty Support" For details on how IOMMU hardware updates the dirty bit see, and expects from its consequent clearing by CPU: "2.2.7.4 Updating Accessed and Dirty Bits in the Guest Address Tables" "2.2.7.5 Clearing Accessed and Dirty Bits" Quoting the SDM: "The setting of accessed and dirty status bits in the page tables is visible to both the CPU and the peripheral when sharing guest page tables. The IOMMU interlocked operations to update A and D bits must be 64-bit operations and naturally aligned on a 64-bit boundary" .. and for the IOMMU update sequence to Dirty bit, essentially is states: 1. Decodes the read and write intent from the memory access. 2. If P=0 in the page descriptor, fail the access. 3. Compare the A & D bits in the descriptor with the read and write intent in the request. 4. If the A or D bits need to be updated in the descriptor: * Start atomic operation. * Read the descriptor as a 64-bit access. * If the descriptor no longer appears to require an update, release the atomic lock with no further action and continue to step 5. * Calculate the new A & D bits. * Write the descriptor as a 64-bit access. * End atomic operation. 5. Continue to the next stage of translation or to the memory access. Access/Dirty bits readout also need to consider the non-default page-sizes (aka replicated PTEs as mentined by manual), as AMD supports all powers of two (except 512G) page sizes. Select IOMMUFD_DRIVER only if IOMMUFD is enabled considering that IOMMU dirty tracking requires IOMMUFD. Link: https://lore.kernel.org/r/20231024135109.73787-12-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:43 -03:00
Joao Martins	134288158a	iommu/amd: Add domain_alloc_user based domain allocation Add the domain_alloc_user op implementation. To that end, refactor amd_iommu_domain_alloc() to receive a dev pointer and flags, while renaming it too, such that it becomes a common function shared with domain_alloc_user() implementation. The sole difference with domain_alloc_user() is that we initialize also other fields that iommu_domain_alloc() does. It lets it return the iommu domain correctly initialized in one function. This is in preparation to add dirty enforcement on AMD implementation of domain_alloc_user. Link: https://lore.kernel.org/r/20231024135109.73787-11-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:43 -03:00
Joao Martins	609848132c	iommufd: Add a flag to skip clearing of IOPTE dirty VFIO has an operation where it unmaps an IOVA while returning a bitmap with the dirty data. In reality the operation doesn't quite query the IO pagetables that the PTE was dirty or not. Instead it marks as dirty on anything that was mapped, and doing so in one syscall. In IOMMUFD the equivalent is done in two operations by querying with GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB flushes given that after clearing dirty bits IOMMU implementations require invalidating their IOTLB, plus another invalidation needed for the UNMAP. To allow dirty bits to be queried faster, add a flag (IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR) that requests to not clear the dirty bits from the PTE (but just reading them), under the expectation that the next operation is the unmap. An alternative is to unmap and just perpectually mark as dirty as that's the same behaviour as today. So here equivalent functionally can be provided with unmap alone, and if real dirty info is required it will amortize the cost while querying. There's still a race against DMA where in theory the unmap of the IOVA (when the guest invalidates the IOTLB via emulated iommu) would race against the VF performing DMA on the same IOVA. As discussed in [0], we are accepting to resolve this race as throwing away the DMA and it doesn't matter if it hit physical DRAM or not, the VM can't tell if we threw it away because the DMA was blocked or because we failed to copy the DRAM. [0] https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/ Link: https://lore.kernel.org/r/20231024135109.73787-10-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:43 -03:00
Joao Martins	7623683857	iommufd: Add capabilities to IOMMU_GET_HW_INFO Extend IOMMUFD_CMD_GET_HW_INFO op to query generic iommu capabilities for a given device. Capabilities are IOMMU agnostic and use device_iommu_capable() API passing one of the IOMMU_CAP_*. Enumerate IOMMU_CAP_DIRTY_TRACKING for now in the out_capabilities field returned back to userspace. Link: https://lore.kernel.org/r/20231024135109.73787-9-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:43 -03:00
Joao Martins	b9a60d6f85	iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP Connect a hw_pagetable to the IOMMU core dirty tracking read_and_clear_dirty iommu domain op. It exposes all of the functionality for the UAPI that read the dirtied IOVAs while clearing the Dirty bits from the PTEs. In doing so, add an IO pagetable API iopt_read_and_clear_dirty_data() that performs the reading of dirty IOPTEs for a given IOVA range and then copying back to userspace bitmap. Underneath it uses the IOMMU domain kernel API which will read the dirty bits, as well as atomically clearing the IOPTE dirty bit and flushing the IOTLB at the end. The IOVA bitmaps usage takes care of the iteration of the bitmaps user pages efficiently and without copies. Within the iterator function we iterate over io-pagetable contigous areas that have been mapped. Contrary to past incantation of a similar interface in VFIO the IOVA range to be scanned is tied in to the bitmap size, thus the application needs to pass a appropriately sized bitmap address taking into account the iova range being passed and page size ... as opposed to allowing bitmap-iova != iova. Link: https://lore.kernel.org/r/20231024135109.73787-8-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:43 -03:00
Joao Martins	e2a4b29478	iommufd: Add IOMMU_HWPT_SET_DIRTY_TRACKING Every IOMMU driver should be able to implement the needed iommu domain ops to control dirty tracking. Connect a hw_pagetable to the IOMMU core dirty tracking ops, specifically the ability to enable/disable dirty tracking on an IOMMU domain (hw_pagetable id). To that end add an io_pagetable kernel API to toggle dirty tracking: * iopt_set_dirty_tracking(iopt, [domain], state) The intended caller of this is via the hw_pagetable object that is created. Internally it will ensure the leftover dirty state is cleared /right before/ dirty tracking starts. This is also useful for iommu drivers which may decide that dirty tracking is always-enabled at boot without wanting to toggle dynamically via corresponding iommu domain op. Link: https://lore.kernel.org/r/20231024135109.73787-7-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:43 -03:00
Joao Martins	5f9bdbf4c6	iommufd: Add a flag to enforce dirty tracking on attach Throughout IOMMU domain lifetime that wants to use dirty tracking, some guarantees are needed such that any device attached to the iommu_domain supports dirty tracking. The idea is to handle a case where IOMMU in the system are assymetric feature-wise and thus the capability may not be supported for all devices. The enforcement is done by adding a flag into HWPT_ALLOC namely: IOMMU_HWPT_ALLOC_DIRTY_TRACKING .. Passed in HWPT_ALLOC ioctl() flags. The enforcement is done by creating a iommu_domain via domain_alloc_user() and validating the requested flags with what the device IOMMU supports (and failing accordingly) advertised). Advertising the new IOMMU domain feature flag requires that the individual iommu driver capability is supported when a future device attachment happens. Link: https://lore.kernel.org/r/20231024135109.73787-6-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:42 -03:00
Joao Martins	13578d4ebe	iommufd/iova_bitmap: Move symbols to IOMMUFD namespace Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol export convention i.e. using the IOMMUFD namespace. In doing so, import the namespace in the current users. This means VFIO and the vfio-pci drivers that use iova_bitmap_set(). Link: https://lore.kernel.org/r/20231024135109.73787-4-joao.m.martins@oracle.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:42 -03:00
Joao Martins	8c9c727b61	vfio: Move iova_bitmap into iommufd Both VFIO and IOMMUFD will need iova bitmap for storing dirties and walking the user bitmaps, so move to the common dependency into IOMMUFD. In doing so, create the symbol IOMMUFD_DRIVER which designates the builtin code that will be used by drivers when selected. Today this means MLX5_VFIO_PCI and PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when supporting dirty tracking and select IOMMUFD_DRIVER accordingly. Given that the symbol maybe be disabled, add header definitions in iova_bitmap.h for when IOMMUFD_DRIVER=n Link: https://lore.kernel.org/r/20231024135109.73787-3-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-24 11:58:42 -03:00
Yi Liu	c97d1b20d3	iommu/vt-d: Add domain_alloc_user op Add the domain_alloc_user() op implementation. It supports allocating domains to be used as parent under nested translation. Unlike other drivers VT-D uses only a single page table format so it only needs to check if the HW can support nesting. Link: https://lore.kernel.org/r/20230928071528.26258-7-yi.l.liu@intel.com Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-10 13:31:24 -03:00
Yi Liu	408663619f	iommufd/selftest: Add domain_alloc_user() support in iommu mock Add mock_domain_alloc_user() and a new test case for IOMMU_HWPT_ALLOC_NEST_PARENT. Link: https://lore.kernel.org/r/20230928071528.26258-6-yi.l.liu@intel.com Co-developed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-10 13:31:24 -03:00
Yi Liu	4ff5421633	iommufd: Support allocating nested parent domain Extend IOMMU_HWPT_ALLOC to allocate domains to be used as parent (stage-2) in nested translation. Add IOMMU_HWPT_ALLOC_NEST_PARENT to the uAPI. Link: https://lore.kernel.org/r/20230928071528.26258-5-yi.l.liu@intel.com Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-10 13:31:24 -03:00
Yi Liu	89d63875d8	iommufd: Flow user flags for domain allocation to domain_alloc_user() Extends iommufd_hw_pagetable_alloc() to accept user flags, the uAPI will provide the flags. Link: https://lore.kernel.org/r/20230928071528.26258-4-yi.l.liu@intel.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-10 13:31:24 -03:00
Yi Liu	7975b72208	iommufd: Use the domain_alloc_user() op for domain allocation Make IOMMUFD use iommu_domain_alloc_user() by default for iommu_domain creation. IOMMUFD needs to support iommu_domain allocation with parameters from userspace in nested support, and a driver is expected to implement everything under this op. If the iommu driver doesn't provide domain_alloc_user callback then IOMMUFD falls back to use iommu_domain_alloc() with an UNMANAGED type if possible. Link: https://lore.kernel.org/r/20230928071528.26258-3-yi.l.liu@intel.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2023-10-10 13:31:24 -03:00
Zhang Rui	59df44bfb0	iommu/vt-d: Avoid memory allocation in iommu_suspend() The iommu_suspend() syscore suspend callback is invoked with IRQ disabled. Allocating memory with the GFP_KERNEL flag may re-enable IRQs during the suspend callback, which can cause intermittent suspend/hibernation problems with the following kernel traces: Calling iommu_suspend+0x0/0x1d0 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 15 at kernel/time/timekeeping.c:868 ktime_get+0x9b/0xb0 ... CPU: 0 PID: 15 Comm: rcu_preempt Tainted: G U E 6.3-intel #r1 RIP: 0010:ktime_get+0x9b/0xb0 ... Call Trace: <IRQ> tick_sched_timer+0x22/0x90 ? __pfx_tick_sched_timer+0x10/0x10 __hrtimer_run_queues+0x111/0x2b0 hrtimer_interrupt+0xfa/0x230 __sysvec_apic_timer_interrupt+0x63/0x140 sysvec_apic_timer_interrupt+0x7b/0xa0 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1f/0x30 ... ------------[ cut here ]------------ Interrupts enabled after iommu_suspend+0x0/0x1d0 WARNING: CPU: 0 PID: 27420 at drivers/base/syscore.c:68 syscore_suspend+0x147/0x270 CPU: 0 PID: 27420 Comm: rtcwake Tainted: G U W E 6.3-intel #r1 RIP: 0010:syscore_suspend+0x147/0x270 ... Call Trace: <TASK> hibernation_snapshot+0x25b/0x670 hibernate+0xcd/0x390 state_store+0xcf/0xe0 kobj_attr_store+0x13/0x30 sysfs_kf_write+0x3f/0x50 kernfs_fop_write_iter+0x128/0x200 vfs_write+0x1fd/0x3c0 ksys_write+0x6f/0xf0 __x64_sys_write+0x1d/0x30 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Given that only 4 words memory is needed, avoid the memory allocation in iommu_suspend(). CC: stable@kernel.org Fixes: `33e0715710` ("iommu/vt-d: Avoid GFP_ATOMIC where it is not needed") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Ooi, Chin Hao <chin.hao.ooi@intel.com> Link: https://lore.kernel.org/r/20230921093956.234692-1-rui.zhang@intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20230925120417.55977-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-09-25 16:10:36 +02:00
Hector Martin	c7bd8a1f45	iommu/apple-dart: Handle DMA_FQ domains in attach_dev() Commit `a4fdd97622` ("iommu: Use flush queue capability") hid the IOMMU_DOMAIN_DMA_FQ domain type from domain allocation. A check was introduced in iommu_dma_init_domain() to fall back if not supported, but this check runs too late: by that point, devices have been attached to the IOMMU, and apple-dart's attach_dev() callback does not expect IOMMU_DOMAIN_DMA_FQ domains. Change the logic so the IOMMU_DOMAIN_DMA codepath is the default, instead of explicitly enumerating all types. Fixes an apple-dart regression in v6.5. Cc: regressions@lists.linux.dev Cc: stable@vger.kernel.org Suggested-by: Robin Murphy <robin.murphy@arm.com> Fixes: `a4fdd97622` ("iommu: Use flush queue capability") Signed-off-by: Hector Martin <marcan@marcan.st> Reviewed-by: Neal Gompa <neal@gompa.dev> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230922-iommu-type-regression-v2-1-689b2ba9b673@marcan.st Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-09-25 11:37:19 +02:00
Joerg Roedel	7accef5353	Arm SMMU fixes for 6.6 -rc - Fix TLB range command encoding when TTL, Num and Scale are all zero - Fix soft lockup by limiting TLB invalidation ops issued by SVA - Fix clocks description for SDM630 platform in arm-smmu DT binding -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmUNbgwQHHdpbGxAa2Vy bmVsLm9yZwAKCRC3rHDchMFjND6PCACVc83E8PykHXnkffJHyBjBLG9l5wAmaA13 kPqcfevXrSUBL7BcqhskydodQ61+23I3zqgQce+6lXzvSIDwfXqZDUghQWpthYNh 5ZWeQy/R8XS90MLLMsLl0b7UzwdlZYZPZKY3VPHQJKP0otHrDVB3prz/CQb7Ccnl WGGwz2ZgsCatqf599ALFHsE6Xw3GWCODxhrC0FBF7lnyyde/6ElhM/rgrPER4UHz IsjF9cwlVgQ4MloWjoJb2MtSUGHnMi9fmHktTrriI07ReJYdI0LmwjJtgBv48Uzr CcLbgOyYsHvsz0hGIJZwXR3+FfrVNuoWBo5/AiZQvAHRVxnftGrZ =a32m -----END PGP SIGNATURE----- Merge tag 'arm-smmu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into iommu/fixes Arm SMMU fixes for 6.6 -rc - Fix TLB range command encoding when TTL, Num and Scale are all zero - Fix soft lockup by limiting TLB invalidation ops issued by SVA - Fix clocks description for SDM630 platform in arm-smmu DT binding	2023-09-25 11:34:53 +02:00
Yong Wu	b07eba71a5	iommu/mediatek: Fix share pgtable for iova over 4GB In mt8192/mt8186, there is only one MM IOMMU that supports 16GB iova space, which is shared by display, vcodec and camera. These two SoC use one pgtable and have not the flag SHARE_PGTABLE, we should also keep share pgtable for this case. In mtk_iommu_domain_finalise, MM IOMMU always share pgtable, thus remove the flag SHARE_PGTABLE checking. Infra IOMMU always uses independent pgtable. Fixes: `cf69ef46db` ("iommu/mediatek: Fix two IOMMU share pagetable issue") Reported-by: Laura Nao <laura.nao@collabora.com> Closes: https://lore.kernel.org/linux-iommu/20230818154156.314742-1-laura.nao@collabora.com/ Signed-off-by: Yong Wu <yong.wu@mediatek.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Tested-by: Laura Nao <laura.nao@collabora.com> Link: https://lore.kernel.org/r/20230819081443.8333-1-yong.wu@mediatek.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-09-25 11:34:02 +02:00
Nicolin Chen	d5afb4b47e	iommu/arm-smmu-v3: Fix soft lockup triggered by arm_smmu_mm_invalidate_range When running an SVA case, the following soft lockup is triggered: -------------------------------------------------------------------- watchdog: BUG: soft lockup - CPU#244 stuck for 26s! pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : arm_smmu_cmdq_issue_cmdlist+0x178/0xa50 lr : arm_smmu_cmdq_issue_cmdlist+0x150/0xa50 sp : ffff8000d83ef290 x29: ffff8000d83ef290 x28: 000000003b9aca00 x27: 0000000000000000 x26: ffff8000d83ef3c0 x25: da86c0812194a0e8 x24: 0000000000000000 x23: 0000000000000040 x22: ffff8000d83ef340 x21: ffff0000c63980c0 x20: 0000000000000001 x19: ffff0000c6398080 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: ffff3000b4a3bbb0 x14: ffff3000b4a30888 x13: ffff3000b4a3cf60 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc08120e4d6bc x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000048cfa x5 : 0000000000000000 x4 : 0000000000000001 x3 : 000000000000000a x2 : 0000000080000000 x1 : 0000000000000000 x0 : 0000000000000001 Call trace: arm_smmu_cmdq_issue_cmdlist+0x178/0xa50 __arm_smmu_tlb_inv_range+0x118/0x254 arm_smmu_tlb_inv_range_asid+0x6c/0x130 arm_smmu_mm_invalidate_range+0xa0/0xa4 __mmu_notifier_invalidate_range_end+0x88/0x120 unmap_vmas+0x194/0x1e0 unmap_region+0xb4/0x144 do_mas_align_munmap+0x290/0x490 do_mas_munmap+0xbc/0x124 __vm_munmap+0xa8/0x19c __arm64_sys_munmap+0x28/0x50 invoke_syscall+0x78/0x11c el0_svc_common.constprop.0+0x58/0x1c0 do_el0_svc+0x34/0x60 el0_svc+0x2c/0xd4 el0t_64_sync_handler+0x114/0x140 el0t_64_sync+0x1a4/0x1a8 -------------------------------------------------------------------- Note that since 6.6-rc1 the arm_smmu_mm_invalidate_range above is renamed to "arm_smmu_mm_arch_invalidate_secondary_tlbs", yet the problem remains. The commit `06ff87bae8` ("arm64: mm: remove unused functions and variable protoypes") fixed a similar lockup on the CPU MMU side. Yet, it can occur to SMMU too, since arm_smmu_mm_arch_invalidate_secondary_tlbs() is called typically next to MMU tlb flush function, e.g. tlb_flush_mmu_tlbonly { tlb_flush { __flush_tlb_range { // check MAX_TLBI_OPS } } mmu_notifier_arch_invalidate_secondary_tlbs { arm_smmu_mm_arch_invalidate_secondary_tlbs { // does not check MAX_TLBI_OPS } } } Clone a CMDQ_MAX_TLBI_OPS from the MAX_TLBI_OPS in tlbflush.h, since in an SVA case SMMU uses the CPU page table, so it makes sense to align with the tlbflush code. Then, replace per-page TLBI commands with a single per-asid TLBI command, if the request size hits this threshold. Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Link: https://lore.kernel.org/r/20230920052257.8615-1-nicolinc@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>	2023-09-22 11:15:42 +01:00
Robin Murphy	eb6c97647b	iommu/arm-smmu-v3: Avoid constructing invalid range commands Although io-pgtable's non-leaf invalidations are always for full tables, I missed that SVA also uses non-leaf invalidations, while being at the mercy of whatever range the MMU notifier throws at it. This means it definitely wants the previous TTL fix as well, since it also doesn't know exactly which leaf level(s) may need invalidating, but it can also give us less-aligned ranges wherein certain corners may lead to building an invalid command where TTL, Num and Scale are all 0. It should be fine to handle this by over-invalidating an extra page, since falling back to a non-range command opens up a whole can of errata-flavoured worms. Fixes: `6833b8f2e1` ("iommu/arm-smmu-v3: Set TTL invalidation hint better") Reported-by: Rui Zhu <zhurui3@huawei.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/b99cfe71af2bd93a8a2930f20967fb2a4f7748dd.1694432734.git.robin.murphy@arm.com Signed-off-by: Will Deacon <will@kernel.org>	2023-09-18 10:16:24 +01:00

1 2 3 4 5 ...

4796 Commits