mirror of
https://github.com/torvalds/linux.git
synced 2024-11-28 23:21:31 +00:00
5c20772738
141 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Shivaprasad G Bhat
|
f431a8cde7 |
powerpc/iommu: Reimplement the iommu_table_group_ops for pSeries
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs, their ownership transfer, create/set/unset the TCE tables for the Dynamic DMA wundows(DDW). VFIOS uses these APIs for support on POWER. The commit |
||
Shivaprasad G Bhat
|
35146eadcb |
powerpc/iommu: Move dev_has_iommu_table() to iommu.c
Move function dev_has_iommu_table() to powerpc/kernel/iommu.c as it is going to be used by machine specific iommu code as well in subsequent patches. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923274748.1397.6274953248403106679.stgit@linux.ibm.com |
||
Shivaprasad G Bhat
|
b09c031d94 |
powerpc/iommu: Move pSeries specific functions to pseries/iommu.c
The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c. The pSeries specific table_group_ops are sitting in the generic powerpc file. Move it to where it actually belong(pseries/iommu.c). The functions are currently defined even for CONFIG_PPC_POWERNV which are unused on PowerNV. Only code movement, no functional changes intended. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923269701.1397.15758640002786937132.stgit@linux.ibm.com |
||
Linus Torvalds
|
61307b7be4 |
The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZkgQYwAKCRDdBJ7gKXxA jrdKAP9WVJdpEcXxpoub/vVE0UWGtffr8foifi9bCwrQrGh5mgEAx7Yf0+d/oBZB nvA4E0DcPrUAFy144FNM0NTCb7u9vAw= =V3R/ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ... |
||
Kent Overstreet
|
0069455bcb |
fix missing vmalloc.h includes
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Shivaprasad G Bhat
|
5bd31ab5f7 |
powerpc/iommu: Refactor spapr_tce_platform_iommu_attach_dev()
The patch makes the iommu_group_get() call only when using it thereby avoiding the unnecessary get & put for domain already being set case. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/170800513841.2411.13524607664262048895.stgit@linux.ibm.com |
||
Linus Torvalds
|
ab0a97cffa |
powerpc fixes for 6.8 #4
- Fix a crash when hot adding a PCI device to an LPAR since recent changes. - Fix nested KVM level-2 guest reboot failure due to empty 'arch_compat'. Thanks to: Amit Machhiwal, Aneesh Kumar K.V (IBM), Brian King, Gaurav Batra, Vaibhav Jain. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmXaf8kTHG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgJUyEACK+gmbPJxVwRnt60e7l8SJrjaXE9ZH llI2ffSM63KOL0qeJlWj8EfNcmU3ihCnMpVS4VHAIvEvCq8KBdcHh7EfLmIjMSl2 oHfcb+dpdnWpQUkI2jci6ptxv8Al7hmd5kP6X5TGn0XJaXWatW4tj7QDP574DCZY UfbHGvGnCVmlcxm3xhn7JzJT4Lgj9ehxgZEMHB68qfwU2rJgUXNITBzOsGwfq4c0 wiIdzVGMuJQdaFLt0JLNP2KY8kpVsAeKRYgSsLrkDkp+tKOIPvgeBmC221pOb1Y4 iZbpXR3h06WfFBnA2PJpdfuacN55/gF0U5BnyOj5tB7CTx65oTyX/BJLIocKWAlF hRRPVEYqBH0mZCHXn03+J/A2/iut1XxDlVMpEhjjZjb57CTYCApSpHer6shPw6PV QqG+iAr7lBCZ70qS2m2+mjPhgDf5TNlxlsip7S3xJJpJvScN/JyPDupDF7tQYuHD GjIOm8gSHL0i3F4L3NAZ9FtPRwxXcv85sWCTbE7jsfVOT/kkDGTwhx/vdPIR0la8 agDv/Q0FyyFHWd5IS4oqXFop6P3boulHexdIec/MIhBrPSZ6oG0ySzj3rqdI7VYh 7BYj/cHerKONR/BU6S3vKddh9RDD9Fe5ghEAgng2DxG+Za/m/M2v/76123/ZeZuw qQ8aiIvYOJ4DGA== =+tcX -----END PGP SIGNATURE----- Merge tag 'powerpc-6.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix a crash when hot adding a PCI device to an LPAR since recent changes - Fix nested KVM level-2 guest reboot failure due to empty 'arch_compat' Thanks to Amit Machhiwal, Aneesh Kumar K.V (IBM), Brian King, Gaurav Batra, and Vaibhav Jain. * tag 'powerpc-6.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat' powerpc/pseries/iommu: DLPAR add doesn't completely initialize pci_controller |
||
Gaurav Batra
|
a5c57fd2e9 |
powerpc/pseries/iommu: DLPAR add doesn't completely initialize pci_controller
When a PCI device is dynamically added, the kernel oopses with a NULL pointer dereference: BUG: Kernel NULL pointer dereference on read at 0x00000030 Faulting instruction address: 0xc0000000006bbe5c Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66 Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries NIP: c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8 REGS: c00000009924f240 TRAP: 0300 Not tainted (6.7.0-203405+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002220 XER: 20040006 CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0 ... NIP sysfs_add_link_to_group+0x34/0x94 LR iommu_device_link+0x5c/0x118 Call Trace: iommu_init_device+0x26c/0x318 (unreliable) iommu_device_link+0x5c/0x118 iommu_init_device+0xa8/0x318 iommu_probe_device+0xc0/0x134 iommu_bus_notifier+0x44/0x104 notifier_call_chain+0xb8/0x19c blocking_notifier_call_chain+0x64/0x98 bus_notify+0x50/0x7c device_add+0x640/0x918 pci_device_add+0x23c/0x298 of_create_pci_dev+0x400/0x884 of_scan_pci_dev+0x124/0x1b0 __of_scan_bus+0x78/0x18c pcibios_scan_phb+0x2a4/0x3b0 init_phb_dynamic+0xb8/0x110 dlpar_add_slot+0x170/0x3b8 [rpadlpar_io] add_slot_store.part.0+0xb4/0x130 [rpadlpar_io] kobj_attr_store+0x2c/0x48 sysfs_kf_write+0x64/0x78 kernfs_fop_write_iter+0x1b0/0x290 vfs_write+0x350/0x4a0 ksys_write+0x84/0x140 system_call_exception+0x124/0x330 system_call_vectored_common+0x15c/0x2ec Commit |
||
Linus Torvalds
|
c02197fc90 |
powerpc fixes for 6.8 #3
- Fix ftrace bug on boot caused by exit text sections with -fpatchable-function-entry. - Fix accuracy of stolen time on pseries since the switch to VIRT_CPU_ACCOUNTING_GEN. - Fix a crash in the IOMMU code when doing DLPAR remove. - Set pt_regs->link on scv entry to fix BPF stack unwinding. - Add missing PPC_FEATURE_BOOKE on 64-bit e5500/e6500, which broke gdb. - Fix boot on some 6xx platforms with STRICT_KERNEL_RWX enabled. - Fix build failures with KASAN enabled and 32KB stack size. - Some other minor fixes. Thanks to: Arnd Bergmann, Benjamin Gray, Christophe Leroy, David Engraf, Gaurav Batra, Jason Gunthorpe, Jiangfeng Xiao, Matthias Schiffer, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, R Nageswara Sastry, Shivaprasad G Bhat, Shrikanth Hegde, Spoorthy, Srikar Dronamraju, Venkat Rao Bagalkote. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmXRQ64THG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgNpvD/kBw+HCCTOCIG1R5qW76PE3zek5ikkn TmzmovQv51S2NH/NJ1vuy12/xs7kkiyKMcLi2G5Ua1HVaGtLRwn25pWsJZpWJii+ inBPCn8lRaXiDNqPCtF3xMvypWtLEvoQnUtH9If6XXEfmzo5TfoRuJdH0TF6eEuM A6abVONL7qYt/zGM2RhRrkVexznFk3SfF1UvKoR+6LMGVhgdW66mTKwcEt9KPn2X hOjtBShXQYR315qv3FJQdUQooiwdIqM7IaZf32oFoG1U/iHz+3wzHcG+83iggZEa jQMxthyeFjLWExT8dKwiLrTuaCa8B0bRKLypGcub1yh396/xcHv4KrX8XNJ3nQoL nKcQOcPkcd+ZVAfigu7wGUS12CKmuFLUTXspgGp3CJQKzBUfLMaVqlAY/DnKEgmc stQVi8pOv1puAE3qS2FK7HR0AdLTu0BRTdw8xfTOyfLeoYiGQQRLYnBhxb9HtwW7 HbVjpicE6VSth1BVGfgdmWH0n/a8cuuhXYOGzJ8ug1dCjgZc3zBISVx2B1yortri vypyMhZ8t4i6j8B2fFRSQ1O0PY/0NmoQ6Yg2JIwIjaO5IbWkyI/KjO5VgdZTkbuV 8i4VLBHvSUUQwd1wBLeNQFD9nLnyJAYo7qvvtBCntmUx6ZNrPihXP4fRjz/la5rJ I3xlArKK088RMw== =TiJM -----END PGP SIGNATURE----- Merge tag 'powerpc-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "This is a bit of a big batch for rc4, but just due to holiday hangover and because I didn't send any fixes last week due to a late revert request. I think next week should be back to normal. - Fix ftrace bug on boot caused by exit text sections with '-fpatchable-function-entry' - Fix accuracy of stolen time on pseries since the switch to VIRT_CPU_ACCOUNTING_GEN - Fix a crash in the IOMMU code when doing DLPAR remove - Set pt_regs->link on scv entry to fix BPF stack unwinding - Add missing PPC_FEATURE_BOOKE on 64-bit e5500/e6500, which broke gdb - Fix boot on some 6xx platforms with STRICT_KERNEL_RWX enabled - Fix build failures with KASAN enabled and 32KB stack size - Some other minor fixes Thanks to Arnd Bergmann, Benjamin Gray, Christophe Leroy, David Engraf, Gaurav Batra, Jason Gunthorpe, Jiangfeng Xiao, Matthias Schiffer, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, R Nageswara Sastry, Shivaprasad G Bhat, Shrikanth Hegde, Spoorthy, Srikar Dronamraju, and Venkat Rao Bagalkote" * tag 'powerpc-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach powerpc/pseries: fix accuracy of stolen time powerpc/ftrace: Ignore ftrace locations in exit text sections powerpc/cputable: Add missing PPC_FEATURE_BOOKE on PPC64 Book-E powerpc/kasan: Limit KASAN thread size increase to 32KB Revert "powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add" powerpc: 85xx: mark local functions static powerpc: udbg_memcons: mark functions static powerpc/kasan: Fix addr error caused by page alignment powerpc/6xx: set High BAT Enable flag on G2_LE cores selftests/powerpc/papr_vpd: Check devfd before get_system_loc_code() powerpc/64: Set task pt_regs->link to the LR value on scv entry powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add powerpc/pseries/papr-sysparm: use u8 arrays for payloads |
||
Shivaprasad G Bhat
|
0846dd77c8 |
powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach
The function spapr_tce_platform_iommu_attach_dev() is missing to call
iommu_group_put() when the domain is already set. This refcount leak
shows up with BUG_ON() during DLPAR remove operation as:
KernelBug: Kernel bug in state 'None': kernel BUG at arch/powerpc/platforms/pseries/iommu.c:100!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
<snip>
Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_016) hv:phyp pSeries
NIP: c0000000000ff4d4 LR: c0000000000ff4cc CTR: 0000000000000000
REGS: c0000013aed5f840 TRAP: 0700 Tainted: G I (6.8.0-rc3-autotest-g99bd3cb0d12e)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 44002402 XER: 20040000
CFAR: c000000000a0d170 IRQMASK: 0
...
NIP iommu_reconfig_notifier+0x94/0x200
LR iommu_reconfig_notifier+0x8c/0x200
Call Trace:
iommu_reconfig_notifier+0x8c/0x200 (unreliable)
notifier_call_chain+0xb8/0x19c
blocking_notifier_call_chain+0x64/0x98
of_reconfig_notify+0x44/0xdc
of_detach_node+0x78/0xb0
ofdt_write.part.0+0x86c/0xbb8
proc_reg_write+0xf4/0x150
vfs_write+0xf8/0x488
ksys_write+0x84/0x140
system_call_exception+0x138/0x330
system_call_vectored_common+0x15c/0x2ec
The patch adds the missing iommu_group_put() call.
Fixes:
|
||
Michael Ellerman
|
1fba2bf8e9 |
Revert "powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add"
This reverts commit
|
||
Gaurav Batra
|
ed8b94f6e0 |
powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add
When a PCI device is dynamically added, the kernel oopses with a NULL pointer dereference: BUG: Kernel NULL pointer dereference on read at 0x00000030 Faulting instruction address: 0xc0000000006bbe5c Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66 Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries NIP: c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8 REGS: c00000009924f240 TRAP: 0300 Not tainted (6.7.0-203405+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002220 XER: 20040006 CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0 ... NIP sysfs_add_link_to_group+0x34/0x94 LR iommu_device_link+0x5c/0x118 Call Trace: iommu_init_device+0x26c/0x318 (unreliable) iommu_device_link+0x5c/0x118 iommu_init_device+0xa8/0x318 iommu_probe_device+0xc0/0x134 iommu_bus_notifier+0x44/0x104 notifier_call_chain+0xb8/0x19c blocking_notifier_call_chain+0x64/0x98 bus_notify+0x50/0x7c device_add+0x640/0x918 pci_device_add+0x23c/0x298 of_create_pci_dev+0x400/0x884 of_scan_pci_dev+0x124/0x1b0 __of_scan_bus+0x78/0x18c pcibios_scan_phb+0x2a4/0x3b0 init_phb_dynamic+0xb8/0x110 dlpar_add_slot+0x170/0x3b8 [rpadlpar_io] add_slot_store.part.0+0xb4/0x130 [rpadlpar_io] kobj_attr_store+0x2c/0x48 sysfs_kf_write+0x64/0x78 kernfs_fop_write_iter+0x1b0/0x290 vfs_write+0x350/0x4a0 ksys_write+0x84/0x140 system_call_exception+0x124/0x330 system_call_vectored_common+0x15c/0x2ec Commit |
||
Shivaprasad G Bhat
|
d2d00e1580 |
powerpc: iommu: Bring back table group release_ownership() call
The commit |
||
Linus Torvalds
|
4bbdb725a3 |
IOMMU Updates for Linux v6.7
Including: - Core changes: - Make default-domains mandatory for all IOMMU drivers - Remove group refcounting - Add generic_single_device_group() helper and consolidate drivers - Cleanup map/unmap ops - Scaling improvements for the IOVA rcache depot - Convert dart & iommufd to the new domain_alloc_paging() - ARM-SMMU: - Device-tree binding update: - Add qcom,sm7150-smmu-v2 for Adreno on SM7150 SoC - SMMUv2: - Support for Qualcomm SDM670 (MDSS) and SM7150 SoCs - SMMUv3: - Large refactoring of the context descriptor code to move the CD table into the master, paving the way for '->set_dev_pasid()' support on non-SVA domains - Minor cleanups to the SVA code - Intel VT-d: - Enable debugfs to dump domain attached to a pasid - Remove an unnecessary inline function. - AMD IOMMU: - Initial patches for SVA support (not complete yet) - S390 IOMMU: - DMA-API conversion and optimized IOTLB flushing - Some smaller fixes and improvements -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmVJFcEACgkQK/BELZcB GuMgDxAAsnYVQjQ7wRkwR0rHARuEaJ+Lz2vkLNH+uYXjBzhFe2bT+ykMcZysAkdK A5PMLOFT5Etf+PAqOM0CoIGQFOefAId6uGl7S61Fp9ZWDKhMrOBFWhxGOaufA1Du tNvt3i66hwPSDZa82kY3wRCluYtj0aBBzmM6ZTwBwFZdQ7LABMtE8OxisqncVvq0 H6vhV213fqvhCFSQJ6PnTAEiv70WvWBWygA+Z/gwYf9hypZQae91PNXdK9313a9z OvCzGBkL/R5/3KkJd88UhFwyYzyNGxq/DmH1etawYR5gYZ8UT/Z/sYpcx9hlO7qr eENPqeQc+YHZXpKqkaq66HBA1FSnXUqRZLl4cVaZahRRMe/yArsBM6R0W1AfkMAR rZxwHKoHUWeuHQLMVvmSDNL57h/GJJpTXjRc8HMxLZkVp+ScvnT5XCYHWWzRdCdx TcC/pJ1tet0FQ8rw09ovlwpGVA6eojWvcpVbLVLfGN8ZWViSVfvNFoPNb7HsGK6M iRi+L41Y7s63cyogC/Gsae2RAvYv29ZpvE91lmon2u+VBlTpMdOFX9EhWS6RqOBF cV30bhsw0dyCB7v5jDPtABYEOaR6l1mPLhn1gX3u0Ue/tmPhLX69k4bVWBY6wP3p gmmJD9ub8FuPQtFCGPE7/8ZINjGGrfiKO24DNI2Ty3XEeq21hU4= =UyWC -----END PGP SIGNATURE----- Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: "Core changes: - Make default-domains mandatory for all IOMMU drivers - Remove group refcounting - Add generic_single_device_group() helper and consolidate drivers - Cleanup map/unmap ops - Scaling improvements for the IOVA rcache depot - Convert dart & iommufd to the new domain_alloc_paging() ARM-SMMU: - Device-tree binding update: - Add qcom,sm7150-smmu-v2 for Adreno on SM7150 SoC - SMMUv2: - Support for Qualcomm SDM670 (MDSS) and SM7150 SoCs - SMMUv3: - Large refactoring of the context descriptor code to move the CD table into the master, paving the way for '->set_dev_pasid()' support on non-SVA domains - Minor cleanups to the SVA code Intel VT-d: - Enable debugfs to dump domain attached to a pasid - Remove an unnecessary inline function AMD IOMMU: - Initial patches for SVA support (not complete yet) S390 IOMMU: - DMA-API conversion and optimized IOTLB flushing And some smaller fixes and improvements" * tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (102 commits) iommu/dart: Remove the force_bypass variable iommu/dart: Call apple_dart_finalize_domain() as part of alloc_paging() iommu/dart: Convert to domain_alloc_paging() iommu/dart: Move the blocked domain support to a global static iommu/dart: Use static global identity domains iommufd: Convert to alloc_domain_paging() iommu/vt-d: Use ops->blocked_domain iommu/vt-d: Update the definition of the blocking domain iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain Revert "iommu/vt-d: Remove unused function" iommu/amd: Remove DMA_FQ type from domain allocation path iommu: change iommu_map_sgtable to return signed values iommu/virtio: Add __counted_by for struct viommu_request and use struct_size() iommu/vt-d: debugfs: Support dumping a specified page table iommu/vt-d: debugfs: Create/remove debugfs file per {device, pasid} iommu/vt-d: debugfs: Dump entry pointing to huge page iommu/vt-d: Remove unused function iommu/arm-smmu-v3-sva: Remove bond refcount iommu/arm-smmu-v3-sva: Remove unused iommu_sva handle iommu/arm-smmu-v3: Rename cdcfg to cd_table ... |
||
Jason Gunthorpe
|
e5d8be7406 |
iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain
Following the pattern of identity domains, just assign the BLOCKED domain global statics to a value in ops. Update the core code to use the global static directly. Update powerpc to use the new scheme and remove its empty domain_alloc callback. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Sven Peter <sven@svenpeter.dev> Link: https://lore.kernel.org/r/1-v2-bff223cf6409+282-dart_paging_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de> |
||
Benjamin Gray
|
419d5d112c |
powerpc: Remove extern from function implementations
Sparse reports several function implementations annotated with extern. This is clearly incorrect, likely just copied from an actual extern declaration in another file. Fix the sparse warnings by removing extern. Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231011053711.93427-6-bgray@linux.ibm.com |
||
Jason Gunthorpe
|
a8ca9fc913 |
powerpc/iommu: Do not do platform domain attach atctions after probe
POWER throws a splat at boot, it looks like the DMA ops were probably
changed while a driver was attached. Something is still weird about how
power sequences its bootup. Previously this was hidden since the core
iommu code did nothing during probe, now it calls
spapr_tce_platform_iommu_attach_dev().
Make spapr_tce_platform_iommu_attach_dev() do nothing on the probe time
call like it did before.
WARNING: CPU: 0 PID: 8 at arch/powerpc/kernel/iommu.c:407 __iommu_free+0x1e4/0x1f0
Modules linked in: sd_mod t10_pi crc64_rocksoft crc64 sg ibmvfc mlx5_core(+) scsi_transport_fc ibmveth mlxfw psample dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 0 PID: 8 Comm: kworker/0:0 Not tainted 6.6.0-rc3-next-20230929-auto #1
Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.30 (NH1030_062) hv:phyp pSeries
Workqueue: events work_for_cpu_fn
NIP: c00000000005f6d4 LR: c00000000005f6d0 CTR: 00000000005ca81c
REGS: c000000003a27890 TRAP: 0700 Not tainted (6.6.0-rc3-next-20230929-auto)
MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48000824 XER: 00000008
CFAR: c00000000020f738 IRQMASK: 0
GPR00: c00000000005f6d0 c000000003a27b30 c000000001481800 000000000000017
GPR04: 00000000ffff7fff c000000003a27950 c000000003a27948 0000000000000027
GPR08: c000000c18c07c10 0000000000000001 0000000000000027 c000000002ac8a08
GPR12: 0000000000000000 c000000002ff0000 c00000000019cc88 c000000003042300
GPR16: 0000000000000000 0000000000000000 0000000000000000 c000000003071ab0
GPR20: c00000000349f80d c000000003215440 c000000003215480 61c8864680b583eb
GPR24: 0000000000000000 000000007fffffff 0800000020000000 0000000000000010
GPR28: 0000000000020000 0000800000020000 c00000000c5dc800 c00000000c5dc880
NIP [c00000000005f6d4] __iommu_free+0x1e4/0x1f0
LR [c00000000005f6d0] __iommu_free+0x1e0/0x1f0
Call Trace:
[c000000003a27b30] [c00000000005f6d0] __iommu_free+0x1e0/0x1f0 (unreliable)
[c000000003a27bc0] [c00000000005f848] iommu_free+0x28/0x70
[c000000003a27bf0] [c000000000061518] iommu_free_coherent+0x68/0xa0
[c000000003a27c20] [c00000000005e8d4] dma_iommu_free_coherent+0x24/0x40
[c000000003a27c40] [c00000000024698c] dma_free_attrs+0x10c/0x140
[c000000003a27c90] [c008000000dcb8d4] mlx5_cmd_cleanup+0x5c/0x90 [mlx5_core]
[c000000003a27cc0] [c008000000dc45a0] mlx5_mdev_uninit+0xc8/0x100 [mlx5_core]
[c000000003a27d00] [c008000000dc4ac4] probe_one+0x3ec/0x530 [mlx5_core]
[c000000003a27d90] [c0000000008c5edc] local_pci_probe+0x6c/0x110
[c000000003a27e10] [c000000000189c98] work_for_cpu_fn+0x38/0x60
[c000000003a27e40] [c00000000018d1d0] process_scheduled_works+0x230/0x4f0
[c000000003a27f10] [c00000000018ff14] worker_thread+0x1e4/0x500
[c000000003a27f90] [c00000000019cdb8] kthread+0x138/0x140
[c000000003a27fe0] [c00000000000df98] start_kernel_thread+0x14/0x18
Code: 481b004d 60000000 e89e0028 3c62ffe0 3863dd20 481b0039 60000000 e89e0038 3c62ffe0 3863dd38 481b0025 60000000 <0fe00000> 4bffff20 60000000 3c4c0142
---[ end trace 0000000000000000 ]---
iommu_free: invalid entry
entry = 0x8000000203d0
dma_addr = 0x8000000203d0000
Table = 0xc00000000c5dc800
bus# = 0x1
size = 0x20000
startOff = 0x800000000000
index = 0x70200016
Fixes:
|
||
Jason Gunthorpe
|
2ad56efa80 |
powerpc/iommu: Setup a default domain and remove set_platform_dma_ops
POWER is using the set_platform_dma_ops() callback to hook up its private dma_ops, but this is buired under some indirection and is weirdly happening for a BLOCKED domain as well. For better documentation create a PLATFORM domain to manage the dma_ops, since that is what it is for, and make the BLOCKED domain an alias for it. BLOCKED is required for VFIO. Also removes the leaky allocation of the BLOCKED domain by using a global static. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v8-81230027b2fa+9d-iommu_all_defdom_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de> |
||
Russell Currey
|
c37b6908f7 |
powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
fail_iommu_setup() registers the fail_iommu_bus_notifier struct to both
PCI and VIO buses. struct notifier_block is a linked list node, so this
causes any notifiers later registered to either bus type to also be
registered to the other since they share the same node.
This causes issues in (at least) the vgaarb code, which registers a
notifier for PCI buses. pci_notify() ends up being called on a vio
device, converted with to_pci_dev() even though it's not a PCI device,
and finally makes a bad access in vga_arbiter_add_pci_device() as
discovered with KASAN:
BUG: KASAN: slab-out-of-bounds in vga_arbiter_add_pci_device+0x60/0xe00
Read of size 4 at addr c000000264c26fdc by task swapper/0/1
Call Trace:
dump_stack_lvl+0x1bc/0x2b8 (unreliable)
print_report+0x3f4/0xc60
kasan_report+0x244/0x698
__asan_load4+0xe8/0x250
vga_arbiter_add_pci_device+0x60/0xe00
pci_notify+0x88/0x444
notifier_call_chain+0x104/0x320
blocking_notifier_call_chain+0xa0/0x140
device_add+0xac8/0x1d30
device_register+0x58/0x80
vio_register_device_node+0x9ac/0xce0
vio_bus_scan_register_devices+0xc4/0x13c
__machine_initcall_pseries_vio_device_init+0x94/0xf0
do_one_initcall+0x12c/0xaa8
kernel_init_freeable+0xa48/0xba8
kernel_init+0x64/0x400
ret_from_kernel_thread+0x5c/0x64
Fix this by creating separate notifier_block structs for each bus type.
Fixes:
|
||
Timothy Pearson
|
bfd8d98921 |
powerpc/iommu: Only build sPAPR access functions on pSeries
and PowerNV A build failure with CONFIG_HAVE_PCI=y set without PSERIES or POWERNV set was caught by the random configuration checker. Guard the sPAPR specific IOMMU functions on CONFIG_PPC_PSERIES || CONFIG_PPC_POWERNV. Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/2015925968.3546872.1685990936823.JavaMail.zimbra@raptorengineeringinc.com |
||
Gaurav Batra
|
096339ab84 |
powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs
When DMA window is backed by 2MB TCEs, the DMA address for the mapped
page should be the offset of the page relative to the 2MB TCE. The code
was incorrectly setting the DMA address to the beginning of the TCE
range.
Mellanox driver is reporting timeout trying to ENABLE_HCA for an SR-IOV
ethernet port, when DMA window is backed by 2MB TCEs.
Fixes:
|
||
Jason Gunthorpe
|
ad593827db |
powerpc/iommu: Remove iommu_del_device()
Now that power calls iommu_device_register() and populates its groups
using iommu_ops->device_group it should not be calling
iommu_group_remove_device().
The core code owns the groups and all the other related iommu data, it
will clean it up automatically.
Remove the bus notifiers and explicit calls to
iommu_group_remove_device().
Fixes:
|
||
Alexey Kardashevskiy
|
a940904443 |
powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development added 2 uses of iommu_ops to the generic VFIO which broke POWER: - a coherency capability check; - blocking IOMMU domain - iommu_group_dma_owner_claimed()/... This adds a simple iommu_ops which reports support for cache coherency and provides a basic support for blocking domains. No other domain types are implemented so the default domain is NULL. Since now iommu_ops controls the group ownership, this takes it out of VFIO. This adds an IOMMU device into a pci_controller (=PHB) and registers it in the IOMMU subsystem, iommu_ops is registered at this point. This setup is done in postcore_initcall_sync. This replaces iommu_group_add_device() with iommu_probe_device() as the former misses necessary steps in connecting PCI devices to IOMMU devices. This adds a comment about why explicit iommu_probe_device() is still needed. The previous discussion is here: https://lore.kernel.org/r/20220707135552.3688927-1-aik@ozlabs.ru/ https://lore.kernel.org/r/20220701061751.1955857-1-aik@ozlabs.ru/ Fixes: |
||
Alexey Kardashevskiy
|
9d67c94335 |
powerpc/iommu: Add "borrowing" iommu_table_group_ops
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs: control the ownership, create/set/unset a table the hardware for dynamic DMA windows (DDW). VFIO uses the API to implement support on POWER. So far only PowerNV IODA2 (POWER8 and newer machines) implemented this and other cases (POWER7 or nested KVM) did not and instead reused existing iommu_table structs. This means 1) no DDW 2) ownership transfer is done directly in the VFIO SPAPR TCE driver. Soon POWER is going to get its own iommu_ops and ownership control is going to move there. This implements spapr_tce_table_group_ops which borrows iommu_table tables. The upside is that VFIO needs to know less about POWER. The new ops returns the existing table from create_table() and only checks if the same window is already set. This is only going to work if the default DMA window starts table_group.tce32_start and as big as pe->table_group.tce32_size (not the case for IODA2+ PowerNV). This changes iommu_table_group_ops::take_ownership() to return an error if borrowing a table failed. This should not cause any visible change in behavior for PowerNV. pSeries was not that well tested/supported anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> [mpe: Fix CONFIG_IOMMU_API=n build (skiroot_defconfig), & formatting] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/525438831.16998517.1678123820075.JavaMail.zimbra@raptorengineeringinc.com |
||
Greg Kroah-Hartman
|
b505063910 |
powerpc/iommu: fix memory leak with using debugfs_lookup()
When calling debugfs_lookup() the result must have dput() called on it, otherwise the memory will leak over time. To make things simpler, just call debugfs_lookup_and_remove() instead which handles all of the logic at once. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20230202141919.2298821-1-gregkh@linuxfoundation.org |
||
Alexey Kardashevskiy
|
d80f6de9d6 |
powerpc/iommu: Fix iommu_table_in_use for a small default DMA window case
The existing iommu_table_in_use() helper checks if the kernel is using
any of TCEs. There are some reserved TCEs:
1) the very first one if DMA window starts from 0 to avoid having a zero
but still valid DMA handle;
2) it_reserved_start..it_reserved_end to exclude MMIO32 window in case
the default window spans across that - this is the default for the first
DMA window on PowerNV.
When 1) is the case and 2) is not the helper does not skip 1) and returns
wrong status.
This only seems occurring when passing through a PCI device to a nested
guest (not something we support really well) so it has not been seen
before.
This fixes the bug by adding a special case for no MMIO32 reservation.
Fixes:
|
||
Michael Ellerman
|
b104e41cda |
Merge branch 'topic/ppc-kvm' into next
Merge our KVM topic branch. |
||
Alexey Kardashevskiy
|
cad32d9d42 |
KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers
LoPAPR defines guest visible IOMMU with hypercalls to use it - H_PUT_TCE/etc. Implemented first on POWER7 where hypercalls would trap in the KVM in the real mode (with MMU off). The problem with the real mode is some memory is not available and some API usage crashed the host but enabling MMU was an expensive operation. The problems with the real mode handlers are: 1. Occasionally these cannot complete the request so the code is copied+modified to work in the virtual mode, very little is shared; 2. The real mode handlers have to be linked into vmlinux to work; 3. An exception in real mode immediately reboots the machine. If the small DMA window is used, the real mode handlers bring better performance. However since POWER8, there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE) which virtual mode handler is even closer to its real mode version. On POWER9 hypercalls trap straight to the virtual mode so the real mode handlers never execute on POWER9 and later CPUs. So with the current use of the DMA windows and MMU improvements in POWER9 and later, there is no point in duplicating the code. The 32bit passed through devices may slow down but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demostrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and related code from the powernv platform. This updates the list of implemented hcalls in KVM-HV as the realmode handlers are removed. This changes ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220506053755.3820702-1-aik@ozlabs.ru |
||
Christophe Leroy
|
86c38fec69 |
powerpc: Remove asm/prom.h from all files that don't need it
Several files include asm/prom.h for no reason. Clean it up. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> [mpe: Drop change to prom_parse.c as reported by lkp@intel.com] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/7c9b8fda63dcf63e1b28f43e7ebdb95182cbc286.1646767214.git.christophe.leroy@csgroup.eu |
||
Linus Torvalds
|
7cca308cfd |
powerpc updates for 5.15
- Convert pseries & powernv to use MSI IRQ domains. - Rework the pseries CPU numbering so that CPUs that are removed, and later re-added, are given a CPU number on the same node as previously, when possible. - Add support for a new more flexible device-tree format for specifying NUMA distances. - Convert powerpc to GENERIC_PTDUMP. - Retire sbc8548 and sbc8641d board support. - Various other small features and fixes. Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Anton Blanchard, Cédric Le Goater, Christophe Leroy, Emmanuel Gil Peyrot, Fabiano Rosas, Fangrui Song, Finn Thain, Gautham R. Shenoy, Hari Bathini, Joel Stanley, Jordan Niethe, Kajol Jain, Laurent Dufour, Leonardo Bras, Lukas Bulwahn, Marc Zyngier, Masahiro Yamada, Michal Suchanek, Nathan Chancellor, Nicholas Piggin, Parth Shah, Paul Gortmaker, Pratik R. Sampat, Randy Dunlap, Sebastian Andrzej Siewior, Srikar Dronamraju, Wan Jiabing, Xiongwei Song, Zheng Yongjun. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmEyHTYTHG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDo3D/9aXMVP2wsEMNB0XhTiJ1UUdi311Uq9 PvkAaGZH14ZqZLVigeiD3gt6YzTH0cEuGj6qgwsJrPDjF8FESnMbBsprMLr5/qE1 itWRGMAMCFaeTcB9ogYVJkzwg6RN2ZgIqoq4NVswNSXoAQGWb+1bvXq3RnXXNuGR TQmLL02poNC6nX0YbRaQoT1Xx4nfUTiKHhU+Aok9uOCMJIyYZVATR6Qafb7/j7tO UvjwOHztbu84lcJOGmSnw4LcmwNORLuP9IwR0r+O1M3ijEZqDo9TPkvtSz8HZwjU mxdJwhrUmN0euMcghuiFxW+1XG2eM49ugsdJugiezG2RaIijbIp0nAIvdeaKAgT1 OSSwvWCQ0fkTPyLXE+O6tVqMhlUMdqQlRcyNwmN9svIip9VnwGNq3vA4ePlJm6Fi i0i/tLqVNlJwFokZ7blW5g8SRgGRuFfXd5XUYLFvy5Teez+/7b1mW95gPQZSJ8kV Tbx2e0nHAPX4hCAxJ1AB3/zTlnjY+4+WJ9bD5XdgXkeVE8PPh1BEkulhMi1R1OMj 57D1W6OgsBu/Pze78wjAvwO8+NAb1T/2mv2Bd/LY6Q+7hNDqOOhuajyBTxbH41FG sqx5bKjKOwgTybfV9A0Eo0e4FQBX07yXltBFHaPlyA4sOsIhM59+PxNrEwN1eZrQ LVVsdBXg8pHxrw== =EbN0 -----END PGP SIGNATURE----- Merge tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Convert pseries & powernv to use MSI IRQ domains. - Rework the pseries CPU numbering so that CPUs that are removed, and later re-added, are given a CPU number on the same node as previously, when possible. - Add support for a new more flexible device-tree format for specifying NUMA distances. - Convert powerpc to GENERIC_PTDUMP. - Retire sbc8548 and sbc8641d board support. - Various other small features and fixes. Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Anton Blanchard, Cédric Le Goater, Christophe Leroy, Emmanuel Gil Peyrot, Fabiano Rosas, Fangrui Song, Finn Thain, Gautham R. Shenoy, Hari Bathini, Joel Stanley, Jordan Niethe, Kajol Jain, Laurent Dufour, Leonardo Bras, Lukas Bulwahn, Marc Zyngier, Masahiro Yamada, Michal Suchanek, Nathan Chancellor, Nicholas Piggin, Parth Shah, Paul Gortmaker, Pratik R. Sampat, Randy Dunlap, Sebastian Andrzej Siewior, Srikar Dronamraju, Wan Jiabing, Xiongwei Song, and Zheng Yongjun. * tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (154 commits) powerpc/bug: Cast to unsigned long before passing to inline asm powerpc/ptdump: Fix generic ptdump for 64-bit KVM: PPC: Fix clearing never mapped TCEs in realmode powerpc/pseries/iommu: Rename "direct window" to "dma window" powerpc/pseries/iommu: Make use of DDW for indirect mapping powerpc/pseries/iommu: Find existing DDW with given property name powerpc/pseries/iommu: Update remove_dma_window() to accept property name powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw() powerpc/pseries/iommu: Allow DDW windows starting at 0x00 powerpc/pseries/iommu: Add ddw_list_new_entry() helper powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper powerpc/kernel/iommu: Add new iommu_table_in_use() helper powerpc/pseries/iommu: Replace hard-coded page shift powerpc/numa: Update cpu_cpu_map on CPU online/offline powerpc/numa: Print debug statements only when required powerpc/numa: convert printk to pr_xxx powerpc/numa: Drop dbg in favour of pr_debug powerpc/smp: Enable CACHE domain for shared processor powerpc/smp: Update cpu_core_map on all PowerPc systems ... |
||
Leonardo Bras
|
3c33066a21 |
powerpc/kernel/iommu: Add new iommu_table_in_use() helper
Having a function to check if the iommu table has any allocation helps deciding if a tbl can be reset for using a new DMA window. It should be enough to replace all instances of !bitmap_empty(tbl...). iommu_table_in_use() skips reserved memory, so we don't need to worry about releasing it before testing. This causes iommu_table_release_pages() to become unnecessary, given it is only used to remove reserved memory for testing. Also, only allow storing reserved memory values in tbl if they are valid in the table, so there is no need to check it in the new helper. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210817063929.38701-3-leobras.c@gmail.com |
||
Logan Gunthorpe
|
eb86ef3b2d |
powerpc/iommu: don't set failed sg dma_address to DMA_MAPPING_ERROR
Setting the ->dma_address to DMA_MAPPING_ERROR is not part of the ->map_sg calling convention, so remove it. Link: https://lore.kernel.org/linux-mips/20210716063241.GC13345@lst.de/ Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Geoff Levand <geoff@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> |
||
Martin Oliveira
|
c4e0e892ab |
powerpc/iommu: return error code from .map_sg() ops
The .map_sg() op now expects an error code instead of zero on failure. Propagate the error up if vio_dma_iommu_map_sg() fails. ppc_iommu_map_sg() may fail either because of iommu_range_alloc() or because of tbl->it_ops->set(). The former only supports returning an error with DMA_MAPPING_ERROR and an examination of the latter indicates that it may return arch-specific errors (for example, tce_buildmulti_pSeriesLP()). Hence, coalesce all of those errors into -EIO, per the documentation on dma_map_sgtable(). Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Geoff Levand <geoff@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> |
||
Frederic Barrat
|
59cc84c802 |
Revert "powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs"
This reverts commit |
||
Leonardo Bras
|
fc5590fd56 |
powerpc/kernel/iommu: Use largepool as a last resort when !largealloc
As of today, doing iommu_range_alloc() only for !largealloc (npages <= 15) will only be able to use 3/4 of the available pages, given pages on largepool not being available for !largealloc. This could mean some drivers not being able to fully use all the available pages for the DMA window. Add pages on largepool as a last resort for !largealloc, making all pages of the DMA window available. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210318174414.684630-2-leobras.c@gmail.com |
||
Leonardo Bras
|
3c0468d445 |
powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs
Currently both iommu_alloc_coherent() and iommu_free_coherent() align the desired allocation size to PAGE_SIZE, and gets system pages and IOMMU mappings (TCEs) for that value. When IOMMU_PAGE_SIZE < PAGE_SIZE, this behavior may cause unnecessary TCEs to be created for mapping the whole system page. Example: - PAGE_SIZE = 64k, IOMMU_PAGE_SIZE() = 4k - iommu_alloc_coherent() is called for 128 bytes - 1 system page (64k) is allocated - 16 IOMMU pages (16 x 4k) are allocated (16 TCEs used) It would be enough to use a single TCE for this, so 15 TCEs are wasted in the process. Update iommu_*_coherent() to make sure the size alignment happens only for IOMMU_PAGE_SIZE() before calling iommu_alloc() and iommu_free(). Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift) with IOMMU_PAGE_ALIGN(n, tbl), which is easier to read and does the same. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210318174414.684630-1-leobras.c@gmail.com |
||
Alexey Kardashevskiy
|
cc7130bf11 |
powerpc/iommu: Annotate nested lock for lockdep
The IOMMU table is divided into pools for concurrent mappings and each pool has a separate spinlock. When taking the ownership of an IOMMU group to pass through a device to a VM, we lock these spinlocks which triggers a false negative warning in lockdep (below). This fixes it by annotating the large pool's spinlock as a nest lock which makes lockdep not complaining when locking nested locks if the nest lock is locked already. === WARNING: possible recursive locking detected 5.11.0-le_syzkaller_a+fstn1 #100 Not tainted -------------------------------------------- qemu-system-ppc/4129 is trying to acquire lock: c0000000119bddb0 (&(p->lock)/1){....}-{2:2}, at: iommu_take_ownership+0xac/0x1e0 but task is already holding lock: c0000000119bdd30 (&(p->lock)/1){....}-{2:2}, at: iommu_take_ownership+0xac/0x1e0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(p->lock)/1); lock(&(p->lock)/1); === Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210301063653.51003-1-aik@ozlabs.ru |
||
Alexey Kardashevskiy
|
4be518d838 |
powerpc/iommu: Do not immediately panic when failed IOMMU table allocation
Most platforms allocate IOMMU table structures (specifically it_map) at the boot time and when this fails - it is a valid reason for panic(). However the powernv platform allocates it_map after a device is returned to the host OS after being passed through and this happens long after the host OS booted. It is quite possible to trigger the it_map allocation panic() and kill the host even though it is not necessary - the host OS can still use the DMA bypass mode (requires a tiny fraction of it_map's memory) and even if that fails, the host OS is runnnable as it was without the device for which allocating it_map causes the panic. Instead of immediately crashing in a powernv/ioda2 system, this prints an error and continues. All other platforms still call panic(). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210216033307.69863-3-aik@ozlabs.ru |
||
Alexey Kardashevskiy
|
7f1fa82d79 |
powerpc/iommu: Allocate it_map by vmalloc
The IOMMU table uses the it_map bitmap to keep track of allocated DMA pages. This has always been a contiguous array allocated at either the boot time or when a passed through device is returned to the host OS. The it_map memory is allocated by alloc_pages() which allocates contiguous physical memory. Such allocation method occasionally creates a problem when there is no big chunk of memory available (no free memory or too fragmented). On powernv/ioda2 the default DMA window requires 16MB for it_map. This replaces alloc_pages_node() with vzalloc_node() which allocates contiguous block but in virtual memory. This should reduce changes of failure but should not cause other behavioral changes as it_map is only used by the kernel's DMA hooks/api when MMU is on. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210216033307.69863-2-aik@ozlabs.ru |
||
kernel test robot
|
bbbe563f84 |
powerpc/iommu/debug: fix ifnullfree.cocci warnings
arch/powerpc/kernel/iommu.c:76:2-16: WARNING: NULL check before some freeing functions is not needed.
NULL check before some freeing functions is not needed.
Based on checkpatch warning
"kfree(NULL) is safe this check is probably not required"
and kfreeaddr.cocci by Julia Lawall.
Generated by: scripts/coccinelle/free/ifnullfree.cocci
Fixes:
|
||
Alexey Kardashevskiy
|
691602aab9 |
powerpc/iommu/debug: Add debugfs entries for IOMMU tables
This adds a folder per LIOBN under /sys/kernel/debug/iommu with IOMMU table parameters. This is enabled by CONFIG_IOMMU_DEBUGFS. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210113102014.124452-1-aik@ozlabs.ru |
||
Nicolin Chen
|
1e9d90dbed |
dma-mapping: introduce dma_get_seg_boundary_nr_pages()
We found that callers of dma_get_seg_boundary mostly do an ALIGN with page mask and then do a page shift to get number of pages: ALIGN(boundary + 1, 1 << shift) >> shift However, the boundary might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here into a helper function doing: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 This patch introduces and applies dma_get_seg_boundary_nr_pages() as an overflow-free helper for the dma_get_seg_boundary() callers to get numbers of pages. It also takes care of the NULL dev case for non-DMA API callers. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com> Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Signed-off-by: Christoph Hellwig <hch@lst.de> |
||
Michael Ellerman
|
9044adca78 |
Merge branch 'topic/ppc-kvm' into next
Merge our ppc-kvm topic branch to bring in the Ultravisor support patches. |
||
Alexey Kardashevskiy
|
a102f139aa |
powerpc/powernv/ioda: Remove obsolete iommu_table_ops::exchange callbacks
As now we have xchg_no_kill/tce_kill, these are not used anymore so remove them. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190829085252.72370-6-aik@ozlabs.ru |
||
Alexey Kardashevskiy
|
35872480da |
powerpc/powernv/ioda: Split out TCE invalidation from TCE updates
At the moment updates in a TCE table are made by iommu_table_ops::exchange which update one TCE and invalidates an entry in the PHB/NPU TCE cache via set of registers called "TCE Kill" (hence the naming). Writing a TCE is a simple xchg() but invalidating the TCE cache is a relatively expensive OPAL call. Mapping a 100GB guest with PCI+NPU passed through devices takes about 20s. Thankfully we can do better. Since such big mappings happen at the boot time and when memory is plugged/onlined (i.e. not often), these requests come in 512 pages so we call call OPAL 512 times less which brings 20s from the above to less than 10s. Also, since TCE caches can be flushed entirely, calling OPAL for 512 TCEs helps skiboot [1] to decide whether to flush the entire cache or not. This implements 2 new iommu_table_ops callbacks: - xchg_no_kill() to update a single TCE with no TCE invalidation; - tce_kill() to invalidate multiple TCEs. This uses the same xchg_no_kill() callback for IODA1/2. This implements 2 new wrappers on top of the new callbacks similar to the existing iommu_tce_xchg(). This does not use the new callbacks yet, the next patches will; so this should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190829085252.72370-2-aik@ozlabs.ru |
||
Alexey Kardashevskiy
|
201ed7f327 |
powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages
At the moment we create a small window only for 32bit devices, the window maps 0..2GB of the PCI space only. For other devices we either use a sketchy bypass or hardware bypass but the former can only work if the amount of RAM is no bigger than the device's DMA mask and the latter requires devices to support at least 59bit DMA. This extends the default DMA window to the maximum size possible to allow a wider DMA mask than just 32bit. The default window size is now limited by the the iommu_table::it_map allocation bitmap which is a contiguous array, 1 bit per an IOMMU page. This increases the default IOMMU page size from hard coded 4K to the system page size to allow wider DMA masks. This increases the level number to not exceed the max order allocation limit per TCE level. By the same time, this keeps minimal levels number as 2 in order to save memory. As the extended window now overlaps the 32bit MMIO region, this adds an area reservation to iommu_init_table(). After this change the default window size is 0x80000000000==1<<43 so devices limited to DMA mask smaller than the amount of system RAM can still use more than just 2GB of memory for DMA. This is an optimization and not a bug fix for DMA API usage. With the on-demand allocation of indirect TCE table levels enabled and 2 levels, the first TCE level size is just 1<<ceil((log2(0x7ffffffffff+1)-16)/2)=16384 TCEs or 2 system pages. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190718051139.74787-5-aik@ozlabs.ru |
||
Thomas Gleixner
|
1a59d1b8e0 |
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 59 temple place suite 330 boston ma 02111 1307 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 1334 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Linus Torvalds
|
8e143b90e4 |
IOMMU Updates for Linux v4.21
Including (in no particular order): - Page table code for AMD IOMMU now supports large pages where smaller page-sizes were mapped before. VFIO had to work around that in the past and I included a patch to remove it (acked by Alex Williamson) - Patches to unmodularize a couple of IOMMU drivers that would never work as modules anyway. - Work to unify the the iommu-related pointers in 'struct device' into one pointer. This work is not finished yet, but will probably be in the next cycle. - NUMA aware allocation in iommu-dma code - Support for r8a774a1 and r8a774c0 in the Renesas IOMMU driver - Scalable mode support for the Intel VT-d driver - PM runtime improvements for the ARM-SMMU driver - Support for the QCOM-SMMUv2 IOMMU hardware from Qualcom - Various smaller fixes and improvements -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABAgAGBQJcKkEoAAoJECvwRC2XARrjCCoQAJxsgaAF5Z0s7z8j2A9SkaGp SIMnUAI5mDOdyhTOAI+eehpRzg5UVYt/JjFYnHz8HWqbSc8YOvDvHafmhMFIwYvO hq5knbs6ns2jJNFO+M4dioDq+3THdqkGIF5xoHdGTP7cn9+XyQ8lAoHo0RuL122U PJGqX7Cp4XnFP4HMb3uQYhVeBV7mU+XqAdB+4aDnQkzI5LkQCRr74GcqOm+Rlnyc cmQWc2arUMjgc1TJIrex8dx9dT6lq8kOmhyEg/IjHeGaZyJ3HqA+30XDDLEExN0G MeVawuxJz40HgXlkXr+iZTQtIFYkXdKvJH6rptMbOfbDeDz+YZ01TbtAMMH9o4jX yxjjMjdcWTsWYQ/MHHdsoMP34cajCi/EYPMNksbycw+E3Y+X/bSReCoWC0HUK8/+ Z4TpZ9mZVygtJR+QNZ+pE9oiJpb4sroM10zTnbMoVHNnvfsO01FYk7FMPkolSKLw zB4MDswQYgchoFR9Z4ZB4PycYTzeafLKYgDPDoD1vIJgDavuidwvDWDRTDc+aMWM siIIewq19To9jDJkVjX4dsT/p99KVKgAR/Ps6jjWkAroha7g6GcmlYZHIJnyop04 jiaSXUsk8aRucP/CRz5xdMmaGoN7BsNmpUjcrquc6Povk/6gvXvpY04oCs1+gNMX ipL9E3GTFCVBubRFrksv =DT9A -----END PGP SIGNATURE----- Merge tag 'iommu-updates-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU updates from Joerg Roedel: - Page table code for AMD IOMMU now supports large pages where smaller page-sizes were mapped before. VFIO had to work around that in the past and I included a patch to remove it (acked by Alex Williamson) - Patches to unmodularize a couple of IOMMU drivers that would never work as modules anyway. - Work to unify the the iommu-related pointers in 'struct device' into one pointer. This work is not finished yet, but will probably be in the next cycle. - NUMA aware allocation in iommu-dma code - Support for r8a774a1 and r8a774c0 in the Renesas IOMMU driver - Scalable mode support for the Intel VT-d driver - PM runtime improvements for the ARM-SMMU driver - Support for the QCOM-SMMUv2 IOMMU hardware from Qualcom - Various smaller fixes and improvements * tag 'iommu-updates-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (78 commits) iommu: Check for iommu_ops == NULL in iommu_probe_device() ACPI/IORT: Don't call iommu_ops->add_device directly iommu/of: Don't call iommu_ops->add_device directly iommu: Consolitate ->add/remove_device() calls iommu/sysfs: Rename iommu_release_device() dmaengine: sh: rcar-dmac: Use device_iommu_mapped() xhci: Use device_iommu_mapped() powerpc/iommu: Use device_iommu_mapped() ACPI/IORT: Use device_iommu_mapped() iommu/of: Use device_iommu_mapped() driver core: Introduce device_iommu_mapped() function iommu/tegra: Use helper functions to access dev->iommu_fwspec iommu/qcom: Use helper functions to access dev->iommu_fwspec iommu/of: Use helper functions to access dev->iommu_fwspec iommu/mediatek: Use helper functions to access dev->iommu_fwspec iommu/ipmmu-vmsa: Use helper functions to access dev->iommu_fwspec iommu/dma: Use helper functions to access dev->iommu_fwspec iommu/arm-smmu: Use helper functions to access dev->iommu_fwspec ACPI/IORT: Use helper functions to access dev->iommu_fwspec iommu: Introduce wrappers around dev->iommu_fwspec ... |
||
Linus Torvalds
|
af7ddd8a62 |
DMA mapping updates for Linux 4.21
A huge update this time, but a lot of that is just consolidating or removing code: - provide a common DMA_MAPPING_ERROR definition and avoid indirect calls for dma_map_* error checking - use direct calls for the DMA direct mapping case, avoiding huge retpoline overhead for high performance workloads - merge the swiotlb dma_map_ops into dma-direct - provide a generic remapping DMA consistent allocator for architectures that have devices that perform DMA that is not cache coherent. Based on the existing arm64 implementation and also used for csky now. - improve the dma-debug infrastructure, including dynamic allocation of entries (Robin Murphy) - default to providing chaining scatterlist everywhere, with opt-outs for the few architectures (alpha, parisc, most arm32 variants) that can't cope with it - misc sparc32 dma-related cleanups - remove the dma_mark_clean arch hook used by swiotlb on ia64 and replace it with the generic noncoherent infrastructure - fix the return type of dma_set_max_seg_size (Niklas Söderlund) - move the dummy dma ops for not DMA capable devices from arm64 to common code (Robin Murphy) - ensure dma_alloc_coherent returns zeroed memory to avoid kernel data leaks through userspace. We already did this for most common architectures, but this ensures we do it everywhere. dma_zalloc_coherent has been deprecated and can hopefully be removed after -rc1 with a coccinelle script. -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlwctQgLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYMxgQ//dBpAfS4/J76CdAbYry2zqgcOUU9hIrD6NHiEMWov ltJxyvEl3LsUmIdEj3aCrYL9jZN0qsnCzn5BVj2c3jDIVgD64fAr7HDf/PbEEfKb j6/GgEnVLPZV+sQMvhNA5jOzHrkseaqPa4/pNLFZ/l8jnuZ2d+btusDWJpMoVDer TXVwtIfgeIu0gTygYOShLYXd5qptWKWsZEpbTZOO2sE6+x+ZJX7yQYUxYDTlcOIj JWVO2l5QNHPc5T9o2at+6L5aNUvnZOxT79sWgyZLn0Kc+FagKAVwfLqUEl0v7foG 8k/xca5/8p3afB1DfrIrtplJqis7cVgdyGxriwuuoO8X4F0nPyWwpGmxsBhrWwwl xTqC4UorEJ7QwoP6Azopk/vYI2QXIUBLjuCJCuFXZj9+2BGf4IfvBY1S2cLM9qLs HMcxQonuXJii044KEFS96ePEuiT+igVINweIFBKWcgNCEG0UQtyL6RQ1U5297ipF JiWZAqD+p9X52UdKS+oKfAiZEekMXn6Xyo97+YCiNpfOo0GP5eEcwhL+JpY4AiRq apPXtsRy2o1s8yfjdraUIM2Mc2n62vFKb35oUbGCd/QO9piPrFQHl6T0HHcHk4YR XrUXcHieFZBCYqh7ZVa4RL8Msq1wvGuTL4Dxl43mXdsMoUFRR6eSNWLoAV4IpOLZ WgA= =in72 -----END PGP SIGNATURE----- Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping Pull DMA mapping updates from Christoph Hellwig: "A huge update this time, but a lot of that is just consolidating or removing code: - provide a common DMA_MAPPING_ERROR definition and avoid indirect calls for dma_map_* error checking - use direct calls for the DMA direct mapping case, avoiding huge retpoline overhead for high performance workloads - merge the swiotlb dma_map_ops into dma-direct - provide a generic remapping DMA consistent allocator for architectures that have devices that perform DMA that is not cache coherent. Based on the existing arm64 implementation and also used for csky now. - improve the dma-debug infrastructure, including dynamic allocation of entries (Robin Murphy) - default to providing chaining scatterlist everywhere, with opt-outs for the few architectures (alpha, parisc, most arm32 variants) that can't cope with it - misc sparc32 dma-related cleanups - remove the dma_mark_clean arch hook used by swiotlb on ia64 and replace it with the generic noncoherent infrastructure - fix the return type of dma_set_max_seg_size (Niklas Söderlund) - move the dummy dma ops for not DMA capable devices from arm64 to common code (Robin Murphy) - ensure dma_alloc_coherent returns zeroed memory to avoid kernel data leaks through userspace. We already did this for most common architectures, but this ensures we do it everywhere. dma_zalloc_coherent has been deprecated and can hopefully be removed after -rc1 with a coccinelle script" * tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping: (73 commits) dma-mapping: fix inverted logic in dma_supported dma-mapping: deprecate dma_zalloc_coherent dma-mapping: zero memory returned from dma_alloc_* sparc/iommu: fix ->map_sg return value sparc/io-unit: fix ->map_sg return value arm64: default to the direct mapping in get_arch_dma_ops PCI: Remove unused attr variable in pci_dma_configure ia64: only select ARCH_HAS_DMA_COHERENT_TO_PFN if swiotlb is enabled dma-mapping: bypass indirect calls for dma-direct vmd: use the proper dma_* APIs instead of direct methods calls dma-direct: merge swiotlb_dma_ops into the dma_direct code dma-direct: use dma_direct_map_page to implement dma_direct_map_sg dma-direct: improve addressability error reporting swiotlb: remove dma_mark_clean swiotlb: remove SWIOTLB_MAP_ERROR ACPI / scan: Refactor _CCA enforcement dma-mapping: factor out dummy DMA ops dma-mapping: always build the direct mapping code dma-mapping: move dma_cache_sync out of line dma-mapping: move various slow path functions out of line ... |
||
Alexey Kardashevskiy
|
c4e9d3c1e6 |
powerpc/powernv/pseries: Rework device adding to IOMMU groups
The powernv platform registers IOMMU groups and adds devices to them from the pci_controller_ops::setup_bridge() hook except one case when virtual functions (SRIOV VFs) are added from a bus notifier. The pseries platform registers IOMMU groups from the pci_controller_ops::dma_bus_setup() hook and adds devices from the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier used for powernv does not add devices for pseries though as __of_scan_bus() adds devices first, then it does the bus/dev DMA setup. Both platforms use iommu_add_device() which takes a device and expects it to have a valid IOMMU table struct with an iommu_table_group pointer which in turn points the iommu_group struct (which represents an IOMMU group). Although the helper seems easy to use, it relies on some pre-existing device configuration and associated data structures which it does not really need. This simplifies iommu_add_device() to take the table_group pointer directly. Pseries already has a table_group pointer handy and the bus notified is not used anyway. For powernv, this copies the existing bus notifier, makes it work for powernv only which means an easy way of getting to the table_group pointer. This was tested on VFs but should also support physical PCI hotplug. Since iommu_add_device() receives the table_group pointer directly, pseries does not do TCE cache invalidation (the hypervisor does) nor allow multiple groups per a VFIO container (in other words sharing an IOMMU table between partitionable endpoints), this removes iommu_table_group_link from pseries. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> |