Documentation: add a page on amdgpu debugging

Covers GPU page fault debugging and adds a reference to umr. v2: update client ids to include SQC/G v3: Remove duplicate text v4: add umr documentation link, fix typo Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-21 19:41:42 +00:00 · 2024-03-13 13:19:54 -04:00 · 2024-03-13 13:19:54 -04:00 · 23808afc61
commit 23808afc61
parent e58acb7613
2 changed files with 81 additions and 0 deletions
--- a/Documentation/gpu/amdgpu/debugging.rst
+++ b/Documentation/gpu/amdgpu/debugging.rst
@ -0,0 +1,80 @@
+===============
+ GPU Debugging
+===============
+
+GPUVM Debugging
+===============
+
+To aid in debugging GPU virtual memory related problems, the driver supports a
+number of options module parameters:
+
+`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.
+
+`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
+the GPU.
+
+
+Decoding a GPUVM Page Fault
+===========================
+
+If you see a GPU page fault in the kernel log, you can decode it to figure
+out what is going wrong in your application.  A page fault in your kernel
+log may look something like this:
+
+::
+
+ [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
+   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
+ VM_L2_PROTECTION_FAULT_STATUS:0x00301030
+ 	Faulty UTCL2 client ID: TCP (0x8)
+ 	MORE_FAULTS: 0x0
+ 	WALKER_ERROR: 0x0
+ 	PERMISSION_FAULTS: 0x3
+ 	MAPPING_ERROR: 0x0
+ 	RW: 0x0
+
+First you have the memory hub, gfxhub and mmhub.  gfxhub is the memory
+hub used for graphics, compute, and sdma on some chips.  mmhub is the
+memory hub used for multi-media and sdma on some chips.
+
+Next you have the vmid and pasid.  If the vmid is 0, this fault was likely
+caused by the kernel driver or firmware.  If the vmid is non-0, it is generally
+a fault in a user application.  The pasid is used to link a vmid to a system
+process id.  If the process is active when the fault happens, the process
+information will be printed.
+
+The GPU virtual address that caused the fault comes next.
+
+The client ID indicates the GPU block that caused the fault.
+Some common client IDs:
+
+- CB/DB: The color/depth backend of the graphics pipe
+- CPF: Command Processor Frontend
+- CPC: Command Processor Compute
+- CPG: Command Processor Graphics
+- TCP/SQC/SQG: Shaders
+- SDMA: SDMA engines
+- VCN: Video encode/decode engines
+- JPEG: JPEG engines
+
+PERMISSION_FAULTS describe what faults were encountered:
+
+- bit 0: the PTE was not valid
+- bit 1: the PTE read bit was not set
+- bit 2: the PTE write bit was not set
+- bit 3: the PTE execute bit was not set
+
+Finally, RW, indicates whether the access was a read (0) or a write (1).
+
+In the example above, a shader (cliend id = TCP) generated a read (RW = 0x0) to
+an invalid page (PERMISSION_FAULTS = 0x3) at GPU virtual address
+0x0000800102800000.  The user can then inspect their shader code and resource
+descriptor state to determine what caused the GPU page fault.
+
+UMR
+===
+
+`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
+GPU debugging and diagnostics tool.  Please see the umr
+`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
+about its capabilities.
--- a/Documentation/gpu/amdgpu/index.rst
+++ b/Documentation/gpu/amdgpu/index.rst
@ -15,4 +15,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
   ras
   thermal
   driver-misc
+   debugging
   amdgpu-glossary