Use HMM helper function hmm_vma_fault() to get physical pages backing
userptr and start CPU page table update track of those pages. Then use
hmm_vma_range_done() to check if those pages are updated before
amdgpu_cs_submit for gfx or before user queues are resumed for kfd.
If userptr pages are updated, for gfx, amdgpu_cs_ioctl will restart
from scratch, for kfd, restore worker is rescheduled to retry.
HMM simplify the CPU page table concurrent update check, so remove
guptasklock, mmu_invalidations, last_set_pages fields from
amdgpu_ttm_tt struct.
HMM does not pin the page (increase page ref count), so remove related
operations like release_pages(), put_page(), mark_page_dirty().
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Replace our MMU notifier with hmm_mirror_ops.sync_cpu_device_pagetables
callback. Enable CONFIG_HMM and CONFIG_HMM_MIRROR as a dependency in
DRM_AMDGPU_USERPTR Kconfig.
It supports both KFD userptr and gfx userptr paths.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There is a bug found in vml2 xgmi logic:
mtype is always sent as NC on the VMC to TC interface for a page walk,
regardless of whether the request is being sent to local or remote GPU.
NC means non-coherent and will cause the VMC return data to be cached
in the TCC (versus UC – uncached will not cache the data). Since the
page table updates are being done by SDMA/HDP, then TCC will never be
updated and the GC VML2 will continue to hit on the TCC and never get
the updated page tables and result in a fault.
Heave weigh tlb invalidation does a WB/INVAL of the L1/L2 GL data
caches so TCC will not be hit on next request
Signed-off-by: shaoyunl <Shaoyun.Liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It does the same thing we were doing already. I though it needed
work for gen3/4 speeds, but that seems to be covered already.
Reviewed-by: Evan Quan <evan.quan@amd.com>
Acked-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1: replace casting to unsigned long with div64_ul
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Slava Abramov <slava.abramov@amd.com>
Tested-by: Slava Abramov <slava.abramov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There is a spelling mistake in a DRM_ERROR error message. Fix this.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If RAS or XGMI are enabled, you have to use mode1 reset rather
than BACO.
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Support ppfeatures sysfs interface on Vega20 sw smu routine.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Firmware versions can be found as separate sysfs files at:
/sys/class/drm/cardX/device/fw_version (where X is the card number)
The firmware versions are displayed in hexadecimal.
v2: Moved sysfs files to subfolder
Signed-off-by: Ori Messinger <ori.messinger@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It should be attached to VCN 1.0
Signed-off-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Since this is VCN specific and only used by VCN
Signed-off-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
it requires to initialize HDP_NONSURFACE_BASE, so as to avoid
using the value left by a previous VM under sriov scenario.
v2: it should not hurt baremetal, generalize it for both sriov
and baremetal
Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Tiecheng Zhou <Tiecheng.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
only report once per TMO job and the timer would
be restarted upon the job finished if it's just slow.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As far as we know this was never used by userspace and so should be removed.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To replace checking ring type and make them generic
Signed-off-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There is no user fence support for VCN
Signed-off-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There is no user fence support for VCE
Signed-off-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There is no user fence support for UVD
Signed-off-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
So we can generalize the no user fence supported engine
Signed-off-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
During S3/S4 bootloader will re-init ras state behind us.
Resume might fail or raise a gpu reset.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
During S3/S4 bootloader will re-init ras state behind us.
Resume might fail or raise a gpu reset.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
During S3/S4 bootloader will re-init ras state behind us.
Resume might fail or raise a gpu reset.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
suspend/resume will change ras state behind us. Let driver get notified.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
add ras suspend function. rename ras_post_init to amdgpu_ras_resume.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
add badpages node.
it will output badpages list in format
gpu pfn : gpu page size : flags
example
0x00000000 : 0x00001000 : R
0x00000001 : 0x00001000 : R
0x00000002 : 0x00001000 : R
0x00000003 : 0x00001000 : R
0x00000004 : 0x00001000 : R
0x00000005 : 0x00001000 : R
0x00000006 : 0x00001000 : R
0x00000007 : 0x00001000 : P
0x00000008 : 0x00001000 : P
0x00000009 : 0x00001000 : P
flags can be one of below characters
R: reserved.
P: pending for reserve.
F: failed to reserve for some reasons.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
During S3 test, when system wake up and resume, ras interface
is already allocated. Move workaround before ras jumps to resume
step in gfx_v9_0_ecc_late_init, and make sure workaround applied
during resume. Also remove unused mmGB_EDC_MODE clearing.
Signed-off-by: James Zhu <James.Zhu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Under Vega10 SR-IOV, with new RLC's new feature, VF should call RLC
to program some registers if supported
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
New feature for RLC, some registers can be programmed by
RLC interface under SR-IOV VF:
WREG32_SOC15_RLC_SHADOW:
1, for GRBM_GFX_CNTL, firstly the new register value should be be
programmed to SCRATCH_REG2
1, for GRBM_GFX_INDEX, firstly the new register value should be be
programmed to SCRATCH_REG3
WREG32_RLC:
for registers supported to be programmed by RLC interface, the
following sequence should be used:
1, write the value to SCRATCH_REG0
2, write reg | 0x80000000 to SCRATCH_REG1
3, write 0x1 to RLC_SPARE_INT to notify RLC
4, polling SCRATCH_REG1 to check if finished
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For Vega10 SR-IOV VF, skip setting some regs due to:
1, host will program them
2, avoid VF register programming violations
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add VMR ring support for Vega10 SR-IOV VF if PSP supported
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
call psp to program ih cntl in SR-IOV if supported
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add implementation to program regs by PSP, currently the following
IH registers are supported:
IH_RB_CNTL
IH_RB_CNTL_RING1
IH_RB_CNTL_RING2
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In order to support new PSP feature that PSP may provide interface
to program IH CNTL register, initialize PSP before IH under Vega10
SR-IOV VF
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Set different register access mode according to the features
provided by firmware
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
ras need initialize proper state after late init
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
request a gpu reset if ras return EAGAIN.
we will run late init again so it is ok to do nothing this time.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
request a gpu reset if ras return EAGAIN.
we will run late init again so it is ok to do nothing this time.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
request a gpu reset if ras return EAGAIN.
we will run late init again so it is ok to do nothing this time.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
add another flag to allow IP do a gpu reset after device init.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Check ras TA error code and return EAGAIN.
Issue ras enable/disable cmd without checking currect state.
Looks like ras TA will handle current state == target state case.
Now driver might need do a reset to satisfy ras TA.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Enable this now to reset the GPU on RAS errors.
This reverts commit 138352e575.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
ras need late init to initialize proper state.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v6: Squash in warning fix (Colin Ian King)
v5: Fix warnings (Alex)
v4: fixed mixed delaration and code warnings and minor errors
v3: exposing df funcs in amdgpu_df_funcs in amdgpu.h
v2: moving permonctl/perfmonctr from default to offset
- adding df perfmonctl and perfmonctr registers for df counters
- adding df funcs to set perfmonctl and get perfmonctr for
df and xgmi counters
- exposing df funcs in amdgpu_df_funcs
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
When RAS is enabled, initializes the VGPRs/LDS/SGPRs and
resets EDC error counts. This is done in late_init, before
RAS TA GFX enable.
Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Need to reserve space for the shared eviction fence when initializing
a KFD VM.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We are getting a dma-buf implementation completely separate from drm prime,
so rename the files now and cleanup the code a bit.
No functional change.
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tom St Denis <tom.stdenis@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix some spacing issues, log output, uses of !=NULL/==NULL, unneeded
extra lines and clean up a declaration from =1 to =true for clarity
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
gmc_v8 is for VI, not CIK, so fix those references
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>