If xnack is on, VM retry fault interrupt send to IH ring1, and ring1
will be full quickly. IH cannot receive other interrupts, this causes
deadlock if migrating buffer using sdma and waiting for sdma done
while handling retry fault.
Remove VMC from IH storm client, enable ring1 write pointer
overflow, then IH will drop retry fault interrupts and be able to receive
other interrupts while driver is handling retry fault.
IH ring1 write pointer doesn't writeback to memory by IH, and ring1
write pointer recorded by self-irq is not updated, so always read
the latest ring1 write pointer from register.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
unmap range always increase atomic svms->drain_pagefaults to simplify
both parent range and child range unmap, page fault handle ignores the
retry fault if svms->drain_pagefaults is set to speed up interrupt
handling. svm_range_drain_retry_fault restart draining if another
range unmap from cpu.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
VMA may be removed before unmap notifier callback, and deferred list
work remove range, return success for this special case as we are
handling stale retry fault.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
kfd_process_wq_release drain retry fault to ensure no retry fault comes
after removing kfd process from the hash table, otherwise svm page fault
handler will fail to recover the fault and dump GPU vm fault log.
Refactor deferred list work to get_task_mm and take mmap write lock
to handle all ranges, and avoid mm is gone while inserting mmu notifier.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Renoir and newer gfx9 APUs have new TSC register that is
not part of the gfxoff tile, so it can be read without
needing to disable gfx off.
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fixes: 9f4f2c1a35 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
For sriov XGMI configuration, the host driver will handle the hive reset,
so in guest side, the reset_sriov only be called once on one device. This will
make kfd post_reset unblanced with kfd pre_reset since kfd pre_reset already
been moved out of reset_sriov function. Move kfd post_reset out of reset_sriov
function to make them balance .
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[WHY]
Due to pass the wrong parameter down to the enable_stream_gating(),
it would cause the DSC of the removing stream would not be PG.
[HOW]
To pass the correct parameter down th the enable_stream_gating().
Reviewed-by: Anthony Koo <Anthony.Koo@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Yi-Ling Chen <Yi-Ling.Chen2@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
A warning appears in the log on GPU reset for
link_enc_cfg_link_encs_assign for the following condition:
ASSERT(state->res_ctx.link_enc_cfg_ctx.link_enc_assignments[i].valid == false);
This is not expected behavior and may result in link encoders being
incorrectly assigned.
[How]
The dc->current_state is backed up into dm->cached_dc_state before
we commit 0 streams.
DC will clear link encoder assignments on the real state but the
changes won't propagate over to the copy we made before the
0 streams commit.
DC expects that link encoder assignments are *not* valid
when committing a state, so as a workaround it needs to be cleared
before passing it back into DC.
Reviewed-by: Harry Wentland <Harry.Wentland@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
We're only setting the flags on stream[0]'s planes so this logic fails
if we have more than one stream in the state.
This can cause a page flip timeout with multiple displays in the
configuration.
[How]
Index into the stream_status array using the stream index - it's a 1:1
mapping.
Fixes: cdaae8371a ("drm/amd/display: Handle GPU reset for DC block")
Reviewed-by: Harry Wentland <Harry.Wentland@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
The HW interrupt gets disabled after GPU reset so we don't receive
notifications for HPD or AUX from DMUB - leading to timeout and
black screen with (or without) DPIA links connected.
[How]
Re-enable the interrupt after GPU reset like we do for the other
DC interrupts.
Fixes: 81927e2808 ("drm/amd/display: Support for DMUB AUX")
Reviewed-by: Jude Shih <Jude.Shih@amd.com>
Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_amdkfd_gpuvm_free_memory_of_gpu drop dmabuf reference increased in
amdgpu_gem_prime_export.
amdgpu_bo_destroy drop dmabuf reference increased in
amdgpu_gem_prime_import.
So remove this extra dma_buf_put to avoid double free.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Tested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In particular, we need to ensure all the necessary blocks are switched
to 64b mode (a5xx+) otherwise the high bits of the address of the BO to
snapshot state into will be ignored, resulting in:
*** gpu fault: ttbr0=0000000000000000 iova=0000000000012000 dir=READ type=TRANSLATION source=CP (0,0,0,0)
platform 506a000.gmu: [drm:a6xx_gmu_set_oob] *ERROR* Timeout waiting for GMU OOB set BOOT_SLUMBER: 0x0
Fixes: 4f776f4511 ("drm/msm/gpu: Convert the GPU show function to use the GPU state")
Signed-off-by: Rob Clark <robdclark@chromium.org>
Link: https://lore.kernel.org/r/20211108180122.487859-1-robdclark@gmail.com
Signed-off-by: Rob Clark <robdclark@chromium.org>
If you happened to try to access `/dev/drm_dp_aux` devices provided by
the MSM DP AUX driver too early at bootup you could go boom. Let's
avoid that by only allowing AUX transfers when the controller is
powered up.
Specifically the crash that was seen (on Chrome OS 5.4 tree with
relevant backports):
Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 0 PID: 3131 Comm: fwupd Not tainted 5.4.144-16620-g28af11b73efb #1
Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
Call trace:
dump_backtrace+0x0/0x14c
show_stack+0x20/0x2c
dump_stack+0xac/0x124
panic+0x150/0x390
nmi_panic+0x80/0x94
arm64_serror_panic+0x78/0x84
do_serror+0x0/0x118
do_serror+0xa4/0x118
el1_error+0xbc/0x160
dp_catalog_aux_write_data+0x1c/0x3c
dp_aux_cmd_fifo_tx+0xf0/0x1b0
dp_aux_transfer+0x1b0/0x2bc
drm_dp_dpcd_access+0x8c/0x11c
drm_dp_dpcd_read+0x64/0x10c
auxdev_read_iter+0xd4/0x1c4
I did a little bit of tracing and found that:
* We register the AUX device very early at bootup.
* Power isn't actually turned on for my system until
hpd_event_thread() -> dp_display_host_init() -> dp_power_init()
* You can see that dp_power_init() calls dp_aux_init() which is where
we start allowing AUX channel requests to go through.
In general this patch is a bit of a bandaid but at least it gets us
out of the current state where userspace acting at the wrong time can
fully crash the system.
* I think the more proper fix (which requires quite a bit more
changes) is to power stuff on while an AUX transfer is
happening. This is like the solution we did for ti-sn65dsi86. This
might be required for us to move to populating the panel via the
DP-AUX bus.
* Another fix considered was to dynamically register / unregister. I
tried that at <https://crrev.com/c/3169431/3> but it got
ugly. Currently there's a bug where the pm_runtime() state isn't
tracked properly and that causes us to just keep registering more
and more.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
Link: https://lore.kernel.org/r/20211109100403.1.I4e23470d681f7efe37e2e7f1a6466e15e9bb1d72@changeid
Signed-off-by: Rob Clark <robdclark@chromium.org>
In commit 510410bfc0 ("drm/msm: Implement mmap as GEM object
function") we switched to a new/cleaner method of doing things. That's
good, but we missed a little bit.
Before that commit, we used to _first_ run through the
drm_gem_mmap_obj() case where `obj->funcs->mmap()` was NULL. That meant
that we ran:
vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
...and _then_ we modified those mappings with our own. Now that
`obj->funcs->mmap()` is no longer NULL we don't run the default
code. It looks like the fact that the vm_flags got VM_IO / VM_DONTDUMP
was important because we're now getting crashes on Chromebooks that
use ARC++ while logging out. Specifically a crash that looks like this
(this is on a 5.10 kernel w/ relevant backports but also seen on a
5.15 kernel):
Unable to handle kernel paging request at virtual address ffffffc008000000
Mem abort info:
ESR = 0x96000006
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000006
CM = 0, WnR = 0
swapper pgtable: 4k pages, 39-bit VAs, pgdp=000000008293d000
[ffffffc008000000] pgd=00000001002b3003, p4d=00000001002b3003,
pud=00000001002b3003, pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
[...]
CPU: 7 PID: 15734 Comm: crash_dump64 Tainted: G W 5.10.67 #1 [...]
Hardware name: Qualcomm Technologies, Inc. sc7280 IDP SKU2 platform (DT)
pstate: 80400009 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
pc : __arch_copy_to_user+0xc0/0x30c
lr : copyout+0xac/0x14c
[...]
Call trace:
__arch_copy_to_user+0xc0/0x30c
copy_page_to_iter+0x1a0/0x294
process_vm_rw_core+0x240/0x408
process_vm_rw+0x110/0x16c
__arm64_sys_process_vm_readv+0x30/0x3c
el0_svc_common+0xf8/0x250
do_el0_svc+0x30/0x80
el0_svc+0x10/0x1c
el0_sync_handler+0x78/0x108
el0_sync+0x184/0x1c0
Code: f8408423 f80008c3 910020c6 36100082 (b8404423)
Let's add the two flags back in.
While we're at it, the fact that we aren't running the default means
that we _don't_ need to clear out VM_PFNMAP, so remove that and save
an instruction.
NOTE: it was confirmed that VM_IO was the important flag to fix the
problem I was seeing, but adding back VM_DONTDUMP seems like a sane
thing to do so I'm doing that too.
Fixes: 510410bfc0 ("drm/msm: Implement mmap as GEM object function")
Reported-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Tested-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20211110113334.1.I1687e716adb2df746da58b508db3f25423c40b27@changeid
Signed-off-by: Rob Clark <robdclark@chromium.org>
In commit 142639a52a ("drm/msm/a6xx: fix crashstate capture for
A650") we changed a6xx_get_gmu_registers() to read 3 sets of
registers. Unfortunately, we didn't change the memory allocation for
the array. That leads to a KASAN warning (this was on the chromeos-5.4
kernel, which has the problematic commit backported to it):
BUG: KASAN: slab-out-of-bounds in _a6xx_get_gmu_registers+0x144/0x430
Write of size 8 at addr ffffff80c89432b0 by task A618-worker/209
CPU: 5 PID: 209 Comm: A618-worker Tainted: G W 5.4.156-lockdep #22
Hardware name: Google Lazor Limozeen without Touchscreen (rev5 - rev8) (DT)
Call trace:
dump_backtrace+0x0/0x248
show_stack+0x20/0x2c
dump_stack+0x128/0x1ec
print_address_description+0x88/0x4a0
__kasan_report+0xfc/0x120
kasan_report+0x10/0x18
__asan_report_store8_noabort+0x1c/0x24
_a6xx_get_gmu_registers+0x144/0x430
a6xx_gpu_state_get+0x330/0x25d4
msm_gpu_crashstate_capture+0xa0/0x84c
recover_worker+0x328/0x838
kthread_worker_fn+0x32c/0x574
kthread+0x2dc/0x39c
ret_from_fork+0x10/0x18
Allocated by task 209:
__kasan_kmalloc+0xfc/0x1c4
kasan_kmalloc+0xc/0x14
kmem_cache_alloc_trace+0x1f0/0x2a0
a6xx_gpu_state_get+0x164/0x25d4
msm_gpu_crashstate_capture+0xa0/0x84c
recover_worker+0x328/0x838
kthread_worker_fn+0x32c/0x574
kthread+0x2dc/0x39c
ret_from_fork+0x10/0x18
Fixes: 142639a52a ("drm/msm/a6xx: fix crashstate capture for A650")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20211103153049.1.Idfa574ccb529d17b69db3a1852e49b580132035c@changeid
Signed-off-by: Rob Clark <robdclark@chromium.org>
Before the drm driver had support for this file there was a driver that
exposed the contents of the vga password register to userspace. It would
present the entire register instead of interpreting it.
The drm implementation chose to mask of the lower bit, without explaining
why. This breaks the existing userspace, which is looking for 0xa8 in
the lower byte.
Change our implementation to expose the entire register.
Fixes: 696029eb36 ("drm/aspeed: Add sysfs for output settings")
Reported-by: Oskar Senft <osk@google.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au>
Tested-by: Oskar Senft <osk@google.com>
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20211117010145.297253-1-joel@jms.id.au
In function amdgpu_get_xgmi_hive, when kobject_init_and_add failed
There is a potential memleak if not call kobject_put.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Bernard Zhao <bernard@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In SRIOV configuration, the reset may failed to bring asic back to normal but stop cpsch
already been called, the start_cpsch will not be called since there is no resume in this
case. When reset been triggered again, driver should avoid to do uninitialization again.
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
add support that allow the userspace tool like RGP to get the GFX clock
value at runtime, the fix follow the old way to show the min/current/max
clocks level for compatible consideration.
=== Test ===
$ cat /sys/class/drm/card0/device/pp_dpm_sclk
0: 200Mhz *
1: 1100Mhz
2: 1600Mhz
then run stress test on one APU system.
$ cat /sys/class/drm/card0/device/pp_dpm_sclk
0: 200Mhz
1: 1040Mhz *
2: 1600Mhz
The current GFXCLK value is updated at runtime.
BugLink: https://gitlab.freedesktop.org/mesa/mesa/-/issues/5260
Reviewed-by: Huang Ray <Ray.Huang@amd.com>
Signed-off-by: Perry Yuan <Perry.Yuan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
amdgpu_connector_vga_get_modes missed function amdgpu_get_native_mode
which assign amdgpu_encoder->native_mode with *preferred_mode result in
amdgpu_encoder->native_mode.clock always be 0. That will cause
amdgpu_connector_set_property returned early on:
if ((rmx_type != DRM_MODE_SCALE_NONE) &&
(amdgpu_encoder->native_mode.clock == 0))
when we try to set scaling mode Full/Full aspect/Center.
Add the missing function to amdgpu_connector_vga_get_mode can fix this.
It also works on dvi connectors because
amdgpu_connector_dvi_helper_funcs.get_mode use the same method.
Signed-off-by: hongao <hongao@uniontech.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Print Navi1x fine grained clocks in a consistent manner with other SOCs.
Don't show aritificial DPM level when the current clock equals min or max.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Driver initialization is driven by IP version from IP
discovery table. So add error print when failing to add
ip block during driver initialization, this will be more
friendly to user to know which IP version is not correct.
[ 40.467361] [drm] host supports REQ_INIT_DATA handshake
[ 40.474076] [drm] add ip block number 0 <nv_common>
[ 40.474090] [drm] add ip block number 1 <gmc_v10_0>
[ 40.474101] [drm] add ip block number 2 <psp>
[ 40.474103] [drm] add ip block number 3 <navi10_ih>
[ 40.474114] [drm] add ip block number 4 <smu>
[ 40.474119] [drm] add ip block number 5 <amdgpu_vkms>
[ 40.474134] [drm] add ip block number 6 <gfx_v10_0>
[ 40.474143] [drm] add ip block number 7 <sdma_v5_2>
[ 40.474147] amdgpu 0000:00:08.0: amdgpu: Fatal error during GPU init
[ 40.474545] amdgpu 0000:00:08.0: amdgpu: amdgpu: finishing device.
v2: use dev_err to multi-GPU system
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>