Commit Graph

57 Commits

Author SHA1 Message Date
Andrey Grodzovsky
4d1337d2e9 drm/amdgpu: Avoid RAS recovery init when no RAS support.
Fixes driver load regression on APUs.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 09:59:35 -05:00
Tao Zhou
1a6fc071e1 drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place
ras recovery_init should be called after ttm init,
bad page reserve should be put in front of gpu reset since i2c
may be unstable during gpu reset.
add cleanup for recovery_init and recovery_fini

v2: add more comment and print.
    remove cancel_work_sync in recovery_init.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:50:47 -05:00
Tao Zhou
78ad00c903 drm/amdgpu: Hook EEPROM table to RAS
support eeprom records load and save for ras,
move EEPROM records storing to bad page reserving

v2: remove redundant check for con->eh_data

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:50:32 -05:00
Tao Zhou
9dc23a6325 drm/amdgpu: change ras bps type to eeprom table record structure
change bps type from retired page to eeprom table record, prepare for
saving umc error records to eeprom

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:50:26 -05:00
Andrey Grodzovsky
d5ea093eeb dmr/amdgpu: Add system auto reboot to RAS.
In case of RAS error allow user configure auto system
reboot through ras_ctrl.
This is also part of the temproray work around for the RAS
hang problem.

v4: Use latest kernel API for disk sync.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:41:17 -05:00
Andrey Grodzovsky
7c6e68c777 drm/amdgpu: Avoid HW GPU reset for RAS.
Problem:
Under certain conditions, when some IP bocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.

Temporary fix until proper solution in PSP/SMU is ready:
When uncorrectable error happens the DF will unconditionally
broadcast error event packets to all its clients/slave upon
receiving fatal error event and freeze all its outbound queues,
err_event_athub interrupt  will be triggered.
In such case and we use this interrupt
to issue GPU reset. THe GPU reset code is modified for such case to avoid HW
reset, only stops schedulers, deatches all in progress and not yet scheduled
job's fences, set error code on them and signals.
Also reject any new incoming job submissions from user space.
All this is done to notify the applications of the problem.

v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count

v3:
Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive memebers.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:41:05 -05:00
Hawking Zhang
b293e891b0 drm/amdgpu: add helper function to do common ras_late_init/fini (v3)
In late_init for ras, the helper function will be used to
1). disable ras feature if the IP block is masked as disabled
2). send enable feature command if the ip block was masked as enabled
3). create debugfs/sysfs node per IP block
4). register interrupt handler

v2: check ih_info.cb to decide add interrupt handler or not

v3: add ras_late_fini for cleanup all the ras fs node and remove
interrupt handler

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:11:04 -05:00
Hawking Zhang
4e644fffb5 drm/amdgpu: add ras_controller and err_event_athub interrupt support
Ras controller interrupt and Ras err event athub interrupt are two dedicated
interrupts for RAS support.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:11:04 -05:00
Guchun Chen
64cc5414fb drm/amdgpu: correct ras error count type
Use unsigned long type for the same ras count variable.
This will avoid overflow on 64 bit system.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-23 11:30:32 -05:00
Tao Zhou
5212a3bdf0 drm/amdgpu: remove ras block's feature status info in sysfs
feature mask info is enough for rocm tool,
"cat /sys/class/drm/card0/device/ras/features" will get the
info like this:

feature mask: 0x3ffb

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-12 12:47:48 -05:00
Tao Zhou
9fb2d8de4a drm/amdgpu: support mmhub ras in amdgpu ras
call mmhub ras query/inject in amdgpu ras

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-12 12:47:48 -05:00
Tao Zhou
44494f96ba drm/amdgpu: add sub block parameter in ras inject command
ras sub block index could be passed from shell command

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-12 12:47:48 -05:00
Tao Zhou
2a3c7ff6e3 drm/amdgpu: update ras sysfs feature info
remove confused ras error type info

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-06 13:52:52 -05:00
Tao Zhou
bd2280da46 drm/amdgpu: replace AMDGPU_RAS_UE with AMDGPU_RAS_SUCCESS
ce can also trigger interrupt, and even both ce and ue error can be
found in one ras query, distinguishing between ce and ue in interrupt
handler is uncessary.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Suggested-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02 10:30:39 -05:00
Tao Zhou
51437623a0 drm/amdgpu: support ce interrupt in ras module
correctable error can also trigger interrupt in some ras blocks

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02 10:30:38 -05:00
Tao Zhou
13b7c46c18 drm/amdgpu: add error address query for umc ras
umc error address query can get ce/ue error address and clear error
status

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02 10:30:38 -05:00
Dennis Li
83b0582c90 drm/amdgpu: support gfx ras error injection and err_cnt query
check gfx error count in both ras querry function and
ras interrupt handler.

gfx ras is still disabled by default due to known stability
issue found in gpu reset.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:51:14 -05:00
Tao Zhou
7cdc2ee300 drm/amdgpu: remove ras_reserve_vram in ras injection
error injection address is not in gpu address space

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:50:41 -05:00
Tao Zhou
e10634938b drm/amdgpu: add check for ras error type
only ue and ce errors are supported

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:50:35 -05:00
Tao Zhou
cf04dfd0e9 drm/amdgpu: allow ras interrupt callback to return error data
add error data as parameter for ras interrupt cb and process it

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:50:23 -05:00
Tao Zhou
6f102dba80 drm/amdgpu: add support for recording ras error address
more than one error address may be recorded in one query

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:50:05 -05:00
Tao Zhou
045c021653 drm/amdgpu: switch to amdgpu_umc structure
create new amdgpu_umc structure to for more umc
settings in future and switch to the new structure

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:49:52 -05:00
Tao Zhou
05a58345db drm/amdgpu: add ras error count after each query (v2)
v1: increase ras ce/ue error count
v2: log the number of correctable and uncorrectable errors

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:49:33 -05:00
Hawking Zhang
939e2258ce drm/amdgpu: querry umc error count
check umc error count in both ras querry function and
ras interrupt handler

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:49:28 -05:00
Hawking Zhang
7af25d5b7e drm/amdgpu: move some ras data structure to amdgpu_ras.h
These are common structures that can be included by IP specific
source files

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:48:32 -05:00
Hawking Zhang
33c976c961 drm/amdgpu: drop ras self test
this function is not needed any more. error injection is
the only way to validate ras but it can't be executed in
amdgpu_ras_init, where gpu is even not initialized

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-18 14:18:07 -05:00
Hawking Zhang
a5dd40ca81 drm/amdgpu: only allow error injection to UMC IP block
error injection to other IP blocks (except UMC) will be enabled
until RAS feature stablize on those IP blocks

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-18 14:18:07 -05:00
Hawking Zhang
fb2a36075a drm/amdgpu: do not create ras debugfs/sysfs node for ASICs that don't have ras ability
driver shouldn't init any ras debugfs/sysfs node for ASICs that don't have ras
hardware ability

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-18 14:18:07 -05:00
Alex Deucher
d7929c1e13 Merge branch 'drm-next' into drm-next-5.3
Backmerge drm-next and fix up conflicts due to drmP.h removal.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-06-25 08:42:25 -05:00
xinhui pan
acb05f0a3f drm/amdgpu: Do error injection even vram reserve fails
As long as the address is mapped with vram, we can do an error
injection.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-06-20 11:32:10 -05:00
Daniel Vetter
2454fcea33 drm-misc-next for v5.3:
UAPI Changes:
 
 Cross-subsystem Changes:
 - Add code to signal all dma-fences when freed with pending signals.
 - Annotate reservation object access in CONFIG_DEBUG_MUTEXES
 
 Core Changes:
 - Assorted documentation fixes.
 - Use irqsave/restore spinlock to add crc entry.
 - Move code around to drm_client, for internal modeset clients.
 - Make drm_crtc.h and drm_debugfs.h self-contained.
 - Remove drm_fb_helper_connector.
 - Add bootsplash to todo.
 - Fix lock ordering in pan_display_legacy.
 - Support pinning buffers to current location in gem-vram.
 - Remove the now unused locking functions from gem-vram.
 - Remove the now unused kmap-object argument from vram helpers.
 - Stop checking return value of debugfs_create.
 - Add atomic encoder enable/disable helpers.
 - pass drm_atomic_state to atomic connector check.
 - Add atomic support for bridge enable/disable.
 - Add self refresh helpers to core.
 
 Driver Changes:
 - Add extra delay to make MTP SDM845 work.
 - Small fixes to virtio, vkms, sii902x, sii9234, ast, mcde, analogix, rockchip.
 - Add zpos and ?BGR8888 support to meson.
 - More removals of drm_os_linux and drmP headers for amd, radeon, sti, r128, r128, savage, sis.
 - Allow synopsis to unwedge the i2c hdmi bus.
 - Add orientation quirks for GPD panels.
 - Edid cleanups and fixing handling for edid < 1.2.
 - Add runtime pm to stm.
 - Handle s/r in dw-hdmi.
 - Add hooks for power on/off to dsi for stm.
 - Remove virtio dirty tracking code, done in drm core.
 - Rework BO handling in ast and mgag200.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAl0DYU8ACgkQ/lWMcqZw
 E8NNWw/+MhcRakQmrNDMRIj4DvukzPW2efXbhRFuvthUvVN7rOHMzQZBc3le+gUb
 2GGoEeUYG7XoA/Nj3ZQMUoalrjODywtLClBClC4Blped0mZ4JPiI7bTsrNILn1N1
 hZ0+DbffMCAKqKN8TftK/TrFF9IEM8JSftqD/1RdkiXVcMH3NKuLABHZxzPxH2BH
 XuSqIL5lDyAtanixB53aDf2gw9iipUphYoFlKhdx9dr5Ql96RhiOcDgFhXnFiQu4
 O9z3W6tRI2VPoCzsnhNy3Eah7rBDnZwvyfGa9YU/Q+VeHegb601p8OmNNwpshWE1
 ohixBbADj0dr+K3T/lVW30kovg34i4L5K3O7MR0HxWYSA7+v3AHyG7/GWLxbBNQn
 AFHTRbBph8aP/Dn24ucbKaB7wHi31j7b0Hxj+oJR8RoGhuOYcMOuZrCHqpAxStto
 riSVDCRcq/KcPiuqZZ1UnzFWlQMhNFUwumloPiXFkJ4mcSdK9IbdKBd2eqbRdaU1
 eTOA4istVgNgaNbgLvVB2ltjqXrsdio7/jh6RhobFPqHISiL7iMZg3C/KRBXrkUB
 lYMeGkiE3Wp77zdycdofuEbMfAYUwLts8EYjVsM6xo0BKlBYhpeVuBOYeQEkU7PV
 PpGYqQVeZUoD1OyGlMWIYoyb5Ya7OLUDpooOJdFqoPzUfDki31E=
 =4uQX
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2019-06-14' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for v5.3:

UAPI Changes:

Cross-subsystem Changes:
- Add code to signal all dma-fences when freed with pending signals.
- Annotate reservation object access in CONFIG_DEBUG_MUTEXES

Core Changes:
- Assorted documentation fixes.
- Use irqsave/restore spinlock to add crc entry.
- Move code around to drm_client, for internal modeset clients.
- Make drm_crtc.h and drm_debugfs.h self-contained.
- Remove drm_fb_helper_connector.
- Add bootsplash to todo.
- Fix lock ordering in pan_display_legacy.
- Support pinning buffers to current location in gem-vram.
- Remove the now unused locking functions from gem-vram.
- Remove the now unused kmap-object argument from vram helpers.
- Stop checking return value of debugfs_create.
- Add atomic encoder enable/disable helpers.
- pass drm_atomic_state to atomic connector check.
- Add atomic support for bridge enable/disable.
- Add self refresh helpers to core.

Driver Changes:
- Add extra delay to make MTP SDM845 work.
- Small fixes to virtio, vkms, sii902x, sii9234, ast, mcde, analogix, rockchip.
- Add zpos and ?BGR8888 support to meson.
- More removals of drm_os_linux and drmP headers for amd, radeon, sti, r128, r128, savage, sis.
- Allow synopsis to unwedge the i2c hdmi bus.
- Add orientation quirks for GPD panels.
- Edid cleanups and fixing handling for edid < 1.2.
- Add runtime pm to stm.
- Handle s/r in dw-hdmi.
- Add hooks for power on/off to dsi for stm.
- Remove virtio dirty tracking code, done in drm core.
- Rework BO handling in ast and mgag200.

Tiny conflict in drivers/gpu/drm/amd/display/dc/clk_mgr/clk_mgr.c,
needed #include <linux/slab.h> to make it compile.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/0e01de30-9797-853c-732f-4a5bd6e61445@linux.intel.com
2019-06-14 11:44:24 +02:00
Greg Kroah-Hartman
450f30ea9c amdgpu: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: xinhui pan <xinhui.pan@amd.com>
Cc: Evan Quan <evan.quan@amd.com>
Cc: Feifei Xu <Feifei.Xu@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-06-13 13:59:49 -05:00
Sam Ravnborg
f867723b41 drm/amd: drop use of drmP.h in amdgpu.h
Delete the unused drmP.h from amdgpu.h.
Fix fallout in various files.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20190609220757.10862-5-sam@ravnborg.org
2019-06-10 22:59:53 +02:00
xinhui pan
efb426d581 drm/amdgpu: ras injection use gpu address
injection need a valid gpu address.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-31 10:39:29 -05:00
Tom St Denis
74abc2210e drm/amd/doc: Add RAS documentation to guide
Acked-by: Slava Abramov <slava.abramov@amd.com>
Signed-off-by: Tom St Denis <tom.stdenis@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:46:49 -05:00
Slava Abramov
d6ee400e79 drm/amdgpu: use div64_ul for 32-bit compatibility v1
v1: replace casting to unsigned long with div64_ul

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Slava Abramov <slava.abramov@amd.com>
Tested-by: Slava Abramov <slava.abramov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:21:01 -05:00
xinhui pan
511fdbc33a drm/amdgpu: ras support suspend/resume
add ras suspend function. rename ras_post_init to amdgpu_ras_resume.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:20:51 -05:00
xinhui pan
466b179346 drm/amdgpu: add badpages sysfs interafce
add badpages node.
it will output badpages list in format
gpu pfn : gpu page size : flags

example
0x00000000 : 0x00001000 : R
0x00000001 : 0x00001000 : R
0x00000002 : 0x00001000 : R
0x00000003 : 0x00001000 : R
0x00000004 : 0x00001000 : R
0x00000005 : 0x00001000 : R
0x00000006 : 0x00001000 : R
0x00000007 : 0x00001000 : P
0x00000008 : 0x00001000 : P
0x00000009 : 0x00001000 : P

flags can be one of below characters
R: reserved.
P: pending for reserve.
F: failed to reserve for some reasons.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:20:51 -05:00
xinhui pan
a564808e7f drm/amdgpu: handle ras reset
add another flag to allow IP do a gpu reset after device init.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:20:50 -05:00
xinhui pan
7af23ebe93 drm/amdgpu: Issue ras TA disable/enable cmd forcely on boot
Check ras TA error code and return EAGAIN.
Issue ras enable/disable cmd without checking currect state.
Looks like ras TA will handle current state == target state case.

Now driver might need do a reset to satisfy ras TA.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24 12:20:50 -05:00
xinhui pan
77de502b08 drm/amdgpu: Introduce another ras enable function
Many parts of the whole SW stack can program the ras enablement state
during the boot. Now we handle that case by adding one function which
check the ras flags and choose different code path.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-04-10 13:49:15 -05:00
xinhui pan
191051a1be drm/amdgpu: Make default ras error type to none
Unless IP has implemented its own ras, use ERROR_NONE as the default
type.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-04-10 13:49:08 -05:00
xinhui pan
190211ab75 drm/amdgpu: remove per obj debugfs write
there is ras_control node which can do its job.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-27 22:39:59 -05:00
xinhui pan
828cfa2909 drm/amdgpu: Fix amdgpu ras to ta enums conversion
Add helpes to transalte the two enums. And it will catch bugs
easily.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-27 22:39:52 -05:00
xinhui pan
9f491d731c drm/amdgpu: use macro instead of enum for flags
better to use macro.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-27 22:39:44 -05:00
xinhui pan
73aa8e1a3a drm/amdgpu: Fix some sanity check
ras context might be NULL, so move con->h_data after check !con
also fix sizeof wrong type while at it.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-27 22:39:29 -05:00
kbuild test robot
289d513b17 drm/amdgpu: fix semicolon.cocci warnings
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:405:2-3: Unneeded semicolon
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:435:2-3: Unneeded semicolon

 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

CC: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19 15:36:52 -05:00
xinhui pan
108c6a6309 drm/amdgpu: add new ras workflow control flags
add ras post init function.
Do some initialization after all IP have finished their late init.

Add new member flags which will control the ras work flow.
For now, vbios enable ras for us on boot. That might change in the
future.
So there should be a flag from vbios to tell us if ras is enabled or not
on boot. Looks like there is no such info now.

Other bits of the flags are reserved to control other parts of ras.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19 15:36:52 -05:00
xinhui pan
5d0f903fe2 drm/amdgpu: let ras initialization a little noticeable
add drm info output if ras initialized successfully.
add ras atomfirmware sanity check.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19 15:36:52 -05:00
xinhui pan
163def43e9 drm/amdgpu: Fix lockdep warning more gracely
lockdep need a static key.
Previously we set ignore bit to avoid the warning.
Now call sysfs_attr_init to initialize the static key.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-and-Tested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-03-19 15:36:52 -05:00