Commit Graph

28 Commits

Author SHA1 Message Date
John Clements
c5a4ef3e20 drm/amdgpu: move umc specific macros to header
certain umc macros are common across umc versions

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-23 10:45:00 -04:00
Guchun Chen
fd90456c75 drm/amdgpu: decouple EccErrCnt query and clear operation
Due to hardware bug that when RSMU UMC index is disabled,
clear EccErrCnt at the first UMC instance will clean up all other
EccErrCnt registes from other instances at the same time. This
will break the correctable error count log in EccErrCnt register
once querying it. So decouple both to make error count query workable.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-27 15:52:10 -04:00
Guchun Chen
40e733147f drm/amdgpu: switch to SMN interface to operate RSMU index mode
This makes consistent with other regsiters' access in this module.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-27 15:52:03 -04:00
John Clements
1a2172b5ee drm/amdgpu: update page retirement sequence
check UMC status and exit prior to making and erroneus register access

this resolved unexpected behaviour with UMC indexing mode broadcasting writes

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-06 14:32:06 -05:00
Guchun Chen
d38c3ac716 drm/amdgpu: toggle DF-Cstate when accessing UMC ras error related registers
On arcturus, DF-Cstate needs to be toggled off/on
before and after accessing UMC error counter and
error address registers, otherwise, clearing such
registers may fail.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-06 14:31:59 -05:00
John Clements
eee2eabafe drm/amdgpu: preserve RSMU UMC index mode state
between UMC RAS err register access restore previous RSMU UMC index mode state

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-14 10:18:10 -05:00
Guchun Chen
5d4667ec33 drm/amdgpu: calculate MCUMC_ADDRT0 per asic's UMC offset
Hardcoded offset is not friendly. And another benifit of this
patch is to keep read and write access to this register be
consistent with other similar UMC regsiters  in this file.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-14 10:18:09 -05:00
John Clements
c8aa6ae30c drm/amdgpu: updated UMC error address record with correct channel index
defined macros for repetitive for loops

Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07 12:02:08 -05:00
John Clements
0ee51f1d94 drm/amdgpu: resolved bug in UMC RAS CE query
switch CE counter register access' to use SMN

disable UMC indexing mode

Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07 12:02:01 -05:00
John Clements
bd68fb94b3 drm/amdgpu: resolve bug in UMC 6 error counter query
iterate over all error counter registers in SMN space

removed support error counter access via MMIO

Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07 11:58:37 -05:00
John Clements
955c712007 drm/amdgpu: update UMC 6.1 RAS error counter register access path
use proper method for SMN register access

Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07 11:57:48 -05:00
Guchun Chen
fb71a336cd drm/amdgpu: move umc offset to one new header file for Arcturus
Code refactor and no functional change.

Fixes: 4cf781c24c ("drm/amdgpu: Added RAS UMC error query support for Arcturus")
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:33:01 -05:00
John Clements
4cf781c24c drm/amdgpu: Added RAS UMC error query support for Arcturus
Updated UMC 6.1 function set to support UMC 6.1.1 and 6.1.2 devices

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-11 15:22:07 -05:00
Tao Zhou
afa44809a4 drm/amdgpu: use GPU PAGE SHIFT for umc retired page
umc retired page belongs to vram and it should be aligned to gpu page
size

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03 09:10:59 -05:00
Tao Zhou
d99659a062 drm/amdgpu: rename umc ras_init to err_cnt_init
this interface is related to specific version of umc, distinguish it
from ras_late_init

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 10:06:20 -05:00
Tao Zhou
86edcc7dba drm/amdgpu: move umc late init from gmc to umc block
umc late init is umc specific, it's more suitable to be put in umc block

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 10:06:03 -05:00
Tao Zhou
87d2b92f1e drm/amdgpu: save umc error records
save umc error records to ras bad page array

v2: add bad pages before gpu reset
v3: add NULL check for adev->umc.funcs

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:50:40 -05:00
Tao Zhou
dd21a572c9 drm/amdgpu: implement UMC 64 bits REG operations
implement 64 bits operations via 32 bits interface

v2: make use of lower_32_bits() and upper_32_bits() macros

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-09 11:17:10 -05:00
Tao Zhou
b1a5895352 drm/amdgpu: update the calc algorithm of umc ecc error count
the initial value of ecc error count can be adjusted

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02 10:30:38 -05:00
Tao Zhou
b7f92097f5 drm/amdgpu: implement umc ras init function
enable umc ce interrupt and initialize ecc error count

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02 10:30:38 -05:00
Tao Zhou
2b671b6049 drm/amdgpu: apply umc_for_each_channel macro to umc_6_1
use umc_for_each_channel to make code simpler

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02 10:30:38 -05:00
Tao Zhou
3aacf4ea11 drm/amdgpu: initialize new parameters and functions for amdgpu_umc structure
add initialization for new members of amdgpu_umc structure

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02 10:30:38 -05:00
Tao Zhou
a55c8d7bda drm/amdgpu: remove the clear of MCA_ADDR
clearing MCA_STATUS is enough to reset the whole MCA, writing zero to
MCA_ADDR is unnecessary

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02 10:30:38 -05:00
Tao Zhou
8c94810357 drm/amdgpu: query umc ras error address
query umc ras error address, translate it to gpu 4k page view
and save it.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:50:17 -05:00
Tao Zhou
c2742aef4d drm/amdgpu: add structures for umc error address translation
add related registers, callback function and channel index table

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:50:11 -05:00
Tao Zhou
f1ed4afa13 drm/amdgpu: update algorithm of umc uncorrectable error counting
remove the check of ErrorCodeExt

v2: refine the if condition for ue counting

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:49:58 -05:00
Tao Zhou
5bbfb64a17 drm/amdgpu: use 64bit operation macros for umc
replace some 32bit macros with 64bit operations to simplify code

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:49:46 -05:00
Hawking Zhang
9884c2b1c3 drm/amdgpu: add umc v6_1 query error count support
Implement umc query_ras_error_count function to support querry
both correctable and uncorrectable error

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-07-31 14:49:16 -05:00