drm/amdgpu: break driver init process when it's bad GPU(v5)

When retrieving bad gpu tag from eeprom, GPU init should
fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

v4: Fix missing check of returning value of i2c
    initialization error case.

v5: Use dev_err to print PCI information in dmesg instead
    of DRM_ERROR.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This commit is contained in:
Guchun Chen
2020-07-23 15:42:19 +08:00
committed by Alex Deucher
parent 1d6a9d122d
commit b82e65a935
4 changed files with 36 additions and 7 deletions

View File

@@ -1821,6 +1821,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
struct ras_err_handler_data **data;
uint32_t max_eeprom_records_len = 0;
bool exc_err_limit = false;
int ret;
if (con)
@@ -1842,8 +1843,12 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
max_eeprom_records_len = amdgpu_ras_eeprom_get_record_max_length();
amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
if (ret)
ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &exc_err_limit);
/*
* This calling fails when exc_err_limit is true or
* ret != 0.
*/
if (exc_err_limit || ret)
goto free;
if (con->eeprom_control.num_recs) {
@@ -1867,6 +1872,15 @@ free:
out:
dev_warn(adev->dev, "Failed to initialize ras recovery!\n");
/*
* Except error threshold exceeding case, other failure cases in this
* function would not fail amdgpu driver init.
*/
if (!exc_err_limit)
ret = 0;
else
ret = -EINVAL;
return ret;
}