Commit Graph

21048 Commits

Author SHA1 Message Date
Yang Wang
3b99e8e37d drm/amd/pm: correct hwmon power label name
only vangogh has 2 types of hwmon power node: "fastPPT" and "slowPPT",
the other asic only has 1 type of hwmon power node: "PPT".

Signed-off-by: Yang Wang <KevinYang.Wang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:52 -05:00
zhanglianjie
db7b81545f drm/amd/amdgpu/amdgpu_uvd: Fix forgotten unmap buffer object
After the buffer object is successfully mapped,
call amdgpu_bo_kunmap before the function returns.

Signed-off-by: zhanglianjie <zhanglianjie@uniontech.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:52 -05:00
Mukul Joshi
a439b890db drm/amdkfd: Consolidate MQD manager functions
A few MQD manager functions are duplicated for all versions of
MQD manager. Remove this duplication by moving the common
functions into kfd_mqd_manager.c file.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Mukul Joshi
5bdd3eb253 drm/amdkfd: Remove unused old debugger implementation
Cleanup the kfd code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API has been discontinued years ago.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Mukul Joshi
6c1a786773 drm/amdkfd: Fix TLB flushing in KFD SVM with no HWS
With no HWS, TLB flushing will not work in SVM code.
Fix this by calling kfd_flush_tlb() which works for both
HWS and no HWS case.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Yang Wang
de7fbd020d drm/amd/pm: fix hwmon node of power1_label create issue
it will cause hwmon node of power1_label is not created.

v2:
the hwmon node of "power1_label" is always needed for all ASICs.
and the patch will remove ASIC type check for "power1_label".

Fixes: ae07970a06 ("drm/amd/pm: add support for hwmon control of slow and fast PPT limit on vangogh")

Signed-off-by: Yang Wang <KevinYang.Wang@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Yang Wang
40f1dc5223 drm/amd/pm: drm/amd/pm: disable GetPptLimit message in sriov mode
PPT limit cannot be queried from VF

Fixes: f3527a6483 ("drm/amd/pm: Enable sysfs required by rocm-smi tool for One VF mode")
Signed-off-by: Yang Wang <KevinYang.Wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:21 -05:00
Tao Zhou
b1c87b0874 drm/amdkfd: use unmap all queues for poison consumption
Replace reset queue for specific PASID with unmap all queues, reset
queue could break CP scheduler.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 14:15:07 -05:00
Tao Zhou
03e5b167bd drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid
As the function is used in more different cases, use a more general
name.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 14:14:53 -05:00
Alex Deucher
4a5dc6c73d drm/amdgpu: move dpcs_3_0_3 headers from dcn to dpcs
To align with other headers.

Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Alex Deucher
68550cbc61 drm/amdgpu: move dpcs_3_0_0 headers from dcn to dpcs
To align with other headers.

Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Alex Deucher
120cc6e67a drm/amdgpu: add missing license to dpcs_3_0_0 headers
MIT.

Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Alex Deucher
bcfab8e35c drm/amdgpu/display: change pipe policy for DCN 2.0
Fixes hangs on driver load with multiple displays on
DCN 2.0 parts.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=215511
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1877
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1886
Fixes: ee2698cf79 ("drm/amd/display: Changed pipe split policy to allow for multi-display pipe split")
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Alex Deucher
3786a9bc04 drm/amdgpu: drop experimental flag on aldebaran
These have been at production level for a while. Drop
the flag.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Maíra Canal
f54b6bdf42 drm/amd/pm: add missing prototypes to amdgpu_dpm_internal
Include the header with the prototype to silence the following clang
warnings:

drivers/gpu/drm/amd/amdgpu/../pm/amdgpu_dpm_internal.c:29:6: warning: no
previous prototype for function 'amdgpu_dpm_get_active_displays'
[-Wmissing-prototypes]
void amdgpu_dpm_get_active_displays(struct amdgpu_device *adev)
     ^
drivers/gpu/drm/amd/amdgpu/../pm/amdgpu_dpm_internal.c:29:1: note: declare
'static' if the function is not intended to be used outside of this
translation unit
void amdgpu_dpm_get_active_displays(struct amdgpu_device *adev)
^
static
drivers/gpu/drm/amd/amdgpu/../pm/amdgpu_dpm_internal.c:76:5: warning: no
previous prototype for function 'amdgpu_dpm_get_vrefresh'
[-Wmissing-prototypes]
u32 amdgpu_dpm_get_vrefresh(struct amdgpu_device *adev)
    ^
drivers/gpu/drm/amd/amdgpu/../pm/amdgpu_dpm_internal.c:76:1: note: declare
'static' if the function is not intended to be used outside of this
translation unit
u32 amdgpu_dpm_get_vrefresh(struct amdgpu_device *adev)
^
static
2 warnings generated.

Besides that, remove the duplicated prototype of the function
amdgpu_dpm_get_vblank_time in order to keep the consistency of the
headers.

Fixes: 6ddbd37f10 ("drm/amd/pm: optimize the amdgpu_pm_compute_clocks() implementations")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Maíra Canal <maira.canal@usp.br>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Tom Rix
6cbdf12b87 drm/amd/pm: fix error handling
clang static analysis reports this error
amdgpu_smu.c:2289:9: warning: Called function pointer
  is null (null dereference)
        return smu->ppt_funcs->emit_clk_levels(
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is a logic error in the earlier check of
emit_clk_levels.  The error value is set to
the ret variable but ret is never used.  Return
directly and remove the unneeded ret variable.

Fixes: 5d64f9bbb6 ("amdgpu/pm: Implement new API function "emit" that accepts buffer base and write offset")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Christian König
b6fba4ecf3 drm/amdgpu: reserve the pd while cleaning up PRTs
We want to have lockdep annotation here, so make sure that we reserve
the PD while removing PRTs even if it isn't strictly necessary since the
VM object is about to be destroyed anyway.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Christian König
d7d7ddc156 drm/amdgpu: move lockdep assert to the right place.
Since newly added BOs don't have any mappings it's ok to add them
without holding the VM lock. Only when we add per VM BOs the lock is
mandatory.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reported-by: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Martin Tsai
66d58bf73b drm/amd/display: handle null link encoder
[Why]
The link encoder mapping could return a null one and causes system crash.

[How]
Let the mapping can get an available link encoder without endpoint
identification check.

Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Martin Tsai <martin.tsai@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Aric Cyr
ed3a56f9c6 drm/amd/display: 3.2.172
This version brings along the following fixes:
-fix for build failure uninitalized error
-Bug fix for DP2 using uncertified cable
-limit unbounded request to 5k
-fix DP LT sequence on EQ fail
-Bug fixes for S3/S4

Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Aric Cyr <aric.cyr@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:50 -05:00
Anthony Koo
e8dd130bfd drm/amd/display: [FW Promotion] Release 0.0.103.0
Reviewed-by: Aric Cyr <Aric.Cyr@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Anthony Koo <Anthony.Koo@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:49 -05:00
Ilya
39da460fd4 drm/amd/display: Fix DP LT sequence on EQ fail
[Why]
The number of lanes wasn't being reset to maximum when reducing link
rate due to an EQ failure. This could result in having fewer lanes in
the verified link capabilities, a lower maximum link bandwidth, and
fewer modes being supported.

[How]
Reset the number of lanes to max when dropping link rate due to EQ
failure during link training.

Reviewed-by: Aric Cyr <Aric.Cyr@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Ilya <Ilya.Bakoulin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:49 -05:00
Zhan Liu
559e265522 drm/amd/display: keep eDP Vdd on when eDP stream is already enabled
[Why]
Even if can_apply_edp_fast_boot is set to 1 at boot, this flag will
be cleared to 0 at S3 resume.

[How]
Keep eDP Vdd on when eDP stream is already enabled.

Reviewed-by: Charlene Liu <Charlene.Liu@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Zhan Liu <Zhan.Liu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:49 -05:00
Paul Hsieh
98ea24e6c5 drm/amd/display: change fastboot timing validation
[Why]
VBIOS light up eDP with 6bpc but driver use 8bpc without
disable valid stream then re-enable valid stream. Some
panels can't runtime change color depth.

[How]
Change fastboot timing validation function. Not only check
LANE_COUNT, LINK_RATE...etc

Reviewed-by: Anthony Koo <Anthony.Koo@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Paul Hsieh <paul.hsieh@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:49 -05:00
Dmytro Laktyushkin
ca7f9d0aba drm/amd/display: fix yellow carp wm clamping
Fix clamping to match register field size

Reviewed-by: Charlene Liu <Charlene.Liu@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Dmytro Laktyushkin <Dmytro.Laktyushkin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:49 -05:00
Bing Guo
b650075909 drm/amdgpu/display/dc: do blocked MST topology discovery at resume from S3/S4
Why:
When resume from sleep or hiberation, blocked MST Topology discovery might
need to be used.

How:
Added "DETECT_REASON_RESUMEFROMS3S4" to enum dc_detect_reason; use it to
require blocked MST Topology discovery.

Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Bing Guo <Bing.Guo@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:03:35 -05:00
Eric Bernstein
77a35bb579 drm/amd/display: remove static from optc31_set_drr
remove static from optc31_set_drr

Reviewed-by: Nevenko Stupar <Nevenko.Stupar@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Eric Bernstein <eric.bernstein@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:17 -05:00
Dmytro Laktyushkin
3084488a49 drm/amd/display: limit unbounded requesting to 5k
Unbounded requesting is unsupported on pipe split modes
and this change prevents us running into such a situation
with wide modes.

Reviewed-by: Charlene Liu <Charlene.Liu@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Dmytro Laktyushkin <Dmytro.Laktyushkin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:17 -05:00
Nicholas Kazlauskas
6d33f0e820 drm/amd/display: Fix stream->link_enc unassigned during stream removal
[Why]
Found when running igt@kms_atomic.

Userspace attempts to do a TEST_COMMIT when 0 streams which calls
dc_remove_stream_from_ctx. This in turn calls link_enc_unassign
which ends up modifying stream->link = NULL directly, causing the
global link_enc to be removed preventing further link activity
and future link validation from passing.

[How]
We take care of link_enc unassignment at the start of
link_enc_cfg_link_encs_assign so this call is no longer necessary.

Fixes global state from being modified while unlocked.

Reviewed-by: Jimmy Kizito <Jimmy.Kizito@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:17 -05:00
Eric Bernstein
a423746305 drm/amd/display: Fix for variable may be used uninitialized error
[Why]
Build failure due to ‘status’ may be used uninitialized

[How]
Initialize status to LINK_TRAINING_SUCCESS

Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com>
Acked-by: Jasdeep Dhillon <jdhillon@amd.com>
Signed-off-by: Eric Bernstein <eric.bernstein@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Evan Quan
f69c15e15e drm/amd/pm: revise the implementation of smu_cmn_disable_all_features_with_exception
As there is no internal cache for enabled ppfeatures now. Thus the 2nd
parameter will be not needed any more.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Evan Quan
a89ef0448c drm/amd/pm: avoid consecutive retrieving for enabled ppfeatures
As the enabled ppfeatures are just retrieved ahead. We can use
that directly instead of retrieving again and again.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Evan Quan
3c6591e947 drm/amd/pm: drop the cache for enabled ppfeatures
The following scenarios make the driver cache for enabled ppfeatures
outdated and invalid:
  - Other tools interact with PMFW to change the enabled ppfeatures.
  - PMFW may enable/disable some features behind driver's back. E.g.
    for sienna_cichild, on gfxoff entering, PMFW will disable gfx
    related DPM features. All those are performed without driver's
    notice.
Also considering driver does not actually interact with PMFW such
frequently, the benefit brought by such cache is very limited.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Evan Quan
7ade3ca9cd drm/amd/pm: correct the usage for 'supported' member of smu_feature structure
The supported features should be retrieved just after EnableAllDpmFeatures message
complete. And the check(whether some dpm feature is supported) is only needed when we
decide to enable or disable it.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Evan Quan
2d282665d2 drm/amd/pm: update the data type for retrieving enabled ppfeatures
Use uint64_t instead of an array of uint32_t. This can avoid
some non-necessary intermediate uint32_t -> uint64_t conversions.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Evan Quan
5af779adc3 drm/amd/pm: unify the interface for retrieving enabled ppfeatures
Instead of having two which do the same thing.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Evan Quan
bd42571168 drm/amd/pm: correct the way for retrieving enabled ppfeatures on Renoir
As other dGPU asics, Renoir should use smu_cmn_get_enabled_mask() for
that job.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Roman Li
de95753cce drm/amd/display: Cap pflip irqs per max otg number
[Why]
pflip interrupt order are mapped 1 to 1 to otg id.
e.g. if irq_src=26 corresponds to otg0 then 27->otg1, 28->otg2...

Linux DM registers pflip interrupts per number of crtcs.
In fused pipe case crtc numbers can be less than otg id.

e.g. if one pipe out of 3(otg#0-2) is fused adev->mode_info.num_crtc=2
so DM only registers irq_src 26,27.
This is a bug since if pipe#2 remains unfused DM never gets
otg2 pflip interrupt (irq_src=28)
That may results in gfx failure due to pflip timeout.

[How]
Register pflip interrupts per max num of otg instead of num_crtc

Signed-off-by: Roman Li <Roman.Li@amd.com>
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Aaron Liu
29ba7b16b9 drm/amdgpu: check the GART table before invalidating TLB
Bypass group programming (utcl2_harvest) aims to forbid UTCL2 to send
invalidation command to harvested SE/SA. Once invalidation command comes
into harvested SE/SA, SE/SA has no response and system hang.

This patch is to add checking if the GART table is already allocated before
invalidating TLB. The new procedure is as following:
1. Calling amdgpu_gtt_mgr_init() in amdgpu_ttm_init(). After this step GTT
   BOs can be allocated, but GART mappings are still ignored.
2. Calling amdgpu_gart_table_vram_alloc() from the GMC code. This allocates
   the GART backing store.
3. Initializing the hardware, and programming the backing store into VMID0
   for all VMHUBs.
4. Calling amdgpu_gtt_mgr_recover() to make sure the table is updated with
   the GTT allocations done before it was allocated.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Aaron Liu
6d53b115be drm/amdgpu: add utcl2_harvest to gc 10.3.1
Confirmed with hardware team, there is harvesting for gc 10.3.1.

Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
Tao Zhou
4e781873fa drm/amdgpu: fix list add issue in vram reserve
The parameter order in the list_add_tail is incorrect, it causes the
reuse of ras reserved page.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:01:16 -05:00
yipechai
a50b048276 Revert "drm/amdgpu: Add judgement to avoid infinite loop"
The commit d5e8ff5f7b ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop")
had fixed this defect.

Revert workaround
commit a2170b4af6 ("drm/amdgpu: Add judgement to avoid infinite loop").

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:00:05 -05:00
yipechai
d5e8ff5f7b drm/amdgpu: Fixed the defect of soft lock caused by infinite loop
1. The infinite loop case only occurs on multiple cards support
   ras functions.
2. The explanation of root cause refer to commit 76641cbbf196
   ("drm/amdgpu: Add judgement to avoid infinite loop").
3. Create new node to manage each unique ras instance to guarantee
   each device .ras_list is completely independent.
4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block
   interface for each ras block").
5. The soft locked logs are as follows:
[  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
[  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
[  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
[  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
[  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
[  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
[  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
[  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
[  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
[  262.166267] Call Trace:
[  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[  262.166529]  ? psi_task_switch+0xd2/0x250
[  262.166537]  ? __switch_to+0x11d/0x460
[  262.166542]  ? __switch_to_asm+0x36/0x70
[  262.166549]  process_one_work+0x220/0x3c0
[  262.166556]  worker_thread+0x4d/0x3f0
[  262.166560]  ? process_one_work+0x3c0/0x3c0
[  262.166563]  kthread+0x12b/0x150
[  262.166568]  ? set_kthread_struct+0x40/0x40
[  262.166571]  ret_from_fork+0x22/0x30

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Luben Tuikov
00d6936dbd drm/amdgpu: Set FRU bus for Aldebaran and Vega 20
The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20
ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU
is sought through that bus "pointer" and not through the RAS bus "pointer".

Cc: Roy Sun <Roy.Sun@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Fixes: 2f60dd5076 ("drm/amd: Expose the FRU SMU I2C bus")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
447c7997b6 drm/amdgpu: Fix recursive locking warning
Noticed the below warning while running a pytorch workload on vega10
GPUs. Change to trylock to avoid conflicts with already held reservation
locks.

[  +0.000003] WARNING: possible recursive locking detected
[  +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
[  +0.000004] --------------------------------------------
[  +0.000002] python/4822 is trying to acquire lock:
[  +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000203]
              but task is already holding lock:
[  +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[  +0.000017]
              other info that might help us debug this:
[  +0.000002]  Possible unsafe locking scenario:

[  +0.000003]        CPU0
[  +0.000002]        ----
[  +0.000002]   lock(reservation_ww_class_mutex);
[  +0.000004]   lock(reservation_ww_class_mutex);
[  +0.000003]
               *** DEADLOCK ***

[  +0.000002]  May be due to missing lock nesting notation

[  +0.000003] 7 locks held by python/4822:
[  +0.000003]  #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
[  +0.000232]  #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
[  +0.000241]  #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
[  +0.000236]  #3: ffffb2b35606fd28
(reservation_ww_class_acquire){+.+.}-{0:0}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
[  +0.000235]  #4: ffff932cbb7181f8
(reservation_ww_class_mutex){+.+.}-{3:3}, at:
ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[  +0.000015]  #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
drm_dev_enter+0x5/0xa0 [drm]
[  +0.000038]  #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
[  +0.000195]
              stack backtrace:
[  +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
5.13.0-kfd-rajneesh #1030
[  +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
08/29/2018
[  +0.000003] Call Trace:
[  +0.000003]  dump_stack+0x6d/0x89
[  +0.000010]  __lock_acquire+0xb93/0x1a90
[  +0.000009]  lock_acquire+0x25d/0x2d0
[  +0.000005]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000184]  ? lock_is_held_type+0xa2/0x110
[  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000184]  __ww_mutex_lock.constprop.17+0xca/0x1060
[  +0.000007]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000183]  ? lock_release+0x13f/0x270
[  +0.000005]  ? lock_is_held_type+0xa2/0x110
[  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000183]  amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000185]  ttm_bo_release+0x4c6/0x580 [ttm]
[  +0.000010]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  +0.000183]  amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
[  +0.000189]  amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
[  +0.000189]  amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
[  +0.000191]  amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
[  +0.000191]  amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
[  +0.000191]  update_gpuvm_pte+0xcc/0x290 [amdgpu]
[  +0.000229]  ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
[  +0.000190]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
[amdgpu]
[  +0.000234]  kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
[  +0.000218]  kfd_ioctl+0x2b9/0x600 [amdgpu]
[  +0.000216]  ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
[  +0.000216]  ? lock_release+0x13f/0x270
[  +0.000006]  ? __fget_files+0x107/0x1e0
[  +0.000007]  __x64_sys_ioctl+0x8b/0xd0
[  +0.000007]  do_syscall_64+0x36/0x70
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000007] RIP: 0033:0x7fbff90a7317
[  +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
[  +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
00007fbff90a7317
[  +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
0000000000000004
[  +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
00007fbcc402d880
[  +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
00000000c0184b18
[  +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
00007fbcc402d820

Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Luben Tuikov
00b14ce075 drm/amdgpu: Prevent random memory access in FRU code
Prevent random memory access in the FRU EEPROM code by passing the size of
the destination buffer to the reading routine, and reading no more than the
size of the buffer.

Cc: Kent Russell <kent.russell@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Luben Tuikov
3f3a24a0a3 drm/amdgpu: Don't offset by 2 in FRU EEPROM
Read buffers no longer expose the I2C address, and so we don't need to
offset by two when we get the read data.

Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Fixes: bd607166af ("drm/amdgpu: Enable reading FRU chip via I2C v3")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Luben Tuikov
3f1e2e9d99 drm/amdgpu: Nerf "buff" to "buf"
Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er).
This is consistent with the rest of the kernel code.

Cc: Kent Russell <kent.russell@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
2a909ae718 drm/amdkfd: CRIU resume shared virtual memory ranges
In CRIU resume stage, resume all the shared virtual memory ranges from
the data stored inside the resuming kfd process during CRIU restore
phase. Also setup xnack mode and free up the resources.

KFD_IOCTL_SVM_ATTR_CLR_FLAGS is not available for querying via get_attr
interface but we must clear the flags during restore as there might be
some default flags set when the prange is created. Also handle the
invalid PREFETCH atribute values saved during checkpoint by replacing
them with another dummy KFD_IOCTL_SVM_ATTR_SET_FLAGS attribute.

(rajneesh: Fixed the checkpatch reported problems)
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Rajneesh Bhardwaj
c2db32ce77 drm/amdkfd: CRIU prepare for svm resume
During CRIU restore phase, the VMAs for the virtual address ranges are
not at their final location yet so in this stage, only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00