Commit Graph

948 Commits

Author SHA1 Message Date
Huang Rui
4f1e9a76bd drm/amdgpu: add van gogh asic_type enum (v2)
This patch adds van gogh to amd_asic_type enum and amdgpu_asic_name[].

v2: add missing comma

Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-05 15:14:02 -04:00
Hawking Zhang
f7ee1874b0 drm/amdgpu: support indirect access reg outside of mmio bar (v2)
support both direct and indirect accessor in unified
helper functions.

v2: Retire indirect mmio access via mm_index/data

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-01 10:42:55 -04:00
Hawking Zhang
1bba36834c drm/amdgpu: add helper function for indirect reg access (v3)
Add helper function in order to remove RREG32/WREG32
in current pcie_rreg/wreg function for soc15 and
onwards adapters.
PCIE_INDEX/DATA pairs are used to access regsiters
outside of mmio bar in the helper functions.
The new helper functions help remove the recursion
of amdgpu_mm_rreg/wreg from pcie_rreg/wreg and
provide the oppotunity to centralize direct and
indirect access in a single function.

v2: Fixed typo and refine the comments

v3: Remove unnecessary volatile local variable

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-01 10:42:13 -04:00
Tiecheng Zhou
b602ca5f31 drm/amdgpu: stop data_exchange work thread before reset
In FLR routine, init_data_exchange is called at reset_sriov
while fini_data_exchange is not. This will duplicating work
thread.

So call fini_data_exchange before reset for SRIOV

Signed-off-by: Tiecheng Zhou <Tiecheng.Zhou@amd.com>
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25 17:03:22 -04:00
Bokun Zhang
519b8b76f0 drm/amdgpu: Implement new guest side VF2PF message transaction (v2)
- Refactor the driver code to use amdgpu_virt_read_pf2vf_data
  and amdgpu_virt_write_vf2pf_data instead of writing all code in
  one function (which is the old amdgpu_virt_init_data_exchange)

- Adding a new transaction method for VF2PF message between host
  and guest driver. Guest side will periodically update VF2PF
  message in the framebuffer.

  In the new header, we include guest ucode information, guest
  framebuffer usage, and engine usage

- Clean up the old macros since they will cause compile error if
  the new transaction method is used

v2: squash in build fix

Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25 17:03:22 -04:00
Alex Deucher
9b498efae2 drm/amdgpu: store noretry parameter per driver instance
This will allow us to have different defaults per asic
in a future patch.

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25 16:55:16 -04:00
Jiansong Chen
84d244a364 drm/amdgpu: remove gpu_info fw support for sienna_cichlid etc.
Remove gpu_info fw support for sienna_cichlid etc., since the
information can be retrieved from discovery binary.

Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25 16:53:17 -04:00
Alex Deucher
4192f7b576 drm/amdgpu: unmap register bar on device init failure
We never unmapped the regiser BAR on failure.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-17 18:00:08 -04:00
Luben Tuikov
5aea5327ea drm/amdgpu: No sysfs, not an error condition
Not being able to create amdgpu sysfs attributes
is not a fatal error warranting not to continue
to try to bring up the display. Thus, if we get
an error trying to create amdgpu sysfs attrs,
report it and continue on to try to bring up
a display.

Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Slava Abramov <slava.abramov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-17 17:56:31 -04:00
Andrey Grodzovsky
7cbbc745dc drm/amdgpu: Minor checkpatch fix
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:25:29 -04:00
Andrey Grodzovsky
6894305c97 drm/amdgpu: Disable DPC for XGMI for now.
XGMI support is more complicated than single device support as
questions of synchronization between the device recovering from
PCI error and other members of the hive are required.
Leaving this for next round.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:25:22 -04:00
Andrey Grodzovsky
7ac71382e9 drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
Reuse exsisting functions from GPU recovery to avoid code
duplications.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:25:16 -04:00
Andrey Grodzovsky
c1dd4aa624 drm/amdgpu: Fix consecutive DPC recovery failures.
Cache the PCI state on boot and before each case where we might
loose it.

v2: Add pci_restore_state while caching the PCI state to avoid
breaking PCI core logic for stuff like suspend/resume.

v3: Extract pci_restore_state from amdgpu_device_cache_pci_state
to avoid superflous restores during GPU resets and suspend/resumes.

v4: Style fixes.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:25:04 -04:00
Andrey Grodzovsky
362c7b91c1 drm/amdgpu: Fix SMU error failure
Wait for HW/PSP initiated ASIC reset to complete before
starting the recovery operations.

v2: Remove typo

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:24:55 -04:00
Andrey Grodzovsky
acd89fca67 drm/amdgpu: Block all job scheduling activity during DPC recovery
DPC recovery involves ASIC reset just as normal GPU recovery so block
SW GPU schedulers and wait on all concurrent GPU resets.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:24:48 -04:00
Andrey Grodzovsky
bf36b52e78 drm/amdgpu: Avoid accessing HW when suspending SW state
At this point the ASIC is already post reset by the HW/PSP
so the HW not in proper state to be configured for suspension,
some blocks might be even gated and so best is to avoid touching it.

v2: Rename in_dpc to more meaningful name

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:24:39 -04:00
Andrey Grodzovsky
c9a6b82f45 drm/amdgpu: Implement DPC recovery
Add PCI Downstream Port Containment (DPC) with
basic recovery functionality

v2: remove pci_save_state to avoid breaking suspend/resume
v3: Fix style comments
v4: Improve description.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15 17:24:32 -04:00
Dave Airlie
0c8d22fcae Merge tag 'amd-drm-next-5.10-2020-09-03' of git://people.freedesktop.org/~agd5f/linux into drm-next
amd-drm-next-5.10-2020-09-03:

amdgpu:
- RAS fixes
- Sienna Cichlid updates
- Navy Flounder updates
- DCE6 (SI) support in DC
- Enable plane rotation
- Rework pre-OS vram reservation handling during driver init
- Add standard interface to dump GPU metrics table from SMU
- Rework tiling and tmz state handling in atomic commits
- Pstate fixes
- Add voltage and power hwmon interfaces for renoir
- SW CTF fixes
- S/G display fix for Raven
- Print client strings for vmfaults for vega and newer
- Manual fan control fixes
- Display updates
- Reorg power management directory structure
- Misc bug fixes
- Misc code cleanups

amdkfd:
- Topology fixes
- Add SMI events for thermal throttling and GPU resets

radeon:
- switch from pci_* to dma_* for dma allocations
- PLL fix

Scheduler:
- Clean up priority levels

UAPI:
- amdgpu INFO IOCTL query update for TMZ state
  https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6049
- amdkfd SMI event interface updates
  https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/therm_thrott

From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200903222921.4152-1-alexander.deucher@amd.com
2020-09-08 16:40:13 +10:00
Luben Tuikov
1625951a3a drm/amdgpu: Remove superfluous NULL check
The DRM device is a static member of
the amdgpu device structure and as such
always exists, so long as the PCI and
thus the amdgpu device exist.

Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-03 14:47:55 -04:00
Dennis Li
81202807ae drm/amdgpu: block ring buffer access during GPU recovery
When GPU is in reset, its status isn't stable and ring buffer also need
be reset when resuming. Therefore driver should protect GPU recovery
thread from ring buffer accessed by other threads. Otherwise GPU will
randomly hang during recovery.

v2: correct indent

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-03 14:46:55 -04:00
Nirmoy Das
e230ac1118 drm/amdgpu: fix compiler warnings
Fixes below compiler warnings:
 CC [M]  drivers/gpu/drm/amd/amdgpu/amdgpu_device.o
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:381:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration]
  381 | void static inline amdgpu_mm_wreg_mmio(struct amdgpu_device *adev, uint32_t reg, uint32_t v, uint32_t acc_flags)
      | ^~~~
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:381:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function ‘amdgpu_device_fini’:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3381:6: warning: variable ‘r’ set but not used [-Wunused-but-set-variable]
 3381 |  int r;
      |      ^

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-28 14:00:23 -04:00
Jiawei
4cd2a96d3a drm/amdgpu: simplify hw status clear/set logic
Optimize code to iterate less loops in
amdgpu_device_ip_reinit_early_sriov()

Signed-off-by: Jiawei <Jiawei.Gu@amd.com>
Reviewed-by: Emily.Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-27 10:09:07 -04:00
Alex Deucher
c997e8e26c drm/amdgpu: report DC not supported if virtual display is enabled (v2)
Virtual display is non-atomic so report false to avoid checking
atomic state and other atomic things at runtime.

v2: squash into the sr-iov check

Acked-by: Nirmoy Das <nirmoy.das@amd.com>
Acked-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-26 16:40:19 -04:00
Alex Deucher
4d2997ab21 drm/amdgpu: add a wrapper for atom asic_init
This allows us to add asic specific workarounds for atom
asic init while keeping the adev specifics out of the
atombios parser code.

Acked-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-26 16:40:19 -04:00
Luben Tuikov
8aba21b751 drm/amdgpu: Embed drm_device into amdgpu_device (v3)
a) Embed struct drm_device into struct amdgpu_device.
b) Modify the inline-f drm_to_adev() accordingly.
c) Modify the inline-f adev_to_drm() accordingly.
d) Eliminate the use of drm_device.dev_private,
   in amdgpu.
e) Switch from using drm_dev_alloc() to
   drm_dev_init().
f) Add a DRM driver release function, which frees
   the container amdgpu_device after all krefs on
   the contained drm_device have been released.

v2: Split out adding adev_to_drm() into its own
    patch (previous commit), making this patch
    more succinct and clear. More detailed commit
    description.
v3: squash in fix to call drmm_add_final_kfree()
    to avoid a warning.

Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-26 16:40:17 -04:00
Luben Tuikov
4a580877bd drm/amdgpu: Get DRM dev from adev by inline-f
Add a static inline adev_to_drm() to obtain
the DRM device pointer from an amdgpu_device pointer.

Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 13:06:06 -04:00
Luben Tuikov
1348969ab6 drm/amdgpu: drm_device to amdgpu_device by inline-f (v2)
Get the amdgpu_device from the DRM device by use
of an inline function, drm_to_adev(). The inline
function resolves a pointer to struct drm_device
to a pointer to struct amdgpu_device.

v2: Use a typed visible static inline function
    instead of an invisible macro.

Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 13:06:06 -04:00
Dennis Li
08ebb485f0 drm/amdgpu: annotate a false positive recursive locking
Re-apply commit 72e14ebf9f

[  584.110304] ============================================
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
[  584.111164] --------------------------------------------
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112]
               but task is already holding lock:
[  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068]
               other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]        CPU0
[  584.114685]        ----
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
                *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
[  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.119222]
               stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0
[  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
[  584.123358]  ? cancel_delayed_work+0xa6/0xc0
[  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.124599]  down_write+0x49/0x120
[  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[  584.126789]  process_one_work+0x29e/0x630
[  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90
[  584.128014]  kthread+0x12f/0x150
[  584.128402]  ? process_one_work+0x630/0x630
[  584.128790]  ? kthread_park+0x90/0x90
[  584.129174]  ret_from_fork+0x3a/0x50

Each adev has owned lock_class_key to avoid false positive
recursive locking.

v2:
1. register adev->lock_key into lockdep, otherwise lockdep will
report the below warning

[ 1216.705820] BUG: key ffff890183b647d0 has not been registered!
[ 1216.705924] ------------[ cut here ]------------
[ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
[ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210

v3:
change to use down_write_nest_lock to annotate the false dead-lock
warning.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 13:05:52 -04:00
Dennis Li
d95e8e97e2 drm/amdgpu: refine create and release logic of hive info
Change to dynamically create and release hive info object,
which help driver support more hives in the future.

v2:
Change to save hive object pointer in adev, to avoid locking
xgmi_mutex every time when calling amdgpu_get_xgmi_hive.

v3:
1. Change type of hive object pointer in adev from void* to
amdgpu_hive_info*.
2. remove unnecessary variable initialization.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 12:24:14 -04:00
Dennis Li
aac891685d drm/amdgpu: refine message print for devices of hive
Using dev_xxx instead of DRM_xxx/pr_xxx to indicate which device
of a hive is the message for.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 12:24:06 -04:00
Dennis Li
cbfd17f7ba drm/amdgpu: fix the nullptr issue when reenter GPU recovery
in single gpu system, if driver reenter gpu recovery,
amdgpu_device_lock_adev will return false, but hive is
nullptr now.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 12:23:54 -04:00
Dennis Li
6049db43d6 drm/amdgpu: change reset lock from mutex to rw_semaphore
clients don't need reset-lock for synchronization when no
GPU recovery.

v2:
change to return the return value of down_read_killable.

v3:
if GPU recovery begin, VF ignore FLR notification.

Reviewed-by: Monk Liu <monk.liu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 12:23:48 -04:00
Dennis Li
53b3f8f40e drm/amdgpu: refine codes to avoid reentering GPU recovery
if other threads have holden the reset lock, recovery will
fail to try_lock. Therefore we introduce atomic hive->in_reset
and adev->in_gpu_reset, to avoid reentering GPU recovery.

v2:
drop "? true : false" in the definition of amdgpu_in_reset

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 12:22:56 -04:00
jqdeng
9a1cddd637 drm/amdgpu: Fix repeatly flr issue
Only for no job running test case need to do recover in
flr notification.
For having job in mirror list, then let guest driver to
hit job timeout, and then do recover.

Signed-off-by: jqdeng <Emily.Deng@amd.com>
Acked-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-18 18:22:02 -04:00
Christian König
f1403342eb drm/amdgpu: revert "fix system hang issue during GPU reset"
The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems like this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1aa2.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:22:40 -04:00
Evan Quan
f10bb940d8 drm/amd/powerplay: optimize the interface for mgpu fan boost enablement
Cover the implementation details from outside(of power). Also preparing
for expanding this to swSMU.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:22:40 -04:00
Dennis Li
72e14ebf9f drm/amdgpu: annotate a false positive recursive locking
[  584.110304] ============================================
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
[  584.111164] --------------------------------------------
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112]
               but task is already holding lock:
[  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068]
               other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]        CPU0
[  584.114685]        ----
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
                *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
[  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.119222]
               stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0
[  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
[  584.123358]  ? cancel_delayed_work+0xa6/0xc0
[  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.124599]  down_write+0x49/0x120
[  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[  584.126789]  process_one_work+0x29e/0x630
[  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90
[  584.128014]  kthread+0x12f/0x150
[  584.128402]  ? process_one_work+0x630/0x630
[  584.128790]  ? kthread_park+0x90/0x90
[  584.129174]  ret_from_fork+0x3a/0x50

Each adev has owned lock_class_key to avoid false positive
recursive locking.

v2:
1. register adev->lock_key into lockdep, otherwise lockdep will
report the below warning

[ 1216.705820] BUG: key ffff890183b647d0 has not been registered!
[ 1216.705924] ------------[ cut here ]------------
[ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
[ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210

v3:
change to use down_write_nest_lock to annotate the false dead-lock
warning.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:22:40 -04:00
Thomas Zimmermann
534b1f9071 Merge drm/drm-next into drm-misc-next
Backmerging drm-next into drm-misc-next for nouveau and panel updates.
Resolves a conflict between ttm and nouveau, where struct ttm_mem_res got
renamed to struct ttm_resource.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2020-08-12 20:42:08 +02:00
Liu ChengZhe
6b6124bb4a drm/amdgpu: fix PSP autoload twice in FLR
Assigning false to block->status.hw overwrites PSP's previous
hardware status, which causes the PSP to Resume operation after
hardware init.

Remove this assignment and let the PSP execute Resume operation
when it is told to.

v2: Remove the braces.
v3: Modify the description.

Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-06 16:44:41 -04:00
Colin Ian King
c16ce56240 drm/amdgpu: fix spelling mistake "paramter" -> "parameter"
There is a spelling mistake in a dev_warn message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-06 15:43:26 -04:00
Dave Airlie
6c28aed6e5 drm/amdgfx/ttm: use wrapper to get ttm memory managers
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200804025632.3868079-38-airlied@gmail.com
2020-08-06 12:32:03 +10:00
Alex Deucher
72de33f8f7 drm/amdgpu: move IP discovery data to mman
It's related to the memory manager so move it there.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:29:29 -04:00
Monk Liu
a300de40f6 drm/amdgpu: introduce a new parameter to configure how many KCQ we want(v5)
what:
the MQD's save and restore of KCQ (kernel compute queue)
cost lots of clocks during world switch which impacts a lot
to multi-VF performance

how:
introduce a paramter to control the number of KCQ to avoid
performance drop if there is no kernel compute queue needed

notes:
this paramter only affects gfx 8/9/10

v2:
refine namings

v3:
choose queues for each ring to that try best to cross pipes evenly.

v4:
fix indentation
some cleanupsin the gfx_compute_queue_acquire()

v5:
further fix on indentations
more cleanupsin gfx_compute_queue_acquire()

TODO:
in the future we will let hypervisor driver to set this paramter
automatically thus no need for user to configure it through
modprobe in virtual machine

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:27:29 -04:00
Guchun Chen
e8fbaf0342 drm/amdgpu: break GPU recovery once it's in bad state(v4)
When GPU executes recovery and retriving bad GPU tag
from external eerpom device, the recovery will be broken
and error message is printed as well for user's awareness.

v2: Refine warning message in threshold reaching case, and
    fix spelling typo.

v3: Fix explicit calling of bad gpu.

v4: Rename function names.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:54 -04:00
Guchun Chen
b82e65a935 drm/amdgpu: break driver init process when it's bad GPU(v5)
When retrieving bad gpu tag from eeprom, GPU init should
fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

v4: Fix missing check of returning value of i2c
    initialization error case.

v5: Use dev_err to print PCI information in dmesg instead
    of DRM_ERROR.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:31 -04:00
Liu ChengZhe
392cf6a739 drm/amdgpu: fix PSP autoload twice in FLR
Assigning false to block->status.hw overwrites PSP's previous
hardware status, which causes the PSP to Resume operation after
hardware init.

Remove this assignment and let the PSP execute Resume operation
when it is told to.

v2: Remove the braces.
v3: Modify the description.

Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-30 15:36:44 -04:00
Mauro Rossi
64200c468f drm/amdgpu: enable DC support for SI parts (v2)
[Why]
amdgpu_device.c requires changes for SI chipsets support
si.c require changes for Display Manager IP block enabling

[How]
amdgpu_device.c: add SI families in amdgpu_device_asic_has_dc_support()
si.c: changes in si_set_ip_blocks() for Display Manager IP blocks enablement

(v1) NOTE: As per Kaveri and older amdgpu.dc=1 kernel cmdline is required

(v2) fix for bc011f9350 ("drm/amdgpu: Change SI/CI gfx/sdma/smu init sequence")
     remove CHIP_HAINAN support since it does not have physical DCE6 module

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mauro Rossi <issor.oruam@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-28 09:22:48 -04:00
Dennis Li
df9c8d1aa2 drm/amdgpu: fix system hang issue during GPU reset
when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
Suggested-by: Luben Tukov <luben.tuikov@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-27 16:21:37 -04:00
Bhawanpreet Lakha
a6c5308f2a drm/amd/display: add DC support for navy flounder
Plumb DC support for navy flounder through.

Signed-off-by: Bhawanpreet Lakha <Bhawanpreet.Lakha@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-15 13:27:26 -04:00
Jiansong Chen
41f446bf52 drm/amdgpu: set asic family and ip blocks for navy_flounder
Add the asic family and IP blocks for navy flounder.

Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-15 12:45:48 -04:00
Jiansong Chen
120eb83336 drm/amdgpu: add navy_flounder gpu info firmware
Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-15 12:45:43 -04:00
Jiansong Chen
ddd8fbe77d drm/amdgpu: add navy_flounder asic type
Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-15 12:45:39 -04:00
Wenhui Sheng
bb5c7235ea drm/amdgpu: RAS emergency restart logic refine
If we are in RAS triggered situation and
BACO isn't support, emergency restart is needed,
and this code is only needed for some specific
cases(vega20 with given smu fw version).

After we add smu mode1 reset for sienna cichlid, we
need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset,
so in amdgpu_device_gpu_recover, we need differentiate
which mode1 reset we are using, then decide if it's
a full reset and then decide if emergency restart is needed,
the logic will become much more complex.

After discussion with Hawking, move emergency restart logic
to an independent function.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-15 12:41:47 -04:00
Nirmoy Das
2b9f78481b drm/amdgpu: minor cleanup of phase1 suspend code
Cleanup of phase1 suspend code to reduce unnecessary indentation.

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-10 17:42:42 -04:00
Likun Gao
131a3c7474 drm/amdgpu: enable gpu recovery for sienna cichlid
Enable gpu recovery for sienna cichlid by default to trigger
gpu recovery once needed.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-10 17:41:06 -04:00
Hawking Zhang
e78b579d2d Revert "drm/amdgpu: support access regs outside of mmio bar"
This reverts commit 2eee0229f6.
Fallback to a stable base until we have a correct new one

Signed-off-by:Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-02 12:02:56 -04:00
Monk Liu
2c738637ba drm/amdgpu: make IB test synchronize with init for SRIOV(v2)
issue:
originally we kickoff IB test asynchronously with driver's
init, thus
the IB test may still running when the driver loading
done (modprobe amdgpu done).
if we shutdown VM immediately after amdgpu driver
loaded then GPU may
hang because the IB test is still running

fix:
flush the delayed_init routine at the bottom of device_init
to avoid driver loading done prior to the IB test completes

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-02 12:02:55 -04:00
Wenhui Sheng
e3a4d51c76 drm/amdgpu: merge atombios init block
After we move request full access before set
ip blocks, we can merge atombios init block
of SRIOV and baremetal together.

Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-02 12:02:51 -04:00
Wenhui Sheng
00a979f3d6 drm/amdgpu: invoke req full access early enough
From SIENNA_CICHLID, HW introduce a new protection
feature which can control the FB, doorbell and MMIO
write access for VF, so guest driver should request
full access before ip discovery, or we couldn't access
ip discovery data in FB.

Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-02 12:02:51 -04:00
Nirmoy Das
75e1658ea0 drm/amdgpu: call release_firmware() without a NULL check
The release_firmware() function is NULL tolerant so we do not need
to check for NULL param before calling it.

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01 01:59:27 -04:00
Kevin Wang
5d5bd5e32e drm/amdgpu: restrict the hw sched jobs number to power of two
the module parameter sched_hw_submission is probably from user mode,
and the kernel need to check whether it is legal.

1. align hw sched jobs to power of 2 and set minimum number is 2.
2. use kernel api is_power_of_2() to simplify driver code.

Signed-off-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01 01:59:23 -04:00
Colton Lewis
282fd22b46 drm/amd: correct trivial kernel-doc inconsistencies
Silence documentation warnings by correcting kernel-doc comments.

./drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3388: warning: Excess function parameter 'suspend' description in 'amdgpu_device_suspend'
./drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3485: warning: Excess function parameter 'resume' description in 'amdgpu_device_resume'
./drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c:418: warning: Excess function parameter 'tbo' description in 'amdgpu_vram_mgr_del'
./drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c:418: warning: Excess function parameter 'place' description in 'amdgpu_vram_mgr_del'
./drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c:279: warning: Excess function parameter 'tbo' description in 'amdgpu_gtt_mgr_del'
./drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c:279: warning: Excess function parameter 'place' description in 'amdgpu_gtt_mgr_del'
./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:332: warning: Function parameter or member 'hdcp_workqueue' not described in 'amdgpu_display_manager'
./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:332: warning: Function parameter or member 'cached_dc_state' not described in 'amdgpu_display_manager'

Signed-off-by: Colton Lewis <colton.w.lewis@protonmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01 01:59:19 -04:00
Alex Deucher
b7221f2b46 drm/amdgpu: skip BAR resizing if the bios already did it
No need to do it again.

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01 01:59:17 -04:00
Stanley.Yang
5278a159cf drm/amdgpu: support reserve bad page for virt (v3)
v1: rename some functions name, only init ras error handler data for
    supported asic.

v2: fix potential memory leak.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01 01:59:17 -04:00
Tianci.Yin
cc375d8c52 drm/amdgpu: temporarily read bounding box from gpu_info fw for navi12
The bounding box is still needed by Navi12, temporarily read it from gpu_info
firmware. Should be droped when DAL no longer needs it.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Xiaojie Yuan <xiaojie.yuan@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Tianci.Yin <tianci.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01 01:59:15 -04:00
Jerry (Fangzhi) Zuo
81d9bfb8c5 drm/amdgpu/dc: Add missing Sienna_Cichlid chip id
Signed-off-by: Jerry (Fangzhi) Zuo <Jerry.Zuo@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01 01:59:10 -04:00
Likun Gao
11e8aef52e drm/amdgpu: set asic family and ip blocks for sienna_cichlid
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-06-03 13:51:57 -04:00
Likun Gao
c0a43457dc drm/amdgpu: add sienna_cichlid gpu info firmware v2
gpu info fw contains chip specific parameters.

v2: fix fw_name

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-06-03 13:51:57 -04:00
Likun Gao
ccaf72d3c2 drm/amdgpu: add sienna_cichlid asic type
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-06-03 13:51:56 -04:00
Alex Deucher
4292b0b202 drm/amdgpu: clean up discovery testing
Rather than checking of the variable is enabled and the
chip is the right family check for the presence of the
discovery table.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-29 13:55:07 -04:00
Alex Deucher
258620d0b3 drm/amdgpu: skip gpu_info firmware if discovery info is available
The GPU info firmware is only applicable at bring up when the
IP discovery table is not present.  If it's available, use that
first and then fallback to parsing the gpu info firmware.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-29 13:55:07 -04:00
Evan Quan
b265bdbd9f drm/amdgpu: added a sysfs interface for thermal throttling related V4
User can check and set the enablement of throttling logging and
the interval between each logging.

V2: simplify the sysfs interface(no string parsing)
V3: add proper lock protection on updating throttling_logging_rs.interval
V4: documentation cosmetic per Luben's suggestion

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-29 13:55:07 -04:00
Alex Deucher
da87c30b17 drm/amdgpu: put some case statments in family order
SI and CIK came before VI and newer asics.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-28 14:00:49 -04:00
Alex Deucher
e1ad2d53bc drm/amdgpu: simplify CZ/ST and KV/KB/ML checks
Just check for APU.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-28 14:00:49 -04:00
Alex Deucher
70534d1ee8 drm/amdgpu: simplify raven and renoir checks
Just check for APU.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-28 14:00:49 -04:00
Alex Deucher
54f78a7655 drm/amdgpu: add apu flags (v2)
Add some APU flags to simplify handling of different APU
variants.  It's easier to understand the special cases
if we use names flags rather than checking device ids and
silicon revisions.

v2: rebase on latest code

Acked-by: Evan Quan <evan.quan@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-22 13:41:53 -04:00
Alex Deucher
6e29c227a4 drm/amdgpu: move gpu_info parsing after common early init
We need to get the silicon revision id before we parse
the firmware in order to load the correct gpu info firmware
for raven2 variants.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1103
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-22 13:41:32 -04:00
Alex Deucher
6ba57b7a8f drm/amdgpu: move discovery gfx config fetching
Move it into the fw_info function since it's logically part
of the same functionality.

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-22 13:41:22 -04:00
Jiange Zhao
728e7e0cd6 drm/amdgpu: Add autodump debugfs node for gpu reset v8
When GPU got timeout, it would notify an interested part
of an opportunity to dump info before actual GPU reset.

A usermode app would open 'autodump' node under debugfs system
and poll() for readable/writable. When a GPU reset is due,
amdgpu would notify usermode app through wait_queue_head and give
it 10 minutes to dump info.

After usermode app has done its work, this 'autodump' node is closed.
On node closure, amdgpu gets to know the dump is done through
the completion that is triggered in release().

There is no write or read callback because necessary info can be
obtained through dmesg and umr. Messages back and forth between
usermode app and amdgpu are unnecessary.

v2: (1) changed 'registered' to 'app_listening'
    (2) add a mutex in open() to prevent race condition

v3 (chk): grab the reset lock to avoid race in autodump_open,
          rename debugfs file to amdgpu_autodump,
          provide autodump_read as well,
          style and code cleanups

v4: add 'bool app_listening' to differentiate situations, so that
    the node can be reopened; also, there is no need to wait for
    completion when no app is waiting for a dump.

v5: change 'bool app_listening' to 'enum amdgpu_autodump_state'
    add 'app_state_mutex' for race conditions:
	(1)Only 1 user can open this file node
	(2)wait_dump() can only take effect after poll() executed.
	(3)eliminated the race condition between release() and
	   wait_dump()

v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex'
    removed state checking in amdgpu_debugfs_wait_dump
    Improve on top of version 3 so that the node can be reopened.

v7: move reinit_completion into open() so that only one user
    can open it.

v8: remove complete_all() from amdgpu_debugfs_wait_dump().

Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-18 11:23:37 -04:00
Nirmoy Das
77f3a5cd70 drm/amdgpu: cleanup sysfs file handling
Create sysfs file using attributes.

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-08 14:32:24 -04:00
Christian König
9d11eb0d0c drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2
This should speed up debugging VRAM access a lot.

v2: add HDP flush/invalidate

Unrevert: RAS issue at root of the issue has been addressed

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Jonathan Kim <Jonathan.Kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Signed-off-by: Kent Russell <kent.russell@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-06 16:50:40 -04:00
Nathan Chancellor
54b7feb93f drm/amdgpu: Avoid integer overflow in amdgpu_device_suspend_display_audio
When building with Clang:

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4160:53: warning: overflow in
expression; result is -294967296 with type 'long' [-Winteger-overflow]
                expires = ktime_get_mono_fast_ns() + NSEC_PER_SEC * 4L;
                                                                  ^
1 warning generated.

Multiplication happens first due to order of operations and both
NSEC_PER_SEC and 4 are long literals so the expression overflows. To
avoid this, make 4 an unsigned long long literal, which matches the
type of expires (u64).

Fixes: 3f12acc8d6 ("drm/amdgpu: put the audio codec into suspend state before gpu reset V3")
Link: https://github.com/ClangBuiltLinux/linux/issues/1017
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-05 13:12:55 -04:00
Evan Quan
3f12acc8d6 drm/amdgpu: put the audio codec into suspend state before gpu reset V3
At default, the autosuspend delay of audio controller is 3S. If the
gpu reset is triggered within 3S(after audio controller idle),
the audio controller may be unable into suspended state. Then
the sudden gpu reset will cause some audio errors. The change
here is targeted to resolve this.

However if the audio controller is in use when the gpu reset
triggered, this change may be still not enough to put the
audio controller into suspend state. Under this case, the
gpu reset will still proceed but there will be a warning
message printed("failed to suspend display audio").

V2: limit this for BACO and mode1 reset only
V3: try 1st to use pm_runtime_autosuspend_expiration() to
    query how much time is left. Use default setting on
    failure

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-30 16:48:10 -04:00
Luben Tuikov
c6252390fc drm/amdgpu: implement TMZ accessor (v3)
Implement an accessor of adev->tmz.enabled. Let not
code around access it as "if (adev->tmz.enabled)"
as the organization may change. Instead...

Recruit "bool amdgpu_is_tmz(adev)" to return
exactly this Boolean value. That is, this function
is now an accessor of an already initialized and
set adev and adev->tmz.

Add "void amdgpu_gmc_tmz_set(adev)" to check and
set adev->gmc.tmz_enabled at initialization
time. After which one uses "bool
amdgpu_is_tmz(adev)" to query whether adev
supports TMZ.

Also, remove circular header file include.

v2: Remove amdgpu_tmz.[ch] as requested.
v3: Move TMZ into GMC.

Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-28 16:20:29 -04:00
Huang Rui
01a8dcec1a drm/amdgpu: add function to check tmz capability (v4)
Add a function to check tmz capability with kernel parameter and ASIC type.

v2: use a per device tmz variable instead of global amdgpu_tmz.
v3: refine the comments for the function. (Luben)
v4: add amdgpu_tmz.c/h for future use.

Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-28 16:20:28 -04:00
Jason Yan
b6e79d9a31 drm/amdgpu: remove conversion to bool in amdgpu_device.c
The '>' expression itself is bool, no need to convert it to bool again.
This fixes the following coccicheck warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3004:68-73: WARNING:
conversion to bool not needed here

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-27 15:52:16 -04:00
Evan Quan
fde812b32c drm/amdgpu: drop redundant cg/pg ungate on runpm enter
CG/PG ungate is already performed in ip_suspend_phase1. Otherwise,
the CG/PG ungate will be performed twice. That will cause gfxoff
disablement is performed twice also on runpm enter while gfxoff
enablemnt once on rump exit. That will put gfxoff into disabled
state.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-27 15:51:56 -04:00
Evan Quan
94fa566058 drm/amdgpu: move kfd suspend after ip_suspend_phase1
This sequence change should be safe as what did in ip_suspend_phase1
is to suspend DCE only. And this is a prerequisite for coming
redundant cg/pg ungate dropping.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-27 15:51:49 -04:00
Dennis Li
a891d239f9 drm/amdgpu: set error query ready after all IPs late init
If set error query ready in amdgpu_ras_late_init, which will
cause some IP blocks aren't initialized, but their error query
is ready.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:49 -04:00
Evan Quan
7dd8c205ea drm/amdgpu: code cleanup around gpu reset
Make code more readable.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:49 -04:00
Evan Quan
9e94d22c00 drm/amdgpu: optimize the gpu reset for XGMI setup V2
This is basically just some code cosmetic. The current design
for XGMI setup gput reset is to operate on current device(adev)
first and then on other devices from the hive(by another 'for' loop).
But actually we can do some sort to the device list(to put current
device 1st position) and handle all the devices in a single 'for'
loop.

V2: added missing hive->hive_lock protection

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:49 -04:00
Evan Quan
52fb44cf30 drm/amdgpu: correct cancel_delayed_work_sync on gpu reset
As for XGMI setup, it should be performed on other devices
from the hive also.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:49 -04:00
Evan Quan
a2f63ee8b5 drm/amdgpu: correct fbdev suspend on gpu reset
As for XGMI setup, it needs to be performed on
all the devices from the same hive.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:49 -04:00
Kevin Wang
e05185b341 drm/amdgpu: clean up unused variable about ring lru
clean up unused variable:
1. ring_lru_list
2. ring_lru_list_lock

related-commit:
drm/amdgpu: remove ring lru handling

Signed-off-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:49 -04:00
Yong Zhao
d69b8971e5 drm/amdgpu: Print CU information by default during initialization
This is convenient for multiple teams to obtain the information. Also,
add device info by using dev_info().

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:49 -04:00
Jonathan Kim
d84a430d9f drm/amdgpu: fix race between pstate and remote buffer map
Vega20 arbitrates pstate at hive level and not device level. Last peer to
remote buffer unmap could drop P-State while another process is still
remote buffer mapped.

With this fix, P-States still needs to be disabled for now as SMU bug
was discovered on synchronous P2P transfers.  This should be fixed in the
next FW update.

Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:46 -04:00
Kent Russell
fdd21e62b0 Revert "drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2"
This reverts commit c12b84d6e0.

The original patch causes a RAS event and subsequent kernel hard-hang
when running the KFDMemoryTest.PtraceAccessInvisibleVram on VG20 and
Arcturus

dmesg output at hang time:
[drm] RAS event of type ERREVENT_ATHUB_INTERRUPT detected!
amdgpu 0000:67:00.0: GPU reset begin!
Evicting PASID 0x8000 queues
Started evicting pasid 0x8000
qcm fence wait loop timeout expired
The cp might be in an unrecoverable state due to an unsuccessful queues preemption
Failed to evict process queues
Failed to suspend process 0x8000
Finished evicting pasid 0x8000
Started restoring pasid 0x8000
Finished restoring pasid 0x8000
[drm] UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT
amdgpu: [powerplay] Failed to send message 0x26, response 0x0
amdgpu: [powerplay] Failed to set soft min gfxclk !
amdgpu: [powerplay] Failed to upload DPM Bootup Levels!
amdgpu: [powerplay] Failed to send message 0x7, response 0x0
amdgpu: [powerplay] [DisableAllSMUFeatures] Failed to disable all smu features!
amdgpu: [powerplay] [DisableDpmTasks] Failed to disable all smu features!
amdgpu: [powerplay] [PowerOffAsic] Failed to disable DPM!
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <powerplay> failed -5

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:46 -04:00
Prike Liang
ced1ba9761 drm/amdgpu: fix the hw hang during perform system reboot and reset
The system reboot failed as some IP blocks enter power gate before perform
hw resource destory. Meanwhile use unify interface to set device CGPG to ungate
state can simplify the amdgpu poweroff or reset ungate guard.

Fixes: 487eca11a3 ("drm/amdgpu: fix gfx hang during suspend with video playback (v2)")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Tested-by: Mengbing Wang <Mengbing.Wang@amd.com>
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:45 -04:00
Aurabindo Pillai
dd4fa6c1b8 drm/amd/amdgpu: remove hardcoded module name in prints
Let format prefixes take care of printing the module name
through pr_fmt and dev_fmt definitions.

Signed-off-by: Aurabindo Pillai <mail@aurabindo.in>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-13 12:02:40 -04:00
Evan Quan
dadce777e0 drm/amdgpu: fix wrong vram lost counter increment V2
Vram lost counter is wrongly increased by two during baco reset.

V2: assumed vram lost for mode1 reset on all ASICs

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-13 12:02:08 -04:00
Hawking Zhang
2eee0229f6 drm/amdgpu: support access regs outside of mmio bar
add indirect access support to registers outside of
mmio bar.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:18 -04:00
Hawking Zhang
f384ff95f6 drm/amdgpu: retire AMDGPU_REGS_KIQ flag
all the register access through kiq is redirected
to amdgpu_kiq_rreg/amdgpu_kiq_wreg

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:18 -04:00
Hawking Zhang
ec59847e74 drm/amdgpu: retire RREG32_IDX/WREG32_IDX
those are not needed anymore

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:18 -04:00
Hawking Zhang
dec0520aff drm/amdgpu: remove inproper workaround for vega10
the workaround is not needed for soc15 ASICs except
for vega10. it is even not needed with latest vega10
vbios.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:18 -04:00
Prike Liang
a23ca7f76d drm/amdgpu: fix gfx hang during suspend with video playback (v2)
The system will be hang up during S3 suspend because of SMU is pending
for GC not respose the register CP_HQD_ACTIVE access request.This issue
root cause of accessing the GC register under enter GFX CGGPG and can
be fixed by disable GFX CGPG before perform suspend.

v2: Use disable the GFX CGPG instead of RLC safe mode guard.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Tested-by: Mengbing Wang <Mengbing.Wang@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:17 -04:00
Jack Zhang
b639c22c98 drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset
[PATCH 2/2]
kfd_pre_reset will free mem_objs allocated by kfd_gtt_sa_allocate

Without this change, sriov tdr code path will never free those
allocated memories and get memory leak.

Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:15 -04:00
Jack Zhang
fe9824d15e drm/amdkfd Avoid destroy hqd when GPU is on reset
This reverts commit 5161bba4311f in order to split it into two
different patches, and this will make it easier to understand.

[PATCH 1/2]
porting to gfx10 from
commit 1b0bfcff46 ("drm/amdgpu: Avoid destroy hqd when GPU is on reset")

Originally, MEC is touched
without GPU initialized first.

Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:15 -04:00
Nirmoy Das
1c6d567bdf drm/amdgpu: rework sched_list generation
Generate HW IP's sched_list in amdgpu_ring_init() instead of
amdgpu_ctx.c. This makes amdgpu_ctx_init_compute_sched(),
ring.has_high_prio and amdgpu_ctx_init_sched() unnecessary.
This patch also stores sched_list for all HW IPs in one big
array in struct amdgpu_device which makes amdgpu_ctx_init_entity()
much more leaner.

v2:
fix a coding style issue
do not use drm hw_ip const to populate amdgpu_ring_type enum

v3:
remove ctx reference and move sched array and num_sched to a struct
use num_scheds to detect uninitialized scheduler list

v4:
use array_index_nospec for user space controlled variables
fix possible checkpatch.pl warnings

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:14 -04:00
Jack Zhang
04bef61e5d drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset
kfd_pre_reset will free mem_objs allocated by kfd_gtt_sa_allocate

Without this change, sriov tdr code path will never free those allocated
memories and get memory leak.

v2:add a bugfix for kiq ring test fail

Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:14 -04:00
Jiawei
b7b2a316b9 drm/amdgpu: extend compute job timeout
extend compute lockup timeout to 60000 for SR-IOV.

Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Jiawei <Jiawei.Gu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:43 -04:00
Monk Liu
2f2941324c drm/amdgpu: postpone entering fullaccess mode
if host support new handshake we only need to enter
fullaccess_mode in ip_init() part, otherwise we need
to do it before reading vbios (becuase host prepares vbios
for VF only after received REQ_GPU_INIT event under
legacy handshake)

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:43 -04:00
Monk Liu
dffa11b4f7 drm/amdgpu: adjust sequence of ip_discovery init and timeout_setting
what:
1)move timtout setting before ip_early_init to reduce exclusive mode
cost for SRIOV

2)move ip_discovery_init() to inside of amdgpu_discovery_reg_base_init()
it is a prepare for the later upcoming patches.

why:
in later upcoming patches we would use a new mailbox event --
"req_gpu_init_data", which is a callback hooked in adev->virt.ops and
this callback send a new event "REQ_GPU_INIT_DAT" to host to notify
host to do some preparation like "IP discovery/vbios on the VF FB"
and this callback must be:

A) invoked after set_ip_block() because virt.ops is configured during
set_ip_block()

B) invoked before ip_discovery_init() becausen ip_discovery_init()
need host side prepares everything in VF FB first.

current place of ip_discovery_init() is before we can invoke callback
of adev->virt.ops, thus we must move ip_discovery_init() to a place
after the adev->virt.ops all settle done, and the perfect place is in
amdgpu_discovery_reg_base_init()

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:43 -04:00
Monk Liu
122078de16 drm/amdgpu: equip new req_init_data handshake
by this new handshake host side can prepare vbios/ip-discovery
and pf&vf exchange data upon recieving this request without
stopping world switch.

this way the world switch is less impacted by VF's exclusive mode
request

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:43 -04:00
John Clements
61380faa4b drm/amdgpu: disable ras query and iject during gpu reset
added flag to ras context to indicate if ras query functionality is ready

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:42 -04:00
Monk Liu
3aa0115d23 drm/amdgpu: cleanup all virtualization detection routine
we need to move virt detection much earlier because:
1) HW team confirms us that RCC_IOV_FUNC_IDENTIFIER will always
be at DE5 (dw) mmio offset from vega10, this way there is no
need to implement detect_hw_virt() routine in each nbio/chip file.
for VI SRIOV chip (tonga & fiji), the BIF_IOV_FUNC_IDENTIFIER is at
0x1503

2) we need to acknowledged we are SRIOV VF before we do IP discovery because
the IP discovery content will be updated by host everytime after it recieved
a new coming "REQ_GPU_INIT_DATA" request from guest (there will be patches
for this new handshake soon).

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:42 -04:00
Kent Russell
bd607166af drm/amdgpu: Enable reading FRU chip via I2C v3
Allow for reading of information like manufacturer, product number
and serial number from the FRU chip. Report the serial number as
the new sysfs file serial_number. Note that this only works on
server cards, as consumer cards do not feature the FRU chip, which
contains this information.

v2: Add documentation to amdgpu.rst, add helper functions,
    rename functions for consistency, fix bad starting offset
v3: Remove testing definitions

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:41 -04:00
John Clements
43c4d57618 drm/amdgpu: protect RAS sysfs during GPU reset
MMHub EDC becomes dirty after BACO reset

EDC registers should be cleared early on in reset phase

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-20 10:45:00 -04:00
Monk Liu
2e0cc4d48b drm/amdgpu: revise RLCG access path
what changed:
1)provide new implementation interface for the rlcg access path
2)put SQ_CMD/SQ_IND_INDEX to GFX9 RLCG path to let debugfs's reg_op
function can access reg that need RLCG path help

now even debugfs's reg_op can used to dump wave.

tested-by: Monk Liu <monk.liu@amd.com>
tested-by: Zhou pengju <pengju.zhou@amd.com>
Signed-off-by: Zhou pengju <pengju.zhou@amd.com>
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-16 16:17:55 -04:00
Evan Quan
565d194155 drm/amdgpu: add fbdev suspend/resume on gpu reset
This can fix the baco reset failure seen on Navi10.
And this should be a low risk fix as the same sequence
is already used for system suspend/resume.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-13 11:52:34 -04:00
Monk Liu
752c683dbb drm/amdgpu: fix IB test MCBP bug
1)for gfx IB test we shouldn't insert DE meta data

2)we should make sure IB test finished before we
send event 3 to hypervisor otherwise the IDLE from
event 3 will preempt IB test, which is not designed
as a compatible structure for MCBP

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-05 00:28:11 -05:00
Yintian Tao
d2790e10d3 drm/amdgpu: no need to clean debugfs at amdgpu
drm_minor_unregister will invoke drm_debugfs_cleanup
to clean all the child node under primary minor node.
We don't need to invoke amdgpu_debugfs_fini and
amdgpu_debugfs_regs_cleanup to clean agian.
Otherwise, it will raise the NULL pointer like below.
[   45.046029] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
[   45.047256] PGD 0 P4D 0
[   45.047713] Oops: 0002 [#1] SMP PTI
[   45.048198] CPU: 0 PID: 2796 Comm: modprobe Tainted: G        W  OE     4.18.0-15-generic #16~18.04.1-Ubuntu
[   45.049538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[   45.050651] RIP: 0010:down_write+0x1f/0x40
[   45.051194] Code: 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 ce d9 ff ff 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 53 1c ff ff 65 48 8b 04 25 00 5c 01
[   45.053702] RSP: 0018:ffffad8f4133fd40 EFLAGS: 00010246
[   45.054384] RAX: 00000000000000a8 RBX: 00000000000000a8 RCX: ffffa011327dd814
[   45.055349] RDX: ffffffff00000001 RSI: 0000000000000001 RDI: 00000000000000a8
[   45.056346] RBP: ffffad8f4133fd48 R08: 0000000000000000 R09: ffffffffc0690a00
[   45.057326] R10: ffffad8f4133fd58 R11: 0000000000000001 R12: ffffa0113cff0300
[   45.058266] R13: ffffa0113c0a0000 R14: ffffffffc0c02a10 R15: ffffa0113e5c7860
[   45.059221] FS:  00007f60d46f9540(0000) GS:ffffa0113fc00000(0000) knlGS:0000000000000000
[   45.060809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   45.061826] CR2: 00000000000000a8 CR3: 0000000136250004 CR4: 00000000003606f0
[   45.062913] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   45.064404] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   45.065897] Call Trace:
[   45.066426]  debugfs_remove+0x36/0xa0
[   45.067131]  amdgpu_debugfs_ring_fini+0x15/0x20 [amdgpu]
[   45.068019]  amdgpu_debugfs_fini+0x2c/0x50 [amdgpu]
[   45.068756]  amdgpu_pci_remove+0x49/0x70 [amdgpu]
[   45.069439]  pci_device_remove+0x3e/0xc0
[   45.070037]  device_release_driver_internal+0x18a/0x260
[   45.070842]  driver_detach+0x3f/0x80
[   45.071325]  bus_remove_driver+0x59/0xd0
[   45.071850]  driver_unregister+0x2c/0x40
[   45.072377]  pci_unregister_driver+0x22/0xa0
[   45.073043]  amdgpu_exit+0x15/0x57c [amdgpu]
[   45.073683]  __x64_sys_delete_module+0x146/0x280
[   45.074369]  do_syscall_64+0x5a/0x120
[   45.074916]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

v2: remove all debugfs cleanup/fini code at amdgpu
v3: squash in unused variable removal

Signed-off-by: Yintian Tao <yttao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-28 16:59:22 -05:00
Dave Airlie
a2ae604da7 Merge tag 'amd-drm-next-5.7-2020-02-26' of git://people.freedesktop.org/~agd5f/linux into drm-next
amd-drm-next-5.7-2020-02-26:

amdgpu:
- Rework VM update handling in preparation for HMM support
- HDCP srm support
- PSR fixes
- DC watermark fixes
- OLED panel support
- SR-IOV fixes
- BACO fixes
- Optimize debugging vram access
- RAS fixes
- Use BACO for runtime pm
- HDCP fixes
- XGMI fixes
- DDC fixes
- DC clock programming optimizations and fixes
- PSP fw loading sequence updates
- Drop DRIVER_USE_AGP
- Remove legacy drm load and unload callbacks

amdkfd:
- Add runtime pm support

radeon:
- Drop DRIVER_USE_AGP

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200227043142.4075-1-alexander.deucher@amd.com
2020-02-28 15:40:26 +10:00
Alex Deucher
c6385e503a drm/amdgpu: drop legacy drm load and unload callbacks
We've moved the debugfs handling into a centralized place
so we can remove the legacy load an unload callbacks.

Tested-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:21:13 -05:00
Alex Deucher
cd9e29e717 drm/amdgpu/firmware: move debugfs init into core amdgpu debugfs
In order to remove the load and unload drm callbacks,
we need to reorder the init sequence to move all the drm
debugfs file handling.  Do this for firmware.

Tested-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:21:12 -05:00
Alex Deucher
f9d64e6c4a drm/amdgpu/regs: move debugfs init into core amdgpu debugfs
In order to remove the load and unload drm callbacks,
we need to reorder the init sequence to move all the drm
debugfs file handling.  Do this for register access files.

Tested-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:21:12 -05:00
Alex Deucher
3f5cea671c drm/amdgpu/gem: move debugfs init into core amdgpu debugfs
In order to remove the load and unload drm callbacks,
we need to reorder the init sequence to move all the drm
debugfs file handling.  Do this for gem.

Tested-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:21:12 -05:00
Alex Deucher
923ffa6b02 drm/amdgpu: rename amdgpu_debugfs_preempt_cleanup
to amdgpu_debugfs_fini.  It will be used for other things in
the future.

Tested-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:21:12 -05:00
Yong Zhao
8bdab6bb1c drm/amdgpu: Increase timout on emulator to tenfold instead of twice
Since emulators are slower, sometime some operations like flushing tlb
through FM need more than twice the regular timout of 100ms, so increase
the timeout to 1s on emulators.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:21:03 -05:00
Dave Airlie
1b245ec5b6 drm-misc-next for 5.7:
UAPI Changes:
   - lima: Add support for heap buffers
 
 Cross-subsystem Changes:
 
 Core Changes:
   - Implement mode_config mode_valid for memory constrained drivers
   - Bus format negociation between bridges
   - Consolidate fake vblank events for drivers without vblank interrupts
   - drm/bufs: dma_alloc related cleanups
   - drm/dp_mst: Various fixes
   - drm/print: New drm_device based print helpers
   - Thomas is a drm-misc maintainer now!
 
 Driver Changes:
   - DPMS cleanups for atomic drivers
   - Removal of owner field in SPI tinydrm drivers
   - Removal of explicit dependency on DT for tinydrm drivers
   - Conversion to YAML schemas for DT bindings
   - tidss: New driver
   - virtio: various reworks and fixes
   - Our usual dozen or so new panels or bridges
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCXkEjOgAKCRDj7w1vZxhR
 xeaDAQD+1MludG4RmfQhATe4jTsPC1r2x63OF2CA0ChMGHXJyQEA8qqQ+8y1Cd/u
 PZ3PpcTl4qYYHgzJ6FwW7kDPTvlaZQE=
 =IJAt
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2020-02-10' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for 5.7:

UAPI Changes:
  - lima: Add support for heap buffers

Cross-subsystem Changes:

Core Changes:
  - Implement mode_config mode_valid for memory constrained drivers
  - Bus format negociation between bridges
  - Consolidate fake vblank events for drivers without vblank interrupts
  - drm/bufs: dma_alloc related cleanups
  - drm/dp_mst: Various fixes
  - drm/print: New drm_device based print helpers
  - Thomas is a drm-misc maintainer now!

Driver Changes:
  - DPMS cleanups for atomic drivers
  - Removal of owner field in SPI tinydrm drivers
  - Removal of explicit dependency on DT for tinydrm drivers
  - Conversion to YAML schemas for DT bindings
  - tidss: New driver
  - virtio: various reworks and fixes
  - Our usual dozen or so new panels or bridges

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20200210093421.xu4sofldm6wm6xq6@gilmour.lan
2020-02-21 05:44:40 +10:00
Rajneesh Bhardwaj
9593f4d6a6 drm/amdkfd: refactor runtime pm for baco
So far the kfd driver implemented same routines for runtime and system
wide suspend and resume (s2idle or mem). During system wide suspend the
kfd aquires an atomic lock that prevents any more user processes to
create queues and interact with kfd driver and amd gpu. This mechanism
created problem when amdgpu device is runtime suspended with BACO
enabled. Any application that relies on kfd driver fails to load because
the driver reports a locked kfd device since gpu is runtime suspended.

However, in an ideal case, when gpu is runtime  suspended the kfd driver
should be able to:

 - auto resume amdgpu driver whenever a client requests compute service
 - prevent runtime suspend for amdgpu  while kfd is in use

This change refactors the amdgpu and amdkfd drivers to support BACO and
runtime power management.

Reviewed-by: Oak Zeng <oak.zeng@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-12 16:00:54 -05:00
Christian König
c12b84d6e0 drm/amdgpu: use the BAR if possible in amdgpu_device_vram_access v2
This should speed up debugging VRAM access a lot.

v2: add HDP flush/invalidate

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Jonathan Kim <Jonathan.Kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-07 11:45:25 -05:00
Christian König
ce05ac56e6 drm/amdgpu: optimize amdgpu_device_vram_access a bit.
Only write the _HI register when necessary.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Jonathan Kim <Jonathan.Kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-07 11:45:17 -05:00
Jack Zhang
86b93fd62d drm/amdgpu/sriov Don't send msg when smu suspend
For sriov and pp_onevf_mode, do not send message to set smu
status, because smu doesn't support these messages under VF.

Besides, it should skip smu_suspend when pp_onevf_mode is disabled.

Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-07 11:44:24 -05:00
Alex Deucher
2cb44fb093 drm/amdgpu: enable GPU reset by default on renoir
Everything is in place.

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-27 16:46:45 -05:00
Alex Deucher
658c663947 drm/amdgpu: enable GPU reset by default on Navi
Has been working fine for a while.

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-27 16:46:45 -05:00
Chris Wilson
7e13ad8964 drm: Avoid drm_global_mutex for simple inc/dec of dev->open_count
Since drm_global_mutex is a true global mutex across devices, we don't
want to acquire it unless absolutely necessary. For maintaining the
device local open_count, we can use atomic operations on the counter
itself, except when making the transition to/from 0. Here, we tackle the
easy portion of delaying acquiring the drm_global_mutex for the final
release by using atomic_dec_and_mutex_lock(), leaving the global
serialisation across the device opens.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Thomas Hellström (VMware) <thomas_os@shipmail.org>
Reviewed-by: Thomas Hellström <thellstrom@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200124130107.125404-1-chris@chris-wilson.co.uk
2020-01-24 17:41:34 +00:00
Nirmoy Das
a9d4fe2fd6 drm/amdgpu: remove unnecessary conversion to bool
Better clean that up before some automation starts to complain about it

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-22 16:55:27 -05:00
chen gong
c68dbcd8f9 drm/amdgpu: add kiq version interface for RREG32/WREG32
Reading some registers by mmio will result in hang when GPU is in
"gfxoff" state.This problem can be solved by GPU in "ring command
packages" way.

Signed-off-by: chen gong <curry.gong@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-22 16:34:22 -05:00
chen gong
d33a99c4b6 drm/amdgpu: provide a generic function interface for reading/writing register by KIQ
Move amdgpu_virt_kiq_rreg/amdgpu_virt_kiq_wreg function to amdgpu_gfx.c,
and rename them to amdgpu_kiq_rreg/amdgpu_kiq_wreg.Make it generic and
flexible.

Signed-off-by: chen gong <curry.gong@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-22 16:34:14 -05:00
Pan, Xinhui
bd05221123 drm/amdgpu: add the lost mutex_init back
Initialize notifier_lock.

Bug: https://gitlab.freedesktop.org/drm/amd/issues/1016
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-17 16:03:09 -05:00
Hawking Zhang
e9d4cf918f drm/amdgpu: add arcturus to gpu recovery check code path
support check if dirver should try gpu recovery for
arcturus

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16 13:40:54 -05:00
Evan Quan
9530273ec9 drm/amd/powerplay: cover the powerplay implementation details V3
This can save users much troubles. As they do not
actually need to care whether swSMU or traditional
powerplay routine should be used.

V2: apply the fixes to vi.c and cik.c also
V3: squash in oops fix

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-14 10:18:08 -05:00
Jack Zhang
895bd048fb amd/amdgpu/sriov tdr enablement with pp_onevf_mode
Under sriov and pp_onevf mode,
1.take resume instead of hw_init for smc recover to avoid
potential memory leak.

2.add return condition inside smc resume function for
sriov_pp_onevf_mode and pm_enabled param.

Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-07 11:58:19 -05:00
zhengbin
2a9b90ae47 drm/amdgpu: use true, false for bool variable in amdgpu_device.c
Fixes coccicheck warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3961:1-19: WARNING: Assignment of 0/1 to bool variable
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3981:1-19: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-23 15:00:01 -05:00
Ma Feng
e3c00faa7a drm/amdgpu: Remove unneeded variable 'ret' in amdgpu_device.c
Fixes coccicheck warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1036:5-8: Unneeded variable: "ret". Return "0" on line 1079

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ma Feng <mafeng.ma@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-23 14:58:56 -05:00
Jane Jian
d83c7a07a7 drm/amdgpu: update VCN1(dual instances) fw types ID and VCN ip block type
Previously there is no VCN1 type ID in psp gfx interface. Also add VCN ip
block type unless the reinit after FLR for sriov would fail.

Signed-off-by: Jane Jian <Jane.Jian@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:33:26 -05:00
Andrey Grodzovsky
c96cf2823d drm/amdgpu: Switch from system_highpri_wq to system_unbound_wq
This is to avoid queueing jobs to same CPU during XGMI hive reset
because there is a strict timeline for when the reset commands
must reach all the GPUs in the hive.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Le Ma <Le.Ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:13 -05:00
Andrey Grodzovsky
c6a6e2db99 drm/amdgpu: Redo XGMI reset synchronization.
Use task barrier in XGMI hive to synchronize ASIC resets
across devices in XGMI hive.

v2: Return right away with a warning if no xgmi hive, update doc.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Le Ma <Le.Ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:13 -05:00
Andrey Grodzovsky
041a62bc06 drm/amdgpu: reverts commit ce316fa55e.
In preparation for doing XGMI reset synchronization using task barrier.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Le Ma <Le.Ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:12 -05:00
Monk Liu
5a7489a7e1 drm/amdgpu: fix KIQ ring test fail in TDR of SRIOV
issues:
MEC is ruined by the amdkfd_pre_reset after VF FLR done

fix:
amdkfd_pre_reset() would ruin MEC after hypervisor finished the VF FLR,
the correct sequence is do amdkfd_pre_reset before VF FLR but there is
a limitation to block this sequence:
if we do pre_reset() before VF FLR, it would go KIQ way to do register
access and stuck there, because KIQ probably won't work by that time
(e.g. you already made GFX hang)

so the best way right now is to simply remove it.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:12 -05:00
Nirmoy Das
f880799d7f amd/amdgpu: add sched array to IPs with multiple run-queues
This sched array can be passed on to entity creation routine
instead of manually creating such sched array on every context creation.

v2: squash in missing break fix

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:12 -05:00
Nirmoy Das
0c88b43032 drm/amdgpu: replace vm_pte's run-queue list with drm gpu scheds list
drm_sched_entity_init() takes drm gpu scheduler list instead of
drm_sched_rq list. This makes conversion of drm_sched_rq list
to drm gpu scheduler list unnecessary

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:12 -05:00
Emily Deng
8973d9ec8f drm/amdgpu/sriov: Tonga sriov also need load firmware with smu
Fix Tonga sriov load driver fail issue.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Reviewd-by Yintian Tao <Yintian.tao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:06 -05:00
Yong Zhao
d7f72fe482 drm/amdgpu: Add CU info print log
The log will be useful for easily getting the CU info on various
emulation models or ASICs.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:06 -05:00
Simon Ser
93b09a9a89 drm/amdgpu: log when amdgpu.dc=1 but ASIC is unsupported
This makes it easier to figure out whether the kernel parameter has been
taken into account.

Signed-off-by: Simon Ser <contact@emersion.fr>
Cc: Harry Wentland <hwentlan@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-11 15:22:08 -05:00
Yintian Tao
c9ffa427db drm/amd/powerplay: enable pp one vf mode for vega10
Originally, due to the restriction from PSP and SMU, VF has
to send message to hypervisor driver to handle powerplay
change which is complicated and redundant. Currently, SMU
and PSP can support VF to directly handle powerplay
change by itself. Therefore, the old code about the handshake
between VF and PF to handle powerplay will be removed and VF
will use new the registers below to handshake with SMU.
mmMP1_SMN_C2PMSG_101: register to handle SMU message
mmMP1_SMN_C2PMSG_102: register to handle SMU parameter
mmMP1_SMN_C2PMSG_103: register to handle SMU response

v2: remove module parameter pp_one_vf
v3: fix the parens
v4: forbid vf to change smu feature
v5: use hwmon_attributes_visible to skip sepicified hwmon atrribute
v6: change skip condition at vega10_copy_table_to_smc

Signed-off-by: Yintian Tao <yttao@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-11 15:22:07 -05:00
Le Ma
00eaa57172 drm/amdgpu: clear err_event_athub flag after reset exit
Otherwise next err_event_athub error cannot call gpu reset. And following
resume sequence will not be affected by this flag.

v2: create function to clear amdgpu_ras_in_intr for modularity of ras driver

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-05 16:26:11 -05:00
Le Ma
b823821f22 drm/amdgpu: support full gpu reset workflow when ras err_event_athub occurs
This athub fatal error can be recovered by baco without system-level reboot,
so add a mode to use baco for the recovery. Not affect the default psp reset
situations for now.

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-05 16:26:03 -05:00
Le Ma
ce316fa55e drm/amdgpu: add concurrent baco reset support for XGMI
Currently each XGMI node reset wq does not run in parrallel if bound to same
cpu. Make change to bound the xgmi_reset_work item to different cpus.

XGMI requires all nodes enter into baco within very close proximity before
any node exit baco. So schedule the xgmi_reset_work wq twice for enter/exit
baco respectively.

To use baco for XGMI, PMFW supported for baco on XGMI needs to be involved.

The case that PSP reset and baco reset coexist within an XGMI hive never exist
and is not in the consideration.

v2: define use_baco flag to simplify the code for xgmi baco sequence

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-05 16:25:53 -05:00
Le Ma
7a22677b95 drm/amdgpu: enable/disable doorbell interrupt in baco entry/exit helper
This operation is needed when baco entry/exit for ras recovery

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-05 16:25:45 -05:00
Emily Deng
0ea203a912 drm/amdgpu/sriov: No need the event 3 and 4 now
As will call unload kms when initialize fail, and the unload kms will
send event 3 and 4, so don't need event 3 and 4 in device init.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Reviewed-by: Zhan Liu <zhan.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-02 17:39:28 -05:00
Yintian Tao
7c868b592d drm/amdgpu: not remove sysfs if not create sysfs
When load amdgpu failed before create pm_sysfs and ucode_sysfs,
the pm_sysfs and ucode_sysfs should not be removed.
Otherwise, there will be warning call trace just like below.
[   24.836386] [drm] VCE initialized successfully.
[   24.841352] amdgpu 0000:00:07.0: amdgpu_device_ip_init failed
[   25.370383] amdgpu 0000:00:07.0: Fatal error during GPU init
[   25.889575] [drm] amdgpu: finishing device.
[   26.069128] amdgpu 0000:00:07.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[   26.070110] [drm:gfx_v9_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[   26.200309] [TTM] Finalizing pool allocator
[   26.200314] [TTM] Finalizing DMA pool allocator
[   26.200349] [TTM] Zone  kernel: Used memory at exit: 0 KiB
[   26.200351] [TTM] Zone   dma32: Used memory at exit: 0 KiB
[   26.200353] [drm] amdgpu: ttm finalized
[   26.205329] ------------[ cut here ]------------
[   26.205330] sysfs group 'fw_version' not found for kobject '0000:00:07.0'
[   26.205347] WARNING: CPU: 0 PID: 1228 at fs/sysfs/group.c:256 sysfs_remove_group+0x80/0x90
[   26.205348] Modules linked in: amdgpu(OE+) gpu_sched(OE) ttm(OE) drm_kms_helper(OE) drm(OE) i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_misc snd_hda_codec_generic ledtrig_audio crct10dif_pclmul snd_hda_intel crc32_pclmul snd_hda_codec ghash_clmulni_intel snd_hda_core snd_hwdep snd_pcm snd_timer input_leds snd joydev soundcore serio_raw pcspkr evbug aesni_intel aes_x86_64 crypto_simd cryptd mac_hid glue_helper sunrpc ip_tables x_tables autofs4 8139too psmouse 8139cp mii i2c_piix4 pata_acpi floppy
[   26.205369] CPU: 0 PID: 1228 Comm: modprobe Tainted: G           OE     5.2.0-rc1 #1
[   26.205370] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   26.205372] RIP: 0010:sysfs_remove_group+0x80/0x90
[   26.205374] Code: e8 35 b9 ff ff 5b 41 5c 41 5d 5d c3 48 89 df e8 f6 b5 ff ff eb c6 49 8b 55 00 49 8b 34 24 48 c7 c7 48 7a 70 98 e8 60 63 d3 ff <0f> 0b eb d7 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
[   26.205375] RSP: 0018:ffffbee242b0b908 EFLAGS: 00010282
[   26.205376] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[   26.205377] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff97ad6f817380
[   26.205377] RBP: ffffbee242b0b920 R08: ffffffff98f520c4 R09: 00000000000002b3
[   26.205378] R10: ffffbee242b0b8f8 R11: 00000000000002b3 R12: ffffffffc0e58240
[   26.205379] R13: ffff97ad6d1fe0b0 R14: ffff97ad4db954c8 R15: ffff97ad4db7fff0
[   26.205380] FS:  00007ff3d8a1c4c0(0000) GS:ffff97ad6f800000(0000) knlGS:0000000000000000
[   26.205381] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.205381] CR2: 00007f9b2ef1df04 CR3: 000000042aab8001 CR4: 00000000003606f0
[   26.205384] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   26.205385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   26.205385] Call Trace:
[   26.205461]  amdgpu_ucode_sysfs_fini+0x18/0x20 [amdgpu]
[   26.205518]  amdgpu_device_fini+0x3b4/0x560 [amdgpu]
[   26.205573]  amdgpu_driver_unload_kms+0x4f/0xa0 [amdgpu]
[   26.205623]  amdgpu_driver_load_kms+0xcd/0x250 [amdgpu]
[   26.205637]  drm_dev_register+0x12b/0x1c0 [drm]
[   26.205695]  amdgpu_pci_probe+0x12a/0x1e0 [amdgpu]
[   26.205699]  local_pci_probe+0x47/0xa0
[   26.205701]  pci_device_probe+0x106/0x1b0
[   26.205704]  really_probe+0x21a/0x3f0
[   26.205706]  driver_probe_device+0x11c/0x140
[   26.205707]  device_driver_attach+0x58/0x60
[   26.205709]  __driver_attach+0xc3/0x140

Signed-off-by: Yintian Tao <yttao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-02 17:39:05 -05:00
Alex Deucher
de185019bc drm/amdgpu: move pci handling out of pm ops
The documentation says the that PCI core handles this
for you unless you choose to implement it.  Just rely
on the PCI core to handle the pci specific bits.

Reviewed-by: Zhan Liu <zhan.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-26 14:51:03 -05:00
Alex Deucher
3840c5bcc2 drm/amdgpu: disentangle runtime pm and vga_switcheroo
Originally we only supported runtime pm on PX/HG laptops
so vga_switcheroo and runtime pm are sort of entangled.

Attempt to logically separate them.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-19 16:42:53 -05:00
Alex Deucher
361dbd01a1 drm/amdgpu: add helpers for baco entry and exit
BACO - Bus Active, Chip Off

Will be used for runtime pm.  Entry will enter the BACO
state (chip off).  Exit will exit the BACO state (chip on).

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-19 16:42:52 -05:00
Alex Deucher
31af062acf drm/amdgpu: rename amdgpu_device_is_px to amdgpu_device_supports_boco (v2)
BACO - Bus Active, Chip Off
BOCO - Bus Off, Chip Off

To better match what we are checking for and to align with
amdgpu_device_supports_baco.

BOCO is used on PowerXpress/Hybrid Graphics systems and BACO
is used on desktop dGPU boards.

v2: fix typo in documentation

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-19 16:42:50 -05:00
Alex Deucher
a69cba42b1 drm/amdgpu: add a amdgpu_device_supports_baco helper
BACO - Bus Active, Chip Off

To check if a device supports BACO or not.  This will be
used in determining when to enable runtime pm.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-19 16:42:49 -05:00
Yintian Tao
d0d13fe874 drm/amdgpu: put flush_delayed_work at first
There is one regression from 042f3d7b745cd76aa
To put flush_delayed_work after adev->shutdown = true
which will make amdgpu_ih_process not response the irq
At last, all ib ring tests will be failed just like below

[drm] amdgpu: finishing device.
[drm] Fence fallback timer expired on ring gfx
[drm] Fence fallback timer expired on ring comp_1.0.0
[drm] Fence fallback timer expired on ring comp_1.1.0
[drm] Fence fallback timer expired on ring comp_1.2.0
[drm] Fence fallback timer expired on ring comp_1.3.0
[drm] Fence fallback timer expired on ring comp_1.0.1
amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.1.1 (-110).
amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.2.1 (-110).
amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.3.1 (-110).
amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110).
amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma1 (-110).
amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd_enc_0.0 (-110).
amdgpu 0000:00:07.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-110).
[drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).

v2: replace cancel_delayed_work_sync() with flush_delayed_work()

Signed-off-by: Yintian Tao <yttao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-19 10:12:51 -05:00
Leo Liu
52f2e779ad drm/amdgpu: add driver support for JPEG2.0 and above
By using JPEG IP block type

Signed-off-by: Leo Liu <leo.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-19 10:12:50 -05:00
yu kuai
b8b7213057 drm/amdgpu: add function parameter description in 'amdgpu_device_set_cg_state'
Fixes gcc warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1954: warning: Function
parameter or member 'state' not described in 'amdgpu_device_set_cg_state'

Fixes: e3ecdffac9 ("drm/amdgpu: add documentation for amdgpu_device.c")
Signed-off-by: yu kuai <yukuai3@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-13 15:29:44 -05:00
Bhawanpreet Lakha
b86a1aa36a drm/amd/display: rename DCN1_0 kconfig to DCN
Since dcn20 and dcn21 are under dcn1 it doesnt make sense to
have it named dcn1.

Change it to "dcn" to make it generic

Signed-off-by: Bhawanpreet Lakha <Bhawanpreet.Lakha@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-13 15:29:44 -05:00
Bhawanpreet Lakha
aca935c7cc drm/amd/display: Drop CONFIG_DRM_AMD_DC_DCN2_1 flag
[Why]

DCN21 is stable enough to be build by default. So drop the flags.

[How]

Remove them using the unifdef tool. The following commands were executed
in sequence:

$ find -name '*.c' -exec unifdef -m -DCONFIG_DRM_AMD_DC_DCN2_1 -UCONFIG_TRIM_DRM_AMD_DC_DCN2_1 '{}' ';'
$ find -name '*.h' -exec unifdef -m -DCONFIG_DRM_AMD_DC_DCN2_1 -UCONFIG_TRIM_DRM_AMD_DC_DCN2_1 '{}' ';'

In addition:

* Remove from kconfig, and replace any dependencies with DCN1_0.
* Remove from any makefiles.
* Fix and cleanup Renoir definitions in dal_asic_id.h
* Expand DCN1 ifdef to include DCN21 code in the following files:
    * clk_mgr/clk_mgr.c: dc_clk_mgr_create()
    * core/dc_resources.c: dc_create_resource_pool()
    * gpio/hw_factory.c: dal_hw_factory_init()
    * gpio/hw_translate.c: dal_hw_translate_init()

Signed-off-by: Bhawanpreet Lakha <Bhawanpreet.Lakha@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-13 15:29:44 -05:00
Bhawanpreet Lakha
1da37801a8 drm/amd/display: Drop CONFIG_DRM_AMD_DC_DCN2_0 and DSC_SUPPORTED
[Why]

DCN2 and DSC are stable enough to be build by default. So drop the flags.

[How]

Remove them using the unifdef tool. The following commands were executed
in sequence:

$ find -name '*.c' -exec unifdef -m -DCONFIG_DRM_AMD_DC_DSC_SUPPORT -DCONFIG_DRM_AMD_DC_DCN2_0 -UCONFIG_TRIM_DRM_AMD_DC_DCN2_0 '{}' ';'
$ find -name '*.h' -exec unifdef -m -DCONFIG_DRM_AMD_DC_DSC_SUPPORT -DCONFIG_DRM_AMD_DC_DCN2_0 -UCONFIG_TRIM_DRM_AMD_DC_DCN2_0 '{}' ';'

In addition:

* Remove from kconfig, and replace any dependencies with DCN1_0.
* Remove from any makefiles.
* Fix and cleanup NV defninitions in dal_asic_id.h
* Expand DCN1 ifdef to include DCN2 code in the following files:
    * clk_mgr/clk_mgr.c: dc_clk_mgr_create()
    * core/dc_resources.c: dc_create_resource_pool()
    * dce/dce_dmcu.c: dcn20_*lock_phy()
    * dce/dce_dmcu.c: dcn20_funcs
    * dce/dce_dmcu.c: dcn20_dmcu_create()
    * gpio/hw_factory.c: dal_hw_factory_init()
    * gpio/hw_translate.c: dal_hw_translate_init()

Signed-off-by: Leo Li <sunpeng.li@amd.com>
Signed-off-by: Bhawanpreet Lakha <Bhawanpreet.Lakha@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-13 15:29:44 -05:00
Jesse Zhang
9f87516764 drm/amd/amdgpu: finish delay works before release resources
flush/cancel delayed works before doing finalization
to avoid concurrently requests.

Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-11 17:38:13 -05:00
Evan Quan
60599a0363 drm/amdgpu: perform p-state switch after the whole hive initialized
P-state switch should be performed after all devices from the hive
get initialized.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by:  Jonathan Kim <Jonathan.Kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-06 16:27:48 -05:00
Evan Quan
b0adca4d50 drm/amdgpu: register gpu instance before fan boost feature enablment
Otherwise, the feature enablement will be skipped due to wrong count.

Fixes: beff74bc6e ("drm/amdgpu: fix a race in GPU reset with IB test (v2)")
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-06 16:27:47 -05:00
Evan Quan
39ea6e5f9e drm/amdgpu: change pstate only after all XGMI device initialized
Pstate settings should be performed after all device of the
XGMI setup get initialized.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-06 16:27:46 -05:00
Le Ma
bff77e86a3 drm/amdgpu: bypass some cleanup work after err_event_athub (v2)
PSP lost connection when err_event_athub occurs. These cleanup work can be
skipped in BACO reset.

v2: squash in missing include (Alex)

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-30 11:06:52 -04:00
Wambui Karuga
f440ff44b1 drm/amd: correct "_LENTH" mispelling in constant
Correct the "_LENTH" mispelling in the AMDGPU_MAX_TIMEOUT_PARAM_LENGTH
constant.

Signed-off-by: Wambui Karuga <wambui.karugax@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-28 11:19:17 -04:00
Andrey Grodzovsky
121a2bc6ae drm/amdgpu: Move amdgpu_ras_recovery_init to after SMU ready.
For Arcturus the I2C traffic is done through SMU tables and so
we must postpone RAS recovery init to after they are ready
which is in amdgpu_device_ip_hw_init_phase2.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-25 16:50:10 -04:00
Tianci.Yin
e35e2b117f drm/amdgpu: add a generic fb accessing helper function(v3)
add a generic helper function for accessing framebuffer via MMIO

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Tianci.Yin <tianci.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-17 16:31:13 -04:00
Alex Deucher
897483d8a0 drm/amdgpu: move gpu reset out of amdgpu_device_suspend
Move it into the caller.  There are cases were we don't
want it.  We need it for hibernation, but we don't need
it for runtime pm, so drop it for runtime pm.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-15 15:51:39 -04:00
Alex Deucher
803cc26d5c drm/amdgpu: move pci_save_state into suspend path
for amdgpu_device_suspend.  This follows the logic
in the resume path.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-15 15:51:39 -04:00
Emily Deng
bcccee89f4 drm/amdgpu: Fix tdr3 could hang with slow compute issue
When index is 1, need to set compute ring timeout for sriov and passthrough.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-15 15:50:23 -04:00
Alex Deucher
71f98027f2 drm/amdgpu: move amdgpu_device_get_job_timeout_settings
It's only used in amdgpu_device.c and the naming also
reflects that.  Move it there.

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-07 15:10:20 -05:00
Lyude Paul
f8d2d39eb4 drm/amdgpu: Iterate through DRM connectors correctly
Currently, every single piece of code in amdgpu that loops through
connectors does it incorrectly and doesn't use the proper list iteration
helpers, drm_connector_list_iter_begin() and
drm_connector_list_iter_end(). Yeesh.

So, do that.

Cc: Juston Li <juston.li@intel.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Harry Wentland <hwentlan@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03 09:11:05 -05:00
Jesse Zhang
df99ac0fcc drm/amd/amdgpu:Fix compute ring unable to detect hang.
When compute fence did not signal, compute ring cannot detect hardware hang
because its timeout value is set to be infinite by default.

In SR-IOV and passthrough mode, if user does not declare custome timeout
value for compute ring, then use gfx ring timeout value as default. So
that when there is a ture hardware hang, compute ring can detect it.

Signed-off-by: Jesse Zhang <zhexi.zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03 09:11:01 -05:00
Xiaojie Yuan
ec51d3facd drm/amdgpu/discovery: get gpu info from ip discovery table
except soc_bounding_box which is not integrated in discovery table yet

Signed-off-by: Xiaojie Yuan <xiaojie.yuan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-03 09:10:59 -05:00
Evan Quan
0e0b89c0d7 drm/amd/powerplay: properly set mp1 state for SW SMU suspend/reset routine
Set mp1 state properly for SW SMU suspend/reset routine.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 10:10:58 -05:00
Shirish S
a35ad98bf9 drm/amdgpu: remove needless usage of #ifdef
define sched_policy in case CONFIG_HSA_AMD is not
enabled, with this there is no need to check for CONFIG_HSA_AMD
else where in driver code.

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Shirish S <shirish.s@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 10:05:20 -05:00
Shirish S
8c9f69bc5c drm/amdgpu: fix build error without CONFIG_HSA_AMD
If CONFIG_HSA_AMD is not set, build fails:

drivers/gpu/drm/amd/amdgpu/amdgpu_device.o: In function `amdgpu_device_ip_early_init':
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1626: undefined reference to `sched_policy'

Use CONFIG_HSA_AMD to guard this.

Fixes: 1abb680ad371 ("drm/amdgpu: disable gfxoff while use no H/W scheduling policy")

Signed-off-by: Shirish S <shirish.s@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 10:05:07 -05:00
Huang Rui
aa978594cf drm/amdgpu: disable gfxoff while use no H/W scheduling policy
While gfxoff is enabled, the mmVM_XXX registers will be 0xfffffff while the GFX
is in "off" state. KFD queue creattion doesn't use ring based method, so it will
trigger a VM fault.

Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 09:55:41 -05:00
Yong Zhao
4e66d7d215 drm/amdgpu: Add a kernel parameter for specifying the asic type
As more and more new asics start to reuse the old device IDs before
launch, there is a need to quickly override the existing asic type
corresponding to the reused device ID through a kernel parameter. With
this, engineers no longer need to rely on local hack patches,
facilitating cooperation across teams.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-16 09:54:25 -05:00
Tao Zhou
1a6fc071e1 drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place
ras recovery_init should be called after ttm init,
bad page reserve should be put in front of gpu reset since i2c
may be unstable during gpu reset.
add cleanup for recovery_init and recovery_fini

v2: add more comment and print.
    remove cancel_work_sync in recovery_init.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:50:47 -05:00
Yong Zhao
050091ab6e drm/amdkfd: Query kfd device info by CHIP id instead of pci device id
This optimizes out the pci device id usage in KFD and makes the code
more maintainable.

Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:49:45 -05:00
Andrey Grodzovsky
d5ea093eeb dmr/amdgpu: Add system auto reboot to RAS.
In case of RAS error allow user configure auto system
reboot through ras_ctrl.
This is also part of the temproray work around for the RAS
hang problem.

v4: Use latest kernel API for disk sync.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:41:17 -05:00
Andrey Grodzovsky
7c6e68c777 drm/amdgpu: Avoid HW GPU reset for RAS.
Problem:
Under certain conditions, when some IP bocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.

Temporary fix until proper solution in PSP/SMU is ready:
When uncorrectable error happens the DF will unconditionally
broadcast error event packets to all its clients/slave upon
receiving fatal error event and freeze all its outbound queues,
err_event_athub interrupt  will be triggered.
In such case and we use this interrupt
to issue GPU reset. THe GPU reset code is modified for such case to avoid HW
reset, only stops schedulers, deatches all in progress and not yet scheduled
job's fences, set error code on them and signals.
Also reject any new incoming job submissions from user space.
All this is done to notify the applications of the problem.

v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count

v3:
Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive memebers.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:41:05 -05:00
Andrey Grodzovsky
12ffa55da6 drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case.
Issue 1:
In  XGMI case amdgpu_device_lock_adev for other devices in hive
was called to late, after access to their repsective schedulers.
So relocate the lock to the begining of accessing the other devs.

Issue 2:
Using amdgpu_device_ip_need_full_reset to switch the device list from
all devices in hive to the single 'master' device who owns this reset
call is wrong because when stopping schedulers we iterate all the devices
in hive but when restarting we will only reactivate the 'master' device.
Also, in case amdgpu_device_pre_asic_reset conlcudes that full reset IS
needed we then have to stop schedulers for all devices in hive and not
only the 'master' but with amdgpu_device_ip_need_full_reset  we
already missed the opprotunity do to so. So just remove this logic and
always stop and start all schedulers for all devices in hive.

Also minor cleanup and print fix.

v4: Minor coding style fix.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-09-13 17:39:05 -05:00
Andrey Grodzovsky
0b2d2c2eec drm/amdgpu: Handle job is NULL use case in amdgpu_device_gpu_recover
This should be checked at all places job is accessed.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-30 15:02:39 -05:00
Roman Li
e1c14c4339 drm/amdgpu: Enable DC on Renoir
Enable DC support for renoir.

Acked-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Roman Li <Roman.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-29 15:52:34 -05:00