Commit Graph

9057 Commits

Author SHA1 Message Date
Andrey Grodzovsky
f89f8c6baf drm/amdgpu: Guard against write accesses after device removal
This should prevent writing to memory or IO ranges possibly
already allocated for other uses after our device is removed.

v5:
Protect more places wher memcopy_to/form_io takes place
Protect IB submissions

v6: Switch to !drm_dev_enter instead of scoping entire code
with brackets.

v7:
Drop guard of HW ring commands emission protection since they
are in GART and not in MMIO.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-10-andrey.grodzovsky@amd.com
2021-05-19 23:50:28 -04:00
Andrey Grodzovsky
35bba8313b drm/amdgpu: Convert driver sysfs attributes to static attributes
This allows to remove explicit creation and destruction
of those attrs and by this avoids warnings on device
finalizing post physical device extraction.

v5: Use newly added pci_driver.dev_groups directly

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-9-andrey.grodzovsky@amd.com
2021-05-19 23:50:27 -04:00
Andrey Grodzovsky
03f9016ed8 drm/amdgpu: Remap all page faults to per process dummy page.
On device removal reroute all CPU mappings to dummy page
per drm_file instance or imported GEM object.

v4:
Update for modified ttm_bo_vm_dummy_page

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-7-andrey.grodzovsky@amd.com
2021-05-19 23:50:27 -04:00
Andrey Grodzovsky
d10d0daa20 drm/amdgpu: Handle IOMMU enabled case.
Problem:
Handle all DMA IOMMU group related dependencies before the
group is removed. Those manifest themself in that when IOMMU
enabled DMA map/unmap is dependent on the presence of IOMMU
group the device belongs to but, this group is released once
the device is removed from PCI topology.

Fix:
Expedite all such unmap operations to pci remove driver callback.

v5: Drop IOMMU notifier and switch to lockless call to ttm_tt_unpopulate
v6: Drop the BO unamp list
v7:
Drop amdgpu_gart_fini
In amdgpu_ih_ring_fini do uncinditional  check (!ih->ring)
to avoid freeing uniniitalized rings.
Call amdgpu_ih_ring_fini unconditionally.
v8: Add deatiled explanation

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210517143851.475058-1-andrey.grodzovsky@amd.com
2021-05-19 23:50:27 -04:00
Andrey Grodzovsky
e9669fb782 drm/amdgpu: Add early fini callback
Use it to call disply code dependent on device->drv_data
before it's set to NULL on device unplug

v5:
Move HW finilization into this callback to prevent MMIO accesses
post cpi remove.

v7:
Split kfd suspend from device exit to expdite HW related
stuff to amdgpu_pci_remove

v8:
Squash previous KFD commit into this commit to avoid compile break.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210520032057.497334-1-andrey.grodzovsky@amd.com
2021-05-19 23:48:50 -04:00
Andrey Grodzovsky
72c8c97b15 drm/amdgpu: Split amdgpu_device_fini into early and late
Some of the stuff in amdgpu_device_fini such as HW interrupts
disable and pending fences finilization must be done right away on
pci_remove while most of the stuff which relates to finilizing and
releasing driver data structures can be kept until
drm_driver.release hook is called, i.e. when the last device
reference is dropped.

v4: Change functions prefix early->hw and late->sw

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-3-andrey.grodzovsky@amd.com
2021-05-19 23:45:49 -04:00
David M Nieto
5c439c38f5 drm/amdgpu: fix fence calculation (v2)
The proper metric for fence utilization over several
contexts is an harmonic mean, but such calculation is
prohibitive in kernel space, so the code approximates it.

Because the approximation diverges when one context has a
very small ratio compared with the other context, this change
filter out ratios smaller that 0.01%

v2: make the fence calculation static and initialize variables
within that function

v3: Fix warnings (Alex)

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: David M Nieto <david.nieto@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210513174539.27409-2-david.nieto@amd.com
2021-05-13 14:09:12 -04:00
David M Nieto
a7f0849682 drm/amdgpu: free resources on fence usage query
Free the resources if the fence needs to be ignored
during the ratio calculation

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: David M Nieto <david.nieto@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210513174539.27409-1-david.nieto@amd.com
2021-05-13 14:02:24 -04:00
Thomas Zimmermann
fd531024ba Merge drm/drm-next into drm-misc-next
Backmerging to get v5.12 fixes. Requested for vmwgfx.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2021-05-11 15:59:18 +02:00
Dave Airlie
0844708ac3 Merge tag 'amd-drm-fixes-5.13-2021-05-05' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-fixes-5.13-2021-05-05:

amdgpu:
- MPO hang workaround
- Fix for concurrent VM flushes on vega/navi
- dcefclk is not adjustable on navi1x and newer
- MST HPD debugfs fix
- Suspend/resumes fixes
- Register VGA clients late in case driver fails to load
- Fix GEM leak in user framebuffer create
- Add support for polaris12 with 32 bit memory interface
- Fix duplicate cursor issue when using overlay
- Fix corruption with tiled surfaces on VCN3
- Add BO size and stride check to fix BO size verification

radeon:
- Fix off-by-one in power state parsing
- Fix possible memory leak in power state parsing

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210506033929.3875-1-alexander.deucher@amd.com
2021-05-07 12:44:51 +10:00
Bas Nieuwenhuizen
234055fd97 drm/amdgpu: Use device specific BO size & stride check.
The builtin size check isn't really the right thing for AMD
modifiers due to a couple of reasons:

1) In the format structs we don't do set any of the tilesize / blocks
etc. to avoid having format arrays per modifier/GPU
2) The pitch on the main plane is pixel_pitch * bytes_per_pixel even
for tiled ...
3) The pitch for the DCC planes is really the pixel pitch of the main
surface that would be covered by it ...

Note that we only handle GFX9+ case but we do this after converting
the implicit modifier to an explicit modifier, so on GFX9+ all
framebuffers should be checked here.

There is a TODO about DCC alignment, but it isn't worse than before
and I'd need to dig a bunch into the specifics. Getting this out in
a reasonable timeframe to make sure it gets the appropriate testing
seemed more important.

Finally as I've found that debugging addfb2 failures is a pita I was
generous adding explicit error messages to every failure case.

Fixes: f258907fdd ("drm/amdgpu: Verify bo size can fit framebuffer size on init.")
Tested-by: Simon Ser <contact@emersion.fr>
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-05 23:09:54 -04:00
Bas Nieuwenhuizen
8bf073ca92 drm/amdgpu: Init GFX10_ADDR_CONFIG for VCN v3 in DPG mode.
Otherwise tiling modes that require the values form this field
(In particular _*_X) would be corrupted upon video decode.

Copied from the VCN v2 code.

Fixes: 99541f392b ("drm/amdgpu: add mc resume DPG mode for VCN3.0")
Reviewed-and-Tested by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2021-05-05 23:09:12 -04:00
Roy Sun
8744425411 drm/amdgpu: Add show_fdinfo() interface
Tracking devices, process info and fence info using
/proc/pid/fdinfo

Signed-off-by: David M Nieto <David.Nieto@amd.com>
Signed-off-by: Roy Sun <Roy.Sun@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210426062701.39732-2-Roy.Sun@amd.com
2021-05-05 09:26:53 +02:00
Evan Quan
c83c4e1912 drm/amdgpu: add new MC firmware for Polaris12 32bit ASIC
Polaris12 32bit ASIC needs a special MC firmware.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2021-05-04 17:35:25 -04:00
Christian König
d79025c7f5 drm/ttm: always initialize the full ttm_resource v2
Init all fields in ttm_resource_alloc() when we create a new resource.

v2: use place->mem_type instead of res->mem_type

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210430092508.60710-2-christian.koenig@amd.com
2021-05-03 12:50:41 +02:00
Simon Ser
e0c16eb4b3 amdgpu: fix GEM obj leak in amdgpu_display_user_framebuffer_create
This error code-path is missing a drm_gem_object_put call. Other
error code-paths are fine.

Signed-off-by: Simon Ser <contact@emersion.fr>
Fixes: 1769152ac6 ("drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Harry Wentland <hwentlan@amd.com>
Cc: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2021-04-29 00:04:27 -04:00
Kai-Heng Feng
8c3dd61cfa drm/amdgpu: Register VGA clients after init can no longer fail
When an amdgpu device fails to init, it makes another VGA device cause
kernel splat:
kernel: amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_init failed
kernel: amdgpu 0000:08:00.0: amdgpu: Fatal error during GPU init
kernel: amdgpu: probe of 0000:08:00.0 failed with error -110
...
kernel: amdgpu 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000018
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 0 P4D 0
kernel: Oops: 0000 [#1] SMP NOPTI
kernel: CPU: 6 PID: 1080 Comm: Xorg Tainted: G        W         5.12.0-rc8+ #12
kernel: Hardware name: HP HP EliteDesk 805 G6/872B, BIOS S09 Ver. 02.02.00 12/30/2020
kernel: RIP: 0010:amdgpu_device_vga_set_decode+0x13/0x30 [amdgpu]
kernel: Code: 06 31 c0 c3 b8 ea ff ff ff 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 8b 87 90 06 00 00 48 89 e5 53 89 f3 <48> 8b 40 18 40 0f b6 f6 e8 40 58 39 fd 80 fb 01 5b 5d 19 c0 83 e0
kernel: RSP: 0018:ffffae3c0246bd68 EFLAGS: 00010002
kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: RDX: ffff8dd1af5a8560 RSI: 0000000000000000 RDI: ffff8dce8c160000
kernel: RBP: ffffae3c0246bd70 R08: ffff8dd1af5985c0 R09: ffffae3c0246ba38
kernel: R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000246
kernel: R13: 0000000000000000 R14: 0000000000000003 R15: ffff8dce81490000
kernel: FS:  00007f9303d8fa40(0000) GS:ffff8dd1af580000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000018 CR3: 0000000103cfa000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel:  vga_arbiter_notify_clients.part.0+0x4a/0x80
kernel:  vga_get+0x17f/0x1c0
kernel:  vga_arb_write+0x121/0x6a0
kernel:  ? apparmor_file_permission+0x1c/0x20
kernel:  ? security_file_permission+0x30/0x180
kernel:  vfs_write+0xca/0x280
kernel:  ksys_write+0x67/0xe0
kernel:  __x64_sys_write+0x1a/0x20
kernel:  do_syscall_64+0x38/0x90
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f93041e02f7
kernel: Code: 75 05 48 83 c4 58 c3 e8 f7 33 ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
kernel: RSP: 002b:00007fff60e49b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f93041e02f7
kernel: RDX: 000000000000000b RSI: 00007fff60e49b40 RDI: 000000000000000f
kernel: RBP: 00007fff60e49b40 R08: 00000000ffffffff R09: 00007fff60e499d0
kernel: R10: 00007f93049350b5 R11: 0000000000000246 R12: 000056111d45e808
kernel: R13: 0000000000000000 R14: 000056111d45e7f8 R15: 000056111d46c980
kernel: Modules linked in: nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_seq input_leds snd_seq_device snd_timer snd soundcore joydev kvm_amd serio_raw k10temp mac_hid hp_wmi ccp kvm sparse_keymap wmi_bmof ucsi_acpi efi_pstore typec_ucsi rapl typec video wmi sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log hid_generic usbhid hid amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul fb_sys_fops ghash_clmulni_intel cec rc_core aesni_intel crypto_simd psmouse cryptd r8169 i2c_piix4 drm ahci xhci_pci realtek libahci xhci_pci_renesas gpio_amdpt gpio_generic
kernel: CR2: 0000000000000018
kernel: ---[ end trace 76d04313d4214c51 ]---

Commit 4192f7b576 ("drm/amdgpu: unmap register bar on device init
failure") makes amdgpu_driver_unload_kms() skips amdgpu_device_fini(),
so the VGA clients remain registered. So when
vga_arbiter_notify_clients() iterates over registered clients, it causes
NULL pointer dereference.

Since there's no reason to register VGA clients that early, so solve
the issue by putting them after all the goto cleanups.

v2:
 - Remove redundant vga_switcheroo cleanup in failed: label.

Fixes: 4192f7b576 ("drm/amdgpu: unmap register bar on device init failure")
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-29 00:03:21 -04:00
Pavan Kumar Ramayanam
b45aeb2dea drm/amdgpu: Handling of amdgpu_device_resume return value for graceful teardown
The runtime resume PM op disregards the return value from
amdgpu_device_resume(), masking errors for failed resumes at the PM
layer.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Pavan Kumar Ramayanam <pavan.ramayanam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-29 00:03:11 -04:00
Victor Zhao
4b12ee6f42 drm/amdgpu: fix r initial values
Sriov gets suspend of IP block <dce_virtual> failed as return
value was not initialized.

v2: return 0 directly to align original code semantic before this
was broken out into a separate helper function instead of setting
initial values

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2021-04-29 00:02:40 -04:00
Christian König
20a5f5a98e drm/amdgpu: fix concurrent VM flushes on Vega/Navi v2
Starting with Vega the hardware supports concurrent flushes
of VMID which can be used to implement per process VMID
allocation.

But concurrent flushes are mutual exclusive with back to
back VMID allocations, fix this to avoid a VMID used in
two ways at the same time.

v2: don't set ring to NULL

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2021-04-28 23:45:54 -04:00
Lyude Paul
0c4fada608 drm/dp: Pass drm_dp_aux to drm_dp*_link_train_channel_eq_delay()
So that we can start using drm_dbg_*() for
drm_dp_link_train_channel_eq_delay() and
drm_dp_lttpr_link_train_channel_eq_delay().

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210423184309.207645-7-lyude@redhat.com
Reviewed-by: Dave Airlie <airlied@redhat.com>
2021-04-27 18:43:42 -04:00
Lyude Paul
9e98666644 drm/dp: Pass drm_dp_aux to drm_dp_link_train_clock_recovery_delay()
So that we can start using drm_dbg_*() in
drm_dp_link_train_clock_recovery_delay().

Signed-off-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210423184309.207645-6-lyude@redhat.com
Reviewed-by: Dave Airlie <airlied@redhat.com>
2021-04-27 18:43:42 -04:00
Lyude Paul
6cba3fe433 drm/dp: Add backpointer to drm_device in drm_dp_aux
This is something that we've wanted for a while now: the ability to
actually look up the respective drm_device for a given drm_dp_aux struct.
This will also allow us to transition over to using the drm_dbg_*() helpers
for debug message printing, as we'll finally have a drm_device to reference
for doing so.

Note that there is one limitation with this - because some DP AUX adapters
exist as platform devices which are initialized independently of their
respective DRM devices, one cannot rely on drm_dp_aux->drm_dev to always be
non-NULL until drm_dp_aux_register() has been called. We make sure to point
this out in the documentation for struct drm_dp_aux.

v3:
* Add WARN_ON_ONCE() to drm_dp_aux_register() if drm_dev isn't filled out

Signed-off-by: Lyude Paul <lyude@redhat.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210423184309.207645-4-lyude@redhat.com
Reviewed-by: Dave Airlie <airlied@redhat.com>
2021-04-27 18:43:42 -04:00
Maxime Ripard
355b602961
Merge drm/drm-next into drm-misc-next
Christian needs some patches from drm/next

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
2021-04-26 14:03:09 +02:00
Christian König
c777dc9e79 drm/ttm: move the page_alignment into the BO v2
The alignment is a constant property and shouldn't change.

v2: move documentation as well as suggested by Matthew.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210413135248.1266-4-christian.koenig@amd.com
2021-04-23 16:23:02 +02:00
Alex Deucher
7845d80dda drm/amdgpu/gmc9: remove dummy read workaround for newer chips
Aldebaran has a hw fix so no longer requires the workaround.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:45:36 -04:00
Jinzhou Su
5c88e3b86a drm/amdgpu: Add mem sync flag for IB allocated by SA
The buffer of SA bo will be used by many cases. So it's better
to invalidate the cache of indirect buffer allocated by SA before
commit the IB.

Signed-off-by: Jinzhou Su <Jinzhou.Su@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:45:24 -04:00
Mukul Joshi
ceb47e0d84 drm/amdgpu: Fix SDMA RAS error reporting on Aldebaran
Fix the following issues with SDMA RAS error reporting:
1. Read the EDC_COUNTER2 register also to fetch error counts
   for all sub-blocks in SDMA.
2. SDMA RAS on Aldebaran suports single-bit uncorrectable errors
   only. So, report error count in UE count instead of CE count.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-By: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:45:17 -04:00
Mukul Joshi
1f0d8e3781 drm/amdgpu: Reset RAS error count and status regs
Reset the RAS error count and error status registers after
reading to prevent over reporting error counts on Aldebaran.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-By: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:45:10 -04:00
Oak Zeng
5f41741a6d Revert "drm/amdgpu: workaround the TMR MC address issue (v2)"
This reverts commit 2f055097da.
2f055097da was a driver workaround
when PSP firmware was not ready. Now the PSP fw is ready so we
revert this driver workaround.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:43:49 -04:00
Jiansong Chen
7c49ee9ec5 drm/amdgpu: fix GCR_GENERAL_CNTL offset for dimgrey_cavefish
dimgrey_cavefish has similar gc_10_3 ip with sienna_cichlid,
so follow its registers offset setting.

Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:36:28 -04:00
John Clements
f9727922fc drm/amdgpu: resolve erroneous gfx_v9_4_2 prints
resolve bug on aldebaran where gfx error counts will
print on driver load when there are no errors present

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:36:10 -04:00
Dennis Li
6df23f4c5c drm/amdgpu: fix a error injection failed issue
because "sscanf(str, "retire_page")" always return 0, if application use
the raw data for error injection, it always wrongly falls into "op ==
3". Change to use strstr instead.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:36:01 -04:00
Hawking Zhang
1f8d3ad2a0 drm/amdgpu: only harvest gcea/mmea error status in aldebaran
In aldebaran, driver only needs to harvest SDP
RdRspStatus, WrRspStatus and first parity error
on RdRsp data. Check error type before harvest
error information.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:35:55 -04:00
Hawking Zhang
53ee6609b4 drm/amdgpu: only harvest gcea/mmea error status in arcturus
SDP RdRspStatus/WrRspStatus or first parity error on
RdRsp data can cause system fatal error in arcturus.
GPU will be freezed in such case.

Driver needs to harvest these error information before
reset the GPU. Check error type to avoid harvest normal
gcea/mmea information.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:35:45 -04:00
Huang Rui
9406d39bb6 drm/amdgpu: enable tmz on renoir asics
The tmz functions are verified on renoir chips as well. So enable it by
default.

Signed-off-by: Huang Rui <ray.huang@amd.com>
Tested-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:35:36 -04:00
Hawking Zhang
28a5d7a589 drm/amdgpu: correct default gfx wdt timeout setting
When gfx wdt was configured to fatal_disable, the
timeout period should be configured to 0x0 (timeout
disabled)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:35:29 -04:00
Christian König
ce4528daf5 drm/amdgpu: check base size instead of mem.num_pages
Drop some ussage of mem in the code.

Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210413135248.1266-2-christian.koenig@amd.com
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
2021-04-19 15:05:23 +02:00
Christian König
e2ac853156 drm/amdgpu: freeing pinned objects is illegal now
We want to drop support in TTM for this.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210415084730.2057-2-christian.koenig@amd.com
2021-04-19 14:20:36 +02:00
Christian König
2f40801dc5 drm/amdgpu: make sure we unpin the UVD BO
Releasing pinned BOs is illegal now.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Leo Liu <leo.liu@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210415084730.2057-1-christian.koenig@amd.com
2021-04-19 14:15:50 +02:00
Dan Carpenter
90cb3d8aca drm/amdgpu: fix an error code in init_pmu_entry_by_type_and_add()
If the kmemdup() fails then this should return a negative error code
but it currently returns success

Fixes: b4a7db71ea ("drm/amdgpu: add per device user friendly xgmi events for vega20")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:45 -04:00
Qingqing Zhuo
fe18017839 Revert "Revert "drm/amdgpu: Ensure that the modifier requested is supported by plane.""
This reverts commit 55fa622fe6.

The regression caused by the original patch has been
cleared, thus introduce back the change.

Signed-off-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:45 -04:00
Joseph Greathouse
47e5d79a45 drm/amdgpu: Copy MEC FW version to MEC2 if we skipped loading MEC2
If we skipped loading MEC2 firmware separately
from MEC, then MEC2 will be running the same
firmware image. Copy the MEC version and feature
numbers into MEC2 version and feature numbers.
This is needed for things like GWS support, where
we rely on knowing what version of firmware is
running on MEC2. Leaving these MEC2 entries blank
breaks our ability to version-check enables and
workarounds.

Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:45 -04:00
Felix Kuehling
f45e6b9d03 drm/amdkfd: Remove legacy code not acquiring VMs
ROCm user mode has acquired VMs from DRM file descriptors for as long
as it supported the upstream KFD. Legacy code to support older versions
of ROCm is not needed any more.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Ramesh Errabolu
ba5b662c36 drm/amdgpu: Use iterator methods exposed by amdgpu_res_cursor.h in building SG_TABLE's for a VRAM BO
Extend current implementation of SG_TABLE construction method to
allow exportation of sub-buffers of a VRAM BO. This capability will
enable logical partitioning of a VRAM BO into multiple non-overlapping
sub-buffers. One example of this use case is to partition a VRAM BO
into two sub-buffers, one for SRC and another for DST.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Luben Tuikov
546aa546b0 drm/amdgpu: Add double-sscanf but invert
Add back the double-sscanf so that both decimal
and hexadecimal values could be read in, but this
time invert the scan so that hexadecimal format
with a leading 0x is tried first, and if that
fails, then try decimal format.

Also use a logical-AND instead of nesting double
if-conditional.

See commit "drm/amdgpu: Fix a bug for input with double sscanf"

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Kenneth Feng
b960cb25b1 drm/amd/amdgpu: add ASPM support on polaris
add ASPM support on polaris

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Kenneth Feng
9d015c0dae drm/amd/amdgpu: enable ASPM on vega
enable ASPM on vega to save the power
without the performance hurt.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Kenneth Feng
3273f8b9e6 drm/amd/amdgpu: enable ASPM on navi1x
enable ASPM on navi1x for the benifit of system power consumption
without performance hurt.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Jack Zhang
d4abd00663 drm/amd/sriov no need to config GECC for sriov
No need to config GECC feature here for sriov
Leave the host drvier to do the configuration job.

Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00