Update context and full GPU reset to work with multi-lrc. The idea is
parent context tracks all the active requests inflight for itself and
its children. The parent context owns the reset replaying / canceling
requests as needed.
v2:
(John Harrison)
- Simply loop in find active request
- Add comments to find ative request / reset loop
v3:
(John Harrison)
- s/its'/its/g
- Fix comment when searching for active request
- Reorder if state in __guc_reset_context
v4:
(Kernel test robot)
- Delete unused is_multi_lrc function
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-15-matthew.brost@intel.com
Implement multi-lrc submission via a single workqueue entry and single
H2G. The workqueue entry contains an updated tail value for each
request, of all the contexts in the multi-lrc submission, and updates
these values simultaneously. As such, the tasklet and bypass path have
been updated to coalesce requests into a single submission.
v2:
(John Harrison)
- s/wqe/wqi
- Use FIELD_PREP macros
- Add GEM_BUG_ONs ensures length fits within field
- Add comment / white space to intel_guc_write_barrier
(Kernel test robot)
- Make need_tasklet a static function
v3:
(Docs)
- A comment for submission_stall_reason
v4:
(Kernel test robot)
- Initialize return value in bypass tasklt submit function
(John Harrison)
- Add comment near work queue defs
- Add BUILD_BUG_ON to ensure WQ_SIZE is a power of 2
- Update write_barrier comment to talk about work queue
v5:
(John Harrison)
- Fix typo in work queue comment
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-13-matthew.brost@intel.com
Assign contexts in parent-child relationship consecutive guc_ids. This
is accomplished by partitioning guc_id space between ones that need to
be consecutive (1/16 available guc_ids) and ones that do not (15/16 of
available guc_ids). The consecutive search is implemented via the bitmap
API.
This is a precursor to the full GuC multi-lrc implementation but aligns
to how GuC mutli-lrc interface is defined - guc_ids must be consecutive
when using the GuC multi-lrc interface.
v2:
(Daniel Vetter)
- Explicitly state why we assign consecutive guc_ids
v3:
(John Harrison)
- Bring back in spin lock
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-11-matthew.brost@intel.com
Add multi-lrc context registration H2G. In addition a workqueue and
process descriptor are setup during multi-lrc context registration as
these data structures are needed for multi-lrc submission.
v2:
(John Harrison)
- Move GuC specific fields into sub-struct
- Clean up WQ defines
- Add comment explaining math to derive WQ / PD address
v3:
(John Harrison)
- Add PARENT_SCRATCH_SIZE define
- Update comment explaining multi-lrc register
v4:
(John Harrison)
- Move PARENT_SCRATCH_SIZE to common file
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-9-matthew.brost@intel.com
Introduce context parent-child relationship. Once this relationship is
created all pinning / unpinning operations are directed to the parent
context. The parent context is responsible for pinning all of its
children and itself.
This is a precursor to the full GuC multi-lrc implementation but aligns
to how GuC mutli-lrc interface is defined - a single H2G is used
register / deregister all of the contexts simultaneously.
Subsequent patches in the series will implement the pinning / unpinning
operations for parent / child contexts.
v2:
(Daniel Vetter)
- Add kernel doc, add wrapper to access parent to ensure safety
v3:
(John Harrison)
- Fix comment explaing GEM_BUG_ON in to_parent()
- Make variable names generic (non-GuC specific)
v4:
(John Harrison)
- s/its'/its/g
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-8-matthew.brost@intel.com
Calling switch_to_kernel_context isn't needed if the engine PM reference
is taken while all user contexts are pinned as if don't have PM ref that
guarantees that all user contexts scheduling is disabled. By not calling
switch_to_kernel_context we save on issuing a request to the engine.
v2:
(Daniel Vetter)
- Add FIXME comment about pushing switch_to_kernel_context to backend
v3:
(John Harrison)
- Update commit message
- Fix workding comment
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-5-matthew.brost@intel.com
Taking a PM reference to prevent intel_gt_wait_for_idle from short
circuiting while any user context has scheduling enabled. Returning GT
idle when it is not can cause all sorts of issues throughout the stack.
v2:
(Daniel Vetter)
- Add might_lock annotations to pin / unpin function
v3:
(CI)
- Drop intel_engine_pm_might_put from unpin path as an async put is
used
v4:
(John Harrison)
- Make intel_engine_pm_might_get/put work with GuC virtual engines
- Update commit message
v5:
- Update commit message again
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-4-matthew.brost@intel.com
Taking a PM reference to prevent intel_gt_wait_for_idle from short
circuiting while a deregister context H2G is in flight. To do this must
issue the deregister H2G from a worker as context can be destroyed from
an atomic context and taking GT PM ref blows up. Previously we took a
runtime PM from this atomic context which worked but will stop working
once runtime pm autosuspend in enabled.
So this patch is two fold, stop intel_gt_wait_for_idle from short
circuting and fix runtime pm autosuspend.
v2:
(John Harrison)
- Split structure changes out in different patch
(Tvrtko)
- Don't drop lock in deregister_destroyed_contexts
v3:
(John Harrison)
- Flush destroyed contexts before destroying context reg pool
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-3-matthew.brost@intel.com
Commit 64f7c698be ("drm/nouveau/fifo: add engine_id hook") replaced
fifo/chang84.c g84_fifo_chan_engine() call with an indirect call of
fifo/g84.c g84_fifo_engine_id(). The G84_FIFO_ENGN_* values returned
from the later g84_fifo_engine_id() are incremented by 1 compared to
the previous g84_fifo_chan_engine() return values.
This is fine either way for most of the code, except this one line
where an engine bit programmed into the hardware is derived from the
return value. Decrement the return value accordingly, otherwise the
wrong engine bit is programmed into the hardware and that leads to
the following failure:
nouveau 0000:01:00.0: gr: 00000030 [ILLEGAL_MTHD ILLEGAL_CLASS] ch 1 [003fbce000 DRM] subc 3 class 0000 mthd 085c data 00000420
On the following hardware:
lspci -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation GT216GLM [Quadro FX 880M] (rev a2)
lspci -ns 01:00.0
01:00.0 0300: 10de:0a3c (rev a2)
Fixes: 64f7c698be ("drm/nouveau/fifo: add engine_id hook")
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: <stable@vger.kernel.org> # 5.12+
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211007214117.231472-1-marex@denx.de
Signed-off-by: Dave Airlie <airlied@redhat.com>
Hyper-V supports a hardware cursor feature. It is not used by Linux VM,
but the Hyper-V host still draws a point as an extra mouse pointer,
which is unwanted, especially when Xorg is running.
The hyperv_fb driver uses synthvid_send_ptr() to hide the unwanted pointer.
When the hyperv_drm driver was developed, the function synthvid_send_ptr()
was not copied from the hyperv_fb driver. Fix the issue by adding the
function into hyperv_drm.
Fixes: 76c56a5aff ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Deepak Rawat <drawat.floss@gmail.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210916193644.45650-1-decui@microsoft.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
In commit e11f5bd822 ("drm: Add support for DP 1.4 Compliance edid
corruption test") the function connector_bad_edid() started assuming
that the memory for the EDID passed to it was big enough to hold
`edid[0x7e] + 1` blocks of data (1 extra for the base block). It
completely ignored the fact that the function was passed `num_blocks`
which indicated how much memory had been allocated for the EDID.
Let's fix this by adding a bounds check.
This is important for handling the case where there's an error in the
first block of the EDID. In that case we will call
connector_bad_edid() without having re-allocated memory based on
`edid[0x7e]`.
Fixes: e11f5bd822 ("drm: Add support for DP 1.4 Compliance edid corruption test")
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211005192905.v2.1.Ib059f9c23c2611cb5a9d760e7d0a700c1295928d@changeid
Signed-off-by: Dave Airlie <airlied@redhat.com>
5.15-rc1 crashes with blank screen when booting up on two ThinkPads
using i915. Bisections converge convincingly, but arrive at different
and surprising "culprits", none of them the actual culprit.
netconsole (with init_netconsole() hacked to call i915_init() when
logging has started, instead of by module_init()) tells the story:
kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
function needs to be 4-byte aligned.
v2:
(Jani Nikula)
- Change BUG_ON to WARN_ON
v3:
(Jani / Tvrtko)
- Short circuit __i915_sw_fence_init on WARN_ON
v4:
(Lucas)
- Break WARN_ON changes out in a different patch
Fixes: 62eaf0ae21 ("drm/i915/guc: Support request cancellation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210922015039.26411-1-matthew.brost@intel.com
Missed a few asics.
v2: update comment
Fixes: 82d05736c4 ("drm/amdgpu/amdgpu_psp: convert to IP version checking")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
VEGA20 is 11.0.2, but it's handled by powerplay, not
swsmu.
Fixes: a8967967f6 ("drm/amdgpu/amdgpu_smu: convert to IP version checking")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
When creating unregistered new svm range to recover retry fault, avoid
new svm range to overlap with ranges or userptr ranges managed by TTM,
otherwise svm migration will trigger TTM or userptr eviction, to evict
user queues unexpectedly.
Change helper amdgpu_ttm_tt_affect_userptr to return userptr which is
inside the range. Add helper svm_range_check_vm_userptr to scan all
userptr of the vm, and return overlap userptr bo start, last.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Size can be any value and is user controlled resulting in overwriting the
40 byte array wr_buf with an arbitrary length of data from buf.
Signed-off-by: Thelford Williams <tdwilliamsiv@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Was missed in the conversion to IP version checking.
Fixes: af3b89d3a6 ("drm/amdgpu/smu11.0: convert to IP version checking")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
migrate_vma_setup may return cpages 0, means 0 page can be migrated,
treat this as error case to skip the rest of vma migration steps.
Change svm_migrate_vma_to_vram and svm_migrate_vma_to_ram to return the
number of pages migrated successfully or error code. The caller add up
all the successful migration pages and update prange->actual_loc only if
the total migrated pages is not 0.
This also removes the warning message "VRAM BO missing during
validation" if migration cpages is 0.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No function change, use pr_debug_ratelimited to avoid per page debug
message overflowing dmesg buf and console log.
use dev_err to show error message from unexpected situation, to provide
clue to help debug without enabling dynamic debug log. Define dev_fmt to
output function name in error message.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
when smu->adev->pm.ac_power == 0, message parameter with bit 16 set is saved
to smu->current_power_limit.
Fixes: 0cb4c62125 ("drm/amd/pm: correct power limit setting for SMU V11)"
Signed-off-by: Darren Powell <darren.powell@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
During mode2 reset, the GPU is temporarily removed from the
mgpu_info list. As a result, page retirement fails because it
cannot find the GPU in the GPU list.
To fix this, create our own list of GPUs that support MCE notifier
based page retirement and use that list to check if the UMC error
occurred on a GPU that supports MCE notifier based page retirement.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>