When debugging GPU hangs Mesa developers are finding it useful to replay
the captured error state against the simulator. But due various simulator
limitations which prevent replicating all hangs, one step further is being
able to replay against a real GPU.
This is almost doable today with the missing part being able to upload the
captured context image into the driver state prior to executing the
uploaded hanging batch and all the buffers.
To enable this last part we add a new context parameter called
I915_CONTEXT_PARAM_CONTEXT_IMAGE. It follows the existing SSEU
configuration pattern of being able to select which context to apply
against, paired with the actual image and its size.
Since this is adding a new concept of debug only uapi, we hide it behind
a new kconfig option and also require activation with a module parameter.
Together with a warning banner printed at driver load, all those combined
should be sufficient to guard against inadvertently enabling the feature.
In terms of implementation we allow the legacy context set param to be
used since that removes the need to record the per context data in the
proto context, while still allowing flexibility of specifying context
images for any context.
Mesa MR using the uapi can be seen at:
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27594
v2:
* Fix whitespace alignment as per checkpatch.
* Added warning on userspace misuse.
* Rebase for extracting ce->default_state shadowing.
v3:
* Rebase for I915_CONTEXT_PARAM_LOW_LATENCY.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Carlos Santa <carlos.santa@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Tested-by: Carlos Santa <carlos.santa@intel.com>
Signed-off-by: Tvrtko Ursulin <tursulin@igalia.com>
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Link: https://patchwork.freedesktop.org/patch/msgid/20240514145939.87427-2-tursulin@igalia.com
GPU accumulates the context runtime in a 32 bit counter - CTX_TIMESTAMP
in the context image. This value is saved/restored on context switches.
KMD accumulates these values into a 64 bit counter taking care of any
overflows as needed. This count provides the basis for client specific
busyness in the fdinfo interface.
KMD accumulation happens just before the context is unpinned and when
context switches out. This works for execlist back-end since execlist
scheduling has visibility into context switches. With GuC mode, KMD does
not have visibility into context switches and this counter is
accumulated only when context is unpinned. Context is unpinned once the
context scheduling is successfully disabled. Disabling context
scheduling is an asynchronous operation. Also if a context is servicing
frequent requests, scheduling may never be disabled on it.
For GuC mode, since updates to the context runtime may be delayed, add
hooks to update the context runtime in a worker thread as well as when
a user queries for it.
Limitation:
- If a context is never switched out or runs for a long period of time,
the runtime value of CTX_TIMESTAMP may never be updated, so the
counter value may be unreliable. This patch does not support such
cases. Such support must be available from the GuC FW and it is WIP.
This patch is an extract from previous work authored by John/Umesh here -
https://patchwork.freedesktop.org/patch/496441/?series=105085&rev=4
v2: (Ashutosh)
- Drop COPS_RUNTIME_ACTIVE_TOTAL
- s/guc_context_update_clks/__guc_context_update_stats
- Pin context before accessing in guc_timestamp_ping
- In guc_context_unpin, use spinlock to serialize access to runtime stats
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Co-developed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230427224705.2785566-2-umesh.nerlige.ramappa@intel.com
When GuC support was added to error capture, the reference counting
around the request object was broken. Fix it up.
The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count but within the
spinlock not outside it.
The only other caller of the context based search is the code for
dumping engine state to debugfs. That code wasn't previously getting
an explicit reference at all as it does everything while holding the
execlist specific spinlock. So, that needs updaing as well as that
spinlock doesn't help when using GuC submission. Rather than trying to
conditionally get/put depending on submission model, just change it to
always do the get/put.
v2: Explicitly document adding an extra blank line in some dense code
(Andy Shevchenko). Fix multiple potential null pointer derefs in case
of no request found (some spotted by Tvrtko, but there was more!).
Also fix a leaked request in case of !started and another in
__guc_reset_context now that intel_context_find_active_request is
actually reference counting the returned request.
v3: Add a _get suffix to intel_context_find_active_request now that it
grabs a reference (Daniele).
v4: Split the intel_guc_find_hung_context change to a separate patch
and rename intel_context_find_active_request_get to
intel_context_get_active_request (Tvrtko).
v5: s/locking/reference counting/ in commit message (Tvrtko)
Fixes: dc0dad365c ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126ae ("drm/i915/guc: Capture error state on context reset")
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Michael Cheng <michael.cheng@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230127002842.3169194-3-John.C.Harrison@Intel.com
Patch which added graceful exit for non-persistent contexts missed the
fact it is not enough to set the exiting flag on a context and let the
backend handle it from there.
GuC backend cannot handle it because it runs independently in the
firmware and driver might not see the requests ever again. Patch also
missed the fact some usages of intel_context_is_banned in the GuC backend
needed replacing with newly introduced intel_context_is_schedulable.
Fix the first issue by calling into backend revoke when we know this is
the last chance to do it. Fix the second issue by replacing
intel_context_is_banned with intel_context_is_schedulable, which should
always be safe since latter is a superset of the former.
v2:
* Just call ce->ops->revoke unconditionally. (Andrzej)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: 45c64ecf97 ("drm/i915: Improve user experience and driver robustness under SIGINT or similar")
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: <stable@vger.kernel.org> # v6.0+
Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221003121630.694249-1-tvrtko.ursulin@linux.intel.com
We have long standing customer complaints that pressing Ctrl-C (or to the
effect of) causes engine resets with otherwise well behaving programs.
Not only is logging engine resets during normal operation not desirable
since it creates support incidents, but more fundamentally we should avoid
going the engine reset path when we can since any engine reset introduces
a chance of harming an innocent context.
Reason for this undesirable behaviour is that the driver currently does
not distinguish between banned contexts and non-persistent contexts which
have been closed.
To fix this we add the distinction between the two reasons for revoking
contexts, which then allows the strict timeout only be applied to banned,
while innocent contexts (well behaving) can preempt cleanly and exit
without triggering the engine reset path.
Note that the added context exiting category applies both to closed non-
persistent context, and any exiting context when hangcheck has been
disabled by the user.
At the same time we rename the backend operation from 'ban' to 'revoke'
which more accurately describes the actual semantics. (There is no ban at
the backend level since banning is a concept driven by the scheduling
frontend. Backends are simply able to revoke a running context so that
is the more appropriate name chosen.)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220527072452.2225610-1-tvrtko.ursulin@linux.intel.com
Track context active (on hardware) status together with the start
timestamp.
This will be used to provide better granularity of context
runtime reporting in conjunction with already tracked pphwsp accumulated
runtime.
The latter is only updated on context save so does not give us visibility
to any currently executing work.
As part of the patch the existing runtime tracking data is moved under the
new ce->stats member and updated under the seqlock. This provides the
ability to atomically read out accumulated plus active runtime.
v2:
* Rename and make __intel_context_get_active_time unlocked.
v3:
* Use GRAPHICS_VER.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> # v1
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220401142205.3123159-6-tvrtko.ursulin@linux.intel.com
The LRC descriptor pool is going away. So, stop using it as the limit
for how many context ids are available. Instead, size the pool
according to the number of contexts allowed. Note that this is just a
naming change, the actual limit is identical in value.
While at it, also update a kzalloc(sizeof()*count) to be a
kcalloc(count,size).
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220302003357.4188363-4-John.C.Harrison@Intel.com
A weak implementation of parallel submission (multi-bb execbuf IOCTL) for
execlists. Doing as little as possible to support this interface for
execlists - basically just passing submit fences between each request
generated and virtual engines are not allowed. This is on par with what
is there for the existing (hopefully soon deprecated) bonding interface.
We perma-pin these execlists contexts to align with GuC implementation.
v2:
(John Harrison)
- Drop siblings array as num_siblings must be 1
v3:
(John Harrison)
- Drop single submission
v4:
(John Harrison)
- Actually drop single submission
- Use IS_ERR check on return value from intel_context_create
- Set last request to NULL on unpin
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211222223532.28698-1-matthew.brost@intel.com
Rather than stealing bits from i915_sw_fence function pointer use
separate fields for function pointer and flags. If using two different
fields, the 4 byte alignment for the i915_sw_fence function pointer can
also be dropped.
v2:
(CI)
- Set new function field rather than flags in __i915_sw_fence_init
v3:
(Tvrtko)
- Remove BUG_ON(!fence->flags) in reinit as that will now blow up
- Only define fence->flags if CONFIG_DRM_I915_SW_FENCE_CHECK_DAG is
defined
v4:
- Rebase, resend for CI
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211116194929.10211-1-matthew.brost@intel.com
In intel_context_do_pin_ww, when calling into the pre_pin hook(which is
passed the ww context) it could in theory return -EDEADLK(which is very
likely with debug kernels), once we start adding more ww locking in there,
like in the next patch. If so then we need to be mindful of having to
restart the do_pin at this point.
If this is the kernel_context, or some other early in-kernel context
where we have yet to setup the default_state, then we always inhibit the
context restore, and instead rely on the delayed active_release to set
the CONTEXT_VALID_BIT for us(if we even care), which should indicate
that we have context switched away, and that our newly saved context
state should now be valid. However, since we currently grab the active
reference before the potential ww dance, we can end up setting the
CONTEXT_VALID_BIT much too early, if we need to backoff, and then upon
re-trying the do_pin, we could potentially cause the hardware to
incorrectly load some garbage context state when later context switching
to that context, but at the very least this will trigger the
GEM_BUG_ON() in __engine_unpark. For now let's just move any ww dance
stuff prior to arming the active reference.
For normal user contexts this shouldn't be a concern, since we should
already have the default_state ready when initialising the lrc state,
and so there should be no concern with active_release somehow
prematurely setting the CONTEXT_VALID_BIT.
v2(Thomas):
- Also re-order the onion unwind
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211117142024.1043017-1-matthew.auld@intel.com
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmF298ceHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGIJYH/1rsEFQQ6caeQdy1
z9eFIe48DNM4l7bFk+qEj2UAbzPdahVJ299Mg5fW0n2CDemOc9/n0b9TxQ37YObi
mOzu0xwJVupIxkyFMPQSSc2q8aLm67NSpJy08DsmaNses5hSvu8x15RPHLQTybjt
SwtKns+jpCq79P1GWbrB5e5UkLb0VNoxNp4L1U4pMrYGcEkJUXbaxNY2V/JcXdM7
Vtn+qN0T/J6V6QVftv0t8Ecj3bjEnmL3kZHaTaNg3dGeKRpCGyHc5lcBQ0cNFG6t
vjZ9VbuhBzGI3TN2tHH5hpA1UXo7HPBBCwQqxF1jeGLGHULikYwZ3TAPWqL3QZqC
9cxr9SY=
=p75d
-----END PGP SIGNATURE-----
BackMerge tag 'v5.15-rc7' into drm-next
The msm next tree is based on rc3, so let's just backmerge rc7 before pulling it in.
Signed-off-by: Dave Airlie <airlied@redhat.com>
For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
mid BB. To safely enable preemption at the BB boundary, a handshake
between parent and child is needed, syncing the set of BBs at the
beginning and end of each batch. This is implemented via custom
emit_bb_start & emit_fini_breadcrumb functions and enabled by default if
a context is configured by set parallel extension.
Lastly, this patch updates the process descriptor to the correct size as
the memory used in the handshake is directly after the process
descriptor.
v2:
(John Harrison)
- Fix a few comments wording
- Add struture for parent page layout
v3:
(John Harrison)
- A structure for sync semaphore
- Use offsetof to calc address
- Update commit message
v4:
(John Harrison)
- Fix typos in comment explaining memory map of scratch page
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-20-matthew.brost@intel.com
Update context and full GPU reset to work with multi-lrc. The idea is
parent context tracks all the active requests inflight for itself and
its children. The parent context owns the reset replaying / canceling
requests as needed.
v2:
(John Harrison)
- Simply loop in find active request
- Add comments to find ative request / reset loop
v3:
(John Harrison)
- s/its'/its/g
- Fix comment when searching for active request
- Reorder if state in __guc_reset_context
v4:
(Kernel test robot)
- Delete unused is_multi_lrc function
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-15-matthew.brost@intel.com
Introduce context parent-child relationship. Once this relationship is
created all pinning / unpinning operations are directed to the parent
context. The parent context is responsible for pinning all of its
children and itself.
This is a precursor to the full GuC multi-lrc implementation but aligns
to how GuC mutli-lrc interface is defined - a single H2G is used
register / deregister all of the contexts simultaneously.
Subsequent patches in the series will implement the pinning / unpinning
operations for parent / child contexts.
v2:
(Daniel Vetter)
- Add kernel doc, add wrapper to access parent to ensure safety
v3:
(John Harrison)
- Fix comment explaing GEM_BUG_ON in to_parent()
- Make variable names generic (non-GuC specific)
v4:
(John Harrison)
- s/its'/its/g
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-8-matthew.brost@intel.com
Taking a PM reference to prevent intel_gt_wait_for_idle from short
circuiting while any user context has scheduling enabled. Returning GT
idle when it is not can cause all sorts of issues throughout the stack.
v2:
(Daniel Vetter)
- Add might_lock annotations to pin / unpin function
v3:
(CI)
- Drop intel_engine_pm_might_put from unpin path as an async put is
used
v4:
(John Harrison)
- Make intel_engine_pm_might_get/put work with GuC virtual engines
- Update commit message
v5:
- Update commit message again
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-4-matthew.brost@intel.com
Taking a PM reference to prevent intel_gt_wait_for_idle from short
circuiting while a deregister context H2G is in flight. To do this must
issue the deregister H2G from a worker as context can be destroyed from
an atomic context and taking GT PM ref blows up. Previously we took a
runtime PM from this atomic context which worked but will stop working
once runtime pm autosuspend in enabled.
So this patch is two fold, stop intel_gt_wait_for_idle from short
circuting and fix runtime pm autosuspend.
v2:
(John Harrison)
- Split structure changes out in different patch
(Tvrtko)
- Don't drop lock in deregister_destroyed_contexts
v3:
(John Harrison)
- Flush destroyed contexts before destroying context reg pool
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211014172005.27155-3-matthew.brost@intel.com
5.15-rc1 crashes with blank screen when booting up on two ThinkPads
using i915. Bisections converge convincingly, but arrive at different
and surprising "culprits", none of them the actual culprit.
netconsole (with init_netconsole() hacked to call i915_init() when
logging has started, instead of by module_init()) tells the story:
kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
function needs to be 4-byte aligned.
v2:
(Jani Nikula)
- Change BUG_ON to WARN_ON
v3:
(Jani / Tvrtko)
- Short circuit __i915_sw_fence_init on WARN_ON
v4:
(Lucas)
- Break WARN_ON changes out in a different patch
Fixes: 62eaf0ae21 ("drm/i915/guc: Support request cancellation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210922015039.26411-1-matthew.brost@intel.com
Seems to fix some object-debug splat which appeared while debugging
something unrelated.
v2: s/guc_blocked/guc_state.blocked/
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Fixes: 62eaf0ae21 ("drm/i915/guc: Support request cancellation")
Link: https://patchwork.freedesktop.org/patch/msgid/20210924144646.4096402-1-matthew.auld@intel.com
(cherry picked from commit d576b31bde)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
5.15-rc1 crashes with blank screen when booting up on two ThinkPads
using i915. Bisections converge convincingly, but arrive at different
and suprising "culprits", none of them the actual culprit.
netconsole (with init_netconsole() hacked to call i915_init() when
logging has started, instead of by module_init()) tells the story:
kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
function needs to be 4-byte aligned.
Fixes: 62eaf0ae21 ("drm/i915/guc: Support request cancellation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Seems to fix some object-debug splat which appeared while debugging
something unrelated.
v2: s/guc_blocked/guc_state.blocked/
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210924144646.4096402-1-matthew.auld@intel.com
Now that we have locking hierarchy of sched_engine->lock ->
ce->guc_state everything from guc_active can be moved into guc_state and
protected the guc_state.lock.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-23-matthew.brost@intel.com
To make ownership of locking clear move fields (guc_id, guc_id_ref,
guc_id_link) to sub structure guc_id in intel_context.
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-22-matthew.brost@intel.com
Move guc_blocked fence to struct guc_state as the lock which protects
the fence lives there.
s/ce->guc_blocked/ce->guc_state.blocked/g
v2:
(Daniele)
- s/blocked_fence/blocked/g
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210909164744.31249-17-matthew.brost@intel.com
With the global kmem_cache shrink infrastructure gone there's nothing
special and we can convert them over.
I'm doing this split up into each patch because there's quite a bit of
noise with removing the static global.slab_ce to just a
slab_ce.
v2: Make slab static (Jason, 0day)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210727121037.2041102-4-daniel.vetter@ffwll.ch
This adds GuC backend support for i915_request_cancel(), which in turn
makes CONFIG_DRM_I915_REQUEST_TIMEOUT work.
This implementation makes use of fence while there are likely simplier
options. A fence was chosen because of another feature coming soon
which requires a user to block on a context until scheduling is
disabled. In that case we return the fence to the user and the user can
wait on that fence.
v2:
(Daniele)
- A comment about locking the blocked incr / decr
- A comments about the use of the fence
- Update commit message explaining why fence
- Delete redundant check blocked count in unblock function
- Ring buffer implementation
- Comment about blocked in submission path
- Shorter rpm path
v3:
(Checkpatch)
- Fix typos in commit message
(Daniel)
- Rework to simplier locking structure in guc_context_block / unblock
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210727002348.97202-26-matthew.brost@intel.com
We receive notification of an engine reset from GuC at its
completion. Meaning GuC has potentially cleared any HW state
we may have been interested in capturing. GuC resumes scheduling
on the engine post-reset, as the resets are meant to be transparent,
further muddling our error state.
There is ongoing work to define an API for a GuC debug state dump. The
suggestion for now is to manually disable FW initiated resets in cases
where debug state is needed.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210727002348.97202-19-matthew.brost@intel.com
Move active request tracking to a backend vfunc rather than assuming all
backends want to do this in the manner. In the of case execlists /
ring submission the tracking is on the physical engine while with GuC
submission it is on the context.
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210727002348.97202-8-matthew.brost@intel.com
Add intel_context tracing. These trace points are particular helpful
when debugging the GuC firmware and can be enabled via
CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS kernel config option.
Cc: John Harrison <john.c.harrison@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210721215101.139794-19-matthew.brost@intel.com
Disable engine barriers for unpinning with GuC. This feature isn't
needed with the GuC as it disables context scheduling before unpinning
which guarantees the HW will not reference the context. Hence it is
not necessary to defer unpinning until a kernel context request
completes on each engine in the context engine mask.
Cc: John Harrison <john.c.harrison@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210721215101.139794-10-matthew.brost@intel.com
With GuC scheduling, it isn't safe to unpin a context while scheduling
is enabled for that context as the GuC may touch some of the pinned
state (e.g. LRC). To ensure scheduling isn't enabled when an unpin is
done, a call back is added to intel_context_unpin when pin count == 1
to disable scheduling for that context. When the response CTB is
received it is safe to do the final unpin.
Future patches may add a heuristic / delay to schedule the disable
call back to avoid thrashing on schedule enable / disable.
v2:
(John H)
- s/drm_dbg/drm_err
(Daneiel)
- Clean up sched state function
Cc: John Harrison <john.c.harrison@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210721215101.139794-9-matthew.brost@intel.com
Sometimes during context pinning a context with the same guc_id is
registered with the GuC. In this a case deregister must be done before
the context can be registered. A fence is inserted on all requests while
the deregister is in flight. Once the G2H is received indicating the
deregistration is complete the context is registered and the fence is
released.
v2:
(John H)
- Fix commit message
Cc: John Harrison <john.c.harrison@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <john.c.harrison@intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210721215101.139794-8-matthew.brost@intel.com
Implement GuC context operations which includes GuC specific operations
alloc, pin, unpin, and destroy.
v2:
(Daniel Vetter)
- Use msleep_interruptible rather than cond_resched in busy loop
(Michal)
- Remove C++ style comment
v3:
(Matthew Brost)
- Drop GUC_ID_START
(John Harrison)
- Fix a bunch of typos
- Use drm_err rather than drm_dbg for G2H errors
(Daniele)
- Fix ;; typo
- Clean up sched state functions
- Add lockdep for guc_id functions
- Don't call __release_guc_id when guc_id is invalid
- Use MISSING_CASE
- Add comment in guc_context_pin
- Use shorter path to rpm
(Daniele / CI)
- Don't call release_guc_id on an invalid guc_id in destroy
v4:
(Daniel Vetter)
- Add FIXME comment
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210721215101.139794-7-matthew.brost@intel.com
This essentially reverts
commit 84a1074920
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Jan 24 11:36:08 2018 +0000
drm/i915: Shrink the GEM kmem_caches upon idling
mm/vmscan.c:do_shrink_slab() is a thing, if there's an issue with it
then we need to fix that there, not hand-roll our own slab shrinking
code in i915.
Also when this was added there was only one other caller of
kmem_cache_shrink (added 2005 to the acpi code). Now there's a 2nd one
outside of i915 code in a kunit test, which seems legit since that
wants to very carefully control what's in the kmem_cache. This out of
a total of over 500 calls to kmem_cache_create. This alone should have
been warning sign enough that we're doing something silly.
Noticed while reviewing a patch set from Jason to fix up some issues
in our i915_init() and i915_exit() module load/cleanup code. Now that
i915_globals.c isn't any different than normal init/exit functions, we
should convert them over to one unified table and remove
i915_globals.[hc] entirely.
v2: Improve commit message (Jason)
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Cc: David Airlie <airlied@linux.ie>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210721183229.4136488-1-daniel.vetter@ffwll.ch
Previously, we were storing the ring size in the ring pointer before it
was actually allocated. We would then guard setting the ring size on
checking for CONTEXT_ALLOC_BIT. This is error-prone at best and really
only saves us a few bytes on something that already burns at least 4K.
Instead, this patch adds a new ring_size field and makes everything use
that.
v2 (Daniel Vetter):
- Replace 512 * SZ_4K with SZ_2M
v2 (Jason Ekstrand):
- Rebase on top of page migration code
Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210708154835.528166-3-jason@jlekstrand.net
We use some of the lower bits of the retire function pointer for
potential flags, which is quite thorny, since the caller needs to
remember to give the function the correct alignment with
__i915_active_call, otherwise we might incorrectly unpack the pointer
and jump to some garbage address later. Instead of all this let's just
pass the flags along as a separate parameter.
Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Suggested-by: Daniel Vetter <daniel@ffwll.ch>
References: ca419f407b ("drm/i915: Fix crash in auto_retire")
References: d8e44e4dd2 ("drm/i915/overlay: Fix active retire callback alignment")
References: fd5f262db1 ("drm/i915/selftests: Fix active retire callback alignment")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210504164136.96456-1-matthew.auld@intel.com
Allow a brief period for continued access to a dead intel_context by
deferring the release of the struct until after an RCU grace period.
As we are using a dedicated slab cache for the contexts, we can defer
the release of the slab pages via RCU, with the caveat that individual
structs may be reused from the freelist within an RCU grace period. To
handle that, we have to avoid clearing members of the zombie struct.
This is required for a later patch to handle locking around virtual
requests in the signaler, as those requests may want to move between
engines and be destroyed while we are holding b->irq_lock on a physical
engine.
v2: Drop mutex_reinit(), if we never mark the mutex as destroyed we
don't need to reset the debug code, at the loss of having the mutex
debug code spot us attempting to destroy a locked mutex.
v3: As the intended use will remain strongly referenced counted, with
very little inflight access across reuse, drop the ctor.
v4: Drop the unrequired change to remove the temporary reference around
dropping the active context, and add back some more missing ctor
operations.
v5: The ctor is back. Tvrtko spotted that ce->signal_lock [introduced
later] maybe accessed under RCU and so needs special care not to be
reinitialised.
v6: Don't mix SLAB_TYPESAFE_BY_RCU and RCU list iteration.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20201126140407.31952-3-chris@chris-wilson.co.uk
(cherry picked from commit 14d1eaf088)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
In case backoff fails with an error, we return an undefined rq,
assign err to rq correctly.
Fixes: 8a929c9eb1 ("drm/i915: Use ww pinning for intel_context_create_request()")
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200918111208.1392128-1-maarten.lankhorst@linux.intel.com
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 4316b19dee)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The hwsp_gtt object is used for sub-allocation and could therefore
be shared by many contexts causing unnecessary contention during
concurrent context pinning.
However since we're currently locking it only for pinning, it remains
resident until we unpin it, and therefore it's safe to drop the
lock early, allowing for concurrent thread access.
Signed-off-by: Thomas Hellström <thomas.hellstrom@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Some i915 selftests still use i915_vma_lock() as inner lock, and
intel_context_create_request() intel_timeline->mutex as outer lock.
Fortunately for selftests this is not an issue, they should be fixed
but we can move ahead and cleanify lockdep now.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200819140904.1708856-19-maarten.lankhorst@linux.intel.com
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
As a preparation step for full object locking and wait/wound handling
during pin and object mapping, ensure that we always pass the ww context
in i915_gem_execbuffer.c to i915_vma_pin, use lockdep to ensure this
happens.
This also requires changing the order of eb_parse slightly, to ensure
we pass ww at a point where we could still handle -EDEADLK safely.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200819140904.1708856-15-maarten.lankhorst@linux.intel.com
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Instead of doing everything inside of pin_mutex, we move all pinning
outside. Because i915_active has its own reference counting and
pinning is also having the same issues vs mutexes, we make sure
everything is pinned first, so the pinning in i915_active only needs
to bump refcounts. This allows us to take pin refcounts correctly
all the time.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200819140904.1708856-14-maarten.lankhorst@linux.intel.com
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
On eviction, we acquire the vm->mutex and then wait on the vma->active.
Therefore when binding and pinning the vma, we must follow the same
sequence, lock/pin the vma then mark it active. Otherwise, we mark the
vma as active, then wait for the vm->mutex, and meanwhile the evictor
holding the mutex waits upon us to complete our activity.
Fixes: 8ccfc20a7d ("drm/i915/gt: Mark ring->vma as active while pinned")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: <stable@vger.kernel.org> # v5.6+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200706170138.8993-1-chris@chris-wilson.co.uk
(cherry picked from commit 8567774e87)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
This assertion was removed in commit b412c63f1c ("drm/i915/gt: Report
context-is-closed prior to pinning"), but accidentally restored by a
cherry-pick into drm-next and now has percolated back to
drm-intel-next-queued.
Fixes: 2e46a2a0b0 ("drm/i915: Use explicit flag to mark unreachable intel_context")
Fixes: 2b703bbda2 ("Merge drm/drm-next into drm-intel-next-queued")
References: b412c63f1c ("drm/i915/gt: Report context-is-closed prior to pinning")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200520073048.2394034-1-chris@chris-wilson.co.uk
(cherry picked from commit f2c1061a36)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>