I need to keep the GEM context around a bit longer so adding an explicit
flag for syncing execbuf with closed/abandonded contexts.
v2:
* Use already available context flags. (Chris)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200319170707.8262-1-chris@chris-wilson.co.uk
(cherry picked from commit 207e4a71fb)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
It is strictly sufficient to only delay the intel_engine_pm_put from the
context barrier (and not from the context exit) in order to prevent the
gem_exec_nop contention. Adding the delay to the context exit incurs
noticably extra penalty for soft-rc6.
Fixes: edee52c927 ("drm/i915/gt: Delay release of engine-pm after last retirement")
Testcase: igt/i915_pm_rc6_residency/rc6-idle
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200323192029.20723-1-chris@chris-wilson.co.uk
Keep the engine-pm awake until the next jiffie, to avoid immediate
ping-pong under moderate load. (Forcing the idle barrier excerbates the
moderate load, dramatically increasing the driver overhead.) On the
other hand, delaying the idle-barrier slightly incurs longer rc6-off
and so more power consumption.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/848
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200323092841.22240-4-chris@chris-wilson.co.uk
A silly cut'n'paste copied the unlocked error path and used it inside
the pin_mutex lock, we need to drop that lock before returning.
Fixes: b412c63f1c ("drm/i915/gt: Report context-is-closed prior to pinning")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200322123241.17694-1-chris@chris-wilson.co.uk
I need to keep the GEM context around a bit longer so adding an explicit
flag for syncing execbuf with closed/abandonded contexts.
v2:
* Use already available context flags. (Chris)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200319170707.8262-1-chris@chris-wilson.co.uk
GPU saves accumulated context runtime (in CS timestamp units) in PPHWSP
which will be useful for us in cases when we are not able to track context
busyness ourselves (like with GuC). Keep a copy of this in struct
intel_context from where it can be easily read even if the context is not
pinned.
v2:
(Chris)
* Do not store pphwsp address in intel_context.
* Log CS wrap-around.
* Simplify calculation by relying on integer wraparound.
v3:
* Include total/avg in traces and error state for debugging
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200216133620.394962-1-chris@chris-wilson.co.uk
As we use the active state to keep the vma alive while we are reading
its contents during GPU error capture, we need to mark the
ring->vma as active during execution if we want to include the rinbuffer
in the error state.
Reported-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: b1e3177bd1 ("drm/i915: Coordinate i915_active with its own mutex")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200110110402.1231745-3-chris@chris-wilson.co.uk
As we use the active state to keep the vma alive while we are reading
its contents during GPU error capture, we need to mark the
context->state vma as active during execution if we want to include it
in the error state.
Reported-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: b1e3177bd1 ("drm/i915: Coordinate i915_active with its own mutex")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200110110402.1231745-2-chris@chris-wilson.co.uk
Now that we have moved the runtime-pm management out of
intel_context_acctive_acquire, and that itself out of ce->ops->pin(), no
explicit runtime pm wakeref is required in intel_context_pin().
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200109085717.873326-3-chris@chris-wilson.co.uk
While this is encroaching on midlayer territory, having already made the
state allocation a previous step in pinning, we can now pull the common
intel_context_active_acquire() into intel_context_pin() itself. This is
a prelude to make the activation a separate step inside pinning, outside
of the ce->pin_mutex
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200109085717.873326-2-chris@chris-wilson.co.uk
Allow for knowledgeable users to preallocate the context state, and to
separate the allocation step from the pinning step during
intel_context_pin()
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200109085717.873326-1-chris@chris-wilson.co.uk
The last remaining reason for serialising the pin/unpin of the
intel_context is to ensure that our preallocated wakerefs are not
consumed too early (i.e. the unpin of the previous phase does not emit
the idle barriers for this phase before we even submit). All of the
other operations within the context pin/unpin are supposed to be
atomic... Therefore, we can reduce the serialisation to being just on
the i915_active.preallocated_barriers itself and drop the nested
pin_mutex from intel_context_unpin().
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200106114234.2529613-5-chris@chris-wilson.co.uk
Convert the few remaining GEM_TRACE() used for debugging over to the
appropriate GT_TRACE or RQ_TRACE.
References: 639f2f2489 ("drm/i915: Introduce new macros for tracing")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200106114234.2529613-4-chris@chris-wilson.co.uk
Allocate only an internal intel_context for the kernel_context, forgoing
a global GEM context for internal use as we only require a separate
address space (for our own protection).
Now having weaned GT from requiring ce->gem_context, we can stop
referencing it entirely. This also means we no longer have to create random
and unnecessary GEM contexts for internal use.
GEM contexts are now entirely for tracking GEM clients, and intel_context
the execution environment on the GPU.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andi Shyti <andi.shyti@intel.com>
Acked-by: Andi Shyti <andi.shyti@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191221160324.1073045-1-chris@chris-wilson.co.uk
Instead of rummaging through the intel_context to peek at the GEM
context in the middle of request submission to decide whether to use
semaphores, store that information on the intel_context itself.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andi Shyti <andi.shyti@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191220101230.256839-2-chris@chris-wilson.co.uk
Keep the intel_context as being the primary state for i915_request, with
the GEM context a backpointer from the low level state for the rarer
cases we need client information. Our goal is to remove such references
to clients from the backend, and leave the HW submission agnostic to
client interfaces and self-contained.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andi Shyti <andi.shyti@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191220101230.256839-1-chris@chris-wilson.co.uk
New macros ENGINE_TRACE(), CE_TRACE(), RQ_TRACE() and
GT_TRACE() are introduce to tag device name and engine
name with contexts and requests tracing in i915.
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20191213155152.69182-2-venkata.s.dhanalakota@intel.com
Rather than assume if and only if the engine->default_state is not set
that the context is invalid, instead track when we know the context has
valid state -- either because we have copied the default_state or we
have completed a context switch to save the HW state.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191203124155.3019926-1-chris@chris-wilson.co.uk
The expected downside to commit 58b4c1a07a ("drm/i915: Reduce nested
prepare_remote_context() to a trylock") was that it would need to return
-EAGAIN to userspace in order to resolve potential mutex inversion. Such
an unsightly round trip is unnecessary if we could atomically insert a
barrier into the i915_active_fence, so make it happen.
Currently, we use the timeline->mutex (or some other named outer lock)
to order insertion into the i915_active_fence (and so individual nodes
of i915_active). Inside __i915_active_fence_set, we only need then
serialise with the interrupt handler in order to claim the timeline for
ourselves.
However, if we remove the outer lock, we need to ensure the order is
intact between not only multiple threads trying to insert themselves
into the timeline, but also with the interrupt handler completing the
previous occupant. We use xchg() on insert so that we have an ordered
sequence of insertions (and each caller knows the previous fence on
which to wait, preserving the chain of all fences in the timeline), but
we then have to cmpxchg() in the interrupt handler to avoid overwriting
the new occupant. The only nasty side-effect is having to temporarily
strip off the RCU-annotations to apply the atomic operations, otherwise
the rules are much more conventional!
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=112402
Fixes: 58b4c1a07a ("drm/i915: Reduce nested prepare_remote_context() to a trylock")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191127134527.3438410-1-chris@chris-wilson.co.uk
Avoid going to the base i915 device when we already have a path from gt
to the runtime powermanagement interface. The benefit is that it looks a
bit more self-consistent to always be acquiring the gt->uncore->rpm for
use with the gt->uncore.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191007154531.1750-1-chris@chris-wilson.co.uk
Keep track of the GEM contexts underneath i915->gem.contexts and assign
them their own lock for the purposes of list management.
v2: Focus on lock tracking; ctx->vm is protected by ctx->mutex
v3: Correct split with removal of logical HW ID
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-15-chris@chris-wilson.co.uk
Forgo the struct_mutex serialisation for i915_active, and interpose its
own mutex handling for active/retire.
This is a multi-layered sleight-of-hand. First, we had to ensure that no
active/retire callbacks accidentally inverted the mutex ordering rules,
nor assumed that they were themselves serialised by struct_mutex. More
challenging though, is the rule over updating elements of the active
rbtree. Instead of the whole i915_active now being serialised by
struct_mutex, allocations/rotations of the tree are serialised by the
i915_active.mutex and individual nodes are serialised by the caller
using the i915_timeline.mutex (we need to use nested spinlocks to
interact with the dma_fence callback lists).
The pain point here is that instead of a single mutex around execbuf, we
now have to take a mutex for active tracker (one for each vma, context,
etc) and a couple of spinlocks for each fence update. The improvement in
fine grained locking allowing for multiple concurrent clients
(eventually!) should be worth it in typical loads.
v2: Add some comments that barely elucidate anything :(
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-6-chris@chris-wilson.co.uk
As we need to use a mutex to serialise i915_active activation
(because we want to allow the callback to sleep), we need to push the
i915_active.retire into a worker callback in case we get need to retire
from an atomic context.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-5-chris@chris-wilson.co.uk
The request->timeline is only valid until the request is retired (i.e.
before it is completed). Upon retiring the request, the context may be
unpinned and freed, and along with it the timeline may be freed. We
therefore need to be very careful when chasing rq->timeline that the
pointer does not disappear beneath us. The vast majority of users are in
a protected context, either during request construction or retirement,
where the timeline->mutex is held and the timeline cannot disappear. It
is those few off the beaten path (where we access a second timeline) that
need extra scrutiny -- to be added in the next patch after first adding
the warnings about dangerous access.
One complication, where we cannot use the timeline->mutex itself, is
during request submission onto hardware (under spinlocks). Here, we want
to check on the timeline to finalize the breadcrumb, and so we need to
impose a second rule to ensure that the request->timeline is indeed
valid. As we are submitting the request, it's context and timeline must
be pinned, as it will be used by the hardware. Since it is pinned, we
know the request->timeline must still be valid, and we cannot submit the
idle barrier until after we release the engine->active.lock, ergo while
submitting and holding that spinlock, a second thread cannot release the
timeline.
v2: Don't be lazy inside selftests; hold the timeline->mutex for as long
as we need it, and tidy up acquiring the timeline with a bit of
refactoring (i915_active_add_request)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190919111912.21631-1-chris@chris-wilson.co.uk
As every i915_active_request should be serialised by a dedicated lock,
i915_active consists of a tree of locks; one for each node. Markup up
the i915_active_request with what lock is supposed to be guarding it so
that we can verify that the serialised updated are indeed serialised.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190816121000.8507-2-chris@chris-wilson.co.uk
Move the timeline from being inside the intel_ring to intel_context
itself. This saves much pointer dancing and makes the relations of the
context to its timeline much clearer.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190809182518.20486-4-chris@chris-wilson.co.uk
Refactor the backends to handle the deferred context allocation in a
consistent manner, and allow calling it as an explicit first step in
pinning a context for the first time. This should make it easier for
backends to keep track of partially constructed contexts from
initialisation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190809182518.20486-2-chris@chris-wilson.co.uk
The shrinker cannot touch objects used by the contexts (logical state
and ring). Currently we mark those as "pin_global" to let the shrinker
skip over them, however, if we remove them from the shrinker lists
entirely, we don't event have to include them in our shrink accounting.
By keeping the unshrinkable objects in our shrinker tracking, we report
a large number of objects available to be shrunk, and leave the shrinker
deeply unsatisfied when we fail to reclaim those. The shrinker will
persist in trying to reclaim the unavailable objects, forcing the system
into a livelock (not even hitting the dread oomkiller).
v2: Extend unshrinkable protection for perma-pinned scratch and guc
allocations (Tvrtko)
v3: Notice that we should be pinned when marking unshrinkable and so the
link cannot be empty; merge duplicate paths.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190802212137.22207-1-chris@chris-wilson.co.uk
By placing our idle-barriers in the i915_active fence tree, we expose
those for reuse by other components that are issuing requests along the
kernel_context. Reusing the proto-barrier active_node is perfectly fine
as the new request implies a context-switch, and so an opportune point
to run the idle-barrier. However, the proto-barrier is not equivalent
to a normal active_node and care must be taken to avoid dereferencing the
ERR_PTR used as its request marker.
v2: Comment the more egregious cheek
v3: A glossary!
Reported-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: ce476c80b8 ("drm/i915: Keep contexts pinned until after the next kernel context switch")
Fixes: a9877da2d6 ("drm/i915/oa: Reconfigure contexts on the fly")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190802100015.1281-1-chris@chris-wilson.co.uk
Track the currently bound address space used by the HW context. Minor
conversions to use the local intel_context.vm are made, leaving behind
some more surgery required to make intel_context the primary through the
selftests.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190730143209.4549-2-chris@chris-wilson.co.uk
Modifying a remote context requires careful serialisation with requests
on that context, and that serialisation requires us to take their
timeline->mutex. Make it so.
Note that while struct_mutex rules, we can't create more than one
request in parallel, but that age is soon coming to an end.
v2: Though it doesn't affect the current users, contexts may share
timelines so check if we already hold the right mutex.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190725131447.27515-1-chris@chris-wilson.co.uk
Prior to freeing the struct, call the fini function to cleanup the
common members. Currently this only calls the debug functions to mark
the structs as destroyed, but may be extended to real work in future.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190718070024.21781-2-chris@chris-wilson.co.uk
Avoid a global idle barrier by reconfiguring each context by rewriting
them with MI_STORE_DWORD from the kernel context.
v2: We only need to determine the desired register values once, they are
the same for all contexts.
v3: Don't remove the kernel context from the list of known GEM contexts;
the world is not ready for that yet.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190716213443.9874-1-chris@chris-wilson.co.uk
If we introduce a callback for i915_active that is only called the first
time we use the i915_active and is symmetrically paired with the
i915_active.retire callback, we can replace the open-coded and
non-atomic implementations -- which will be very fragile (i.e. broken)
upon removing the struct_mutex serialisation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190621183801.23252-4-chris@chris-wilson.co.uk
Remember to keep the rings pinned as well as the context image until the
GPU is no longer active.
v2: Introduce a ring->pin_count primarily to hide the
mock_ring that doesn't fit into the normal GGTT vma picture.
v3: Order is important in teardown, ringbuffer submission needs to drop
the pin count on the engine->kernel_context before it can gleefully free
its ring.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110946
Fixes: ce476c80b8 ("drm/i915: Keep contexts pinned until after the next kernel context switch")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190619170135.15281-1-chris@chris-wilson.co.uk
The idea behind keeping the saturation mask local to a context backfired
spectacularly. The premise with the local mask was that we would be more
proactive in attempting to use semaphores after each time the context
idled, and that all new contexts would attempt to use semaphores
ignoring the current state of the system. This turns out to be horribly
optimistic. If the system state is still oversaturated and the existing
workloads have all stopped using semaphores, the new workloads would
attempt to use semaphores and be deprioritised behind real work. The
new contexts would not switch off using semaphores until their initial
batch of low priority work had completed. Given sufficient backload load
of equal user priority, this would completely starve the new work of any
GPU time.
To compensate, remove the local tracking in favour of keeping it as
global state on the engine -- once the system is saturated and
semaphores are disabled, everyone stops attempting to use semaphores
until the system is idle again. One of the reason for preferring local
context tracking was that it worked with virtual engines, so for
switching to global state we could either do a complete check of all the
virtual siblings or simply disable semaphores for those requests. This
takes the simpler approach of disabling semaphores on virtual engines.
The downside is that the decision that the engine is saturated is a
local measure -- we are only checking whether or not this context was
scheduled in a timely fashion, it may be legitimately delayed due to user
priorities. We still have the same dilemma though, that we do not want
to employ the semaphore poll unless it will be used.
v2: Explain why we need to assume the worst wrt virtual engines.
Fixes: ca6e56f654 ("drm/i915: Disable semaphore busywaits on saturated systems")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190618074153.16055-8-chris@chris-wilson.co.uk
We need to keep the context image pinned in memory until after the GPU
has finished writing into it. Since it continues to write as we signal
the final breadcrumb, we need to keep it pinned until the request after
it is complete. Currently we know the order in which requests execute on
each engine, and so to remove that presumption we need to identify a
request/context-switch we know must occur after our completion. Any
request queued after the signal must imply a context switch, for
simplicity we use a fresh request from the kernel context.
The sequence of operations for keeping the context pinned until saved is:
- On context activation, we preallocate a node for each physical engine
the context may operate on. This is to avoid allocations during
unpinning, which may be from inside FS_RECLAIM context (aka the
shrinker)
- On context deactivation on retirement of the last active request (which
is before we know the context has been saved), we add the
preallocated node onto a barrier list on each engine
- On engine idling, we emit a switch to kernel context. When this
switch completes, we know that all previous contexts must have been
saved, and so on retiring this request we can finally unpin all the
contexts that were marked as deactivated prior to the switch.
We can enhance this in future by flushing all the idle contexts on a
regular heartbeat pulse of a switch to kernel context, which will also
be used to check for hung engines.
v2: intel_context_active_acquire/_release
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190614164606.15633-1-chris@chris-wilson.co.uk
Matching the underlying get/put functions.
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Imre Deak <imre.deak@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190613232156.34940-8-daniele.ceraolospurio@intel.com
Asking the GPU to busywait on a memory address, perhaps not unexpectedly
in hindsight for a shared system, leads to bus contention that affects
CPU programs trying to concurrently access memory. This can manifest as
a drop in transcode throughput on highly over-saturated workloads.
The only clue offered by perf, is that the bus-cycles (perf stat -e
bus-cycles) jumped by 50% when enabling semaphores. This corresponds
with extra CPU active cycles being attributed to intel_idle's mwait.
This patch introduces a heuristic to try and detect when more than one
client is submitting to the GPU pushing it into an oversaturated state.
As we already keep track of when the semaphores are signaled, we can
inspect their state on submitting the busywait batch and if we planned
to use a semaphore but were too late, conclude that the GPU is
overloaded and not try to use semaphores in future requests. In
practice, this means we optimistically try to use semaphores for the
first frame of a transcode job split over multiple engines, and fail if
there are multiple clients active and continue not to use semaphores for
the subsequent frames in the sequence. Periodically, we try to
optimistically switch semaphores back on whenever the client waits to
catch up with the transcode results.
With 1 client, on Broxton J3455, with the relative fps normalized by %cpu:
x no semaphores
+ drm-tip
* patched
+------------------------------------------------------------------------+
| * |
| *+ |
| **+ |
| **+ x |
| x * +**+ x |
| x x * * +***x xx |
| x x * * *+***x *x |
| x x* + * * *****x *x x |
| + x xx+x* + *** * ********* x * |
| + x xx+x* * *** +** ********* xx * |
| * + ++++* + x*x****+*+* ***+*************+x* * |
|*+ +** *+ + +* + *++****** *xxx**********x***+*****************+*++ *|
| |__________A_____M_____| |
| |_______________A____M_________| |
| |____________A___M________| |
+------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 120 2.60475 3.50941 3.31123 3.2143953 0.21117399
+ 120 2.3826 3.57077 3.25101 3.1414161 0.28146407
Difference at 95.0% confidence
-0.0729792 +/- 0.0629585
-2.27039% +/- 1.95864%
(Student's t, pooled s = 0.248814)
* 120 2.35536 3.66713 3.2849 3.2059917 0.24618565
No difference proven at 95.0% confidence
With 10 clients over-saturating the pipeline:
x no semaphores
+ drm-tip
* patched
+------------------------------------------------------------------------+
| ++ ** |
| ++ ** |
| ++ ** |
| ++ ** |
| ++ xx *** |
| ++ xx *** |
| ++ xxx*** |
| ++ xxx*** |
| +++ xxx*** |
| +++ xx**** |
| +++ xx**** |
| +++ xx**** |
| +++ xx**** |
| ++++ xx**** |
| +++++ xx**** |
| +++++ x x****** |
| ++++++ xxx******* |
| ++++++ xxx******* |
| ++++++ xxx******* |
| ++++++ xx******** |
| ++++++ xxxx******** |
| ++++++ xxxx******** |
| ++++++++ xxxxx********* |
|+ + + + ++++++++ xxx*xx**********x* *|
| |__A__| |
| |__AM__| |
| |__A_| |
+------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 120 2.47855 2.8972 2.72376 2.7193402 0.074604933
+ 120 1.17367 1.77459 1.71977 1.6966782 0.085850697
Difference at 95.0% confidence
-1.02266 +/- 0.0203502
-37.607% +/- 0.748352%
(Student's t, pooled s = 0.0804246)
* 120 2.57868 3.00821 2.80142 2.7923878 0.058646477
Difference at 95.0% confidence
0.0730476 +/- 0.0169791
2.68622% +/- 0.624383%
(Student's t, pooled s = 0.0671018)
Indicating that we've recovered the regression from enabling semaphores
on this saturated setup, with a hint towards an overall improvement.
Very similar, but of smaller magnitude, results are observed on both
Skylake(gt2) and Kabylake(gt4). This may be due to the reduced impact of
bus-cycles, where we see a 50% hit on Broxton, it is only 10% on the big
core, in this particular test.
One observation to make here is that for a greedy client trying to
maximise its own throughput, using semaphores is the right choice. It is
only the holistic system-wide view that semaphores of one client
impacts another and reduces the overall throughput where we would choose
to disable semaphores.
The most noticeable negactive impact this has is on the no-op
microbenchmarks, which are also very notable for having no cpu bus load.
In particular, this increases the runtime and energy consumption of
gem_exec_whisper.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190504070707.30902-1-chris@chris-wilson.co.uk
We no longer need to track the active intel_contexts within each engine,
allowing us to drop a tricky mutex_lock from inside unpin (which may
occur inside fs_reclaim).
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190426163336.15906-8-chris@chris-wilson.co.uk
We switched to a tree of per-engine HW context to accommodate the
introduction of virtual engines. However, we plan to also support
multiple instances of the same engine within the GEM context, defeating
our use of the engine as a key to looking up the HW context. Just
allocate a logical per-engine instance and always use an index into the
ctx->engines[]. Later on, this ctx->engines[] may be replaced by a user
specified map.
v2: Add for_each_gem_engine() helper to iterator within the engines lock
v3: intel_context_create_request() helper
v4: s/unsigned long/unsigned int/ 4 billion engines is quite enough.
v5: Push iterator locking to caller
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190426163336.15906-7-chris@chris-wilson.co.uk
Move the intel_context_instance() to the caller so that we can decouple
ourselves from one context instance per engine.
v2: Rename pin_lock() to lock_pinned(), hopefully that is clearer.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190426163336.15906-5-chris@chris-wilson.co.uk
In the current scheme, on submitting a request we take a single global
GEM wakeref, which trickles down to wake up all GT power domains. This
is undesirable as we would like to be able to localise our power
management to the available power domains and to remove the global GEM
operations from the heart of the driver. (The intent there is to push
global GEM decisions to the boundary as used by the GEM user interface.)
Now during request construction, each request is responsible via its
logical context to acquire a wakeref on each power domain it intends to
utilize. Currently, each request takes a wakeref on the engine(s) and
the engines themselves take a chipset wakeref. This gives us a
transition on each engine which we can extend if we want to insert more
powermangement control (such as soft rc6). The global GEM operations
that currently require a struct_mutex are reduced to listening to pm
events from the chipset GT wakeref. As we reduce the struct_mutex
requirement, these listeners should evaporate.
Perhaps the biggest immediate change is that this removes the
struct_mutex requirement around GT power management, allowing us greater
flexibility in request construction. Another important knock-on effect,
is that by tracking engine usage, we can insert a switch back to the
kernel context on that engine immediately, avoiding any extra delay or
inserting global synchronisation barriers. This makes tracking when an
engine and its associated contexts are idle much easier -- important for
when we forgo our assumed execution ordering and need idle barriers to
unpin used contexts. In the process, it means we remove a large chunk of
code whose only purpose was to switch back to the kernel context.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Imre Deak <imre.deak@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190424200717.1686-5-chris@chris-wilson.co.uk
We wish to start segregating the power management into different control
domains, both with respect to the hardware and the user interface. The
first step is that at the lowest level flow of requests, we want to
process a context event (and not a global GEM operation). In this patch,
we introduce the context callbacks that in future patches will be
redirected to per-engine interfaces leading to global operations as
required.
The intent is that this will be guarded by the timeline->mutex, except
that retiring has not quite finished transitioning over from being
guarded by struct_mutex. So at the moment it is protected by
struct_mutex with a reminded to switch.
v2: Rename default handlers to intel_context_enter_engine.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190424200717.1686-3-chris@chris-wilson.co.uk
Start partitioning off the code that talks to the hardware (GT) from the
uapi layers and move the device facing code under gt/
One casualty is s/intel_ringbuffer.h/intel_engine.h/ with the plan to
subdivide that header and body further (and split out the submission
code from the ringbuffer and logical context handling). This patch aims
to be simple motion so git can fixup inflight patches with little mess.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190424174839.7141-1-chris@chris-wilson.co.uk