Now that we have fast timeslicing on semaphores, we no longer need to
prioritise none-semaphore work as we will yield any work blocked on a
semaphore to the next in the queue. Previously with no timeslicing,
blocking on the semaphore caused extremely bad scheduling with multiple
clients utilising multiple rings. Now, there is no impact and we can
remove the complication.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200513173504.28322-1-chris@chris-wilson.co.uk
Upon waiting a request (when asked), we gave that request a small
priority boost, not enough for it to cause preemption, but enough for it
to be scheduled next before all equals. We also used that bit to give
new clients a small priority boost, similar to FQ_CODEL, such that we
favoured short interactive tasks ahead of long running streams.
However, this is causing lots of complications with timeslicing where we
both want to honour the boost and yet ignore it. Those complications
cause unexpected user behaviour (tasks not being timesliced and run
concurrently as epxected), and the easiest way to resolve that is to
remove the boost. Hopefully, we can find a compromise again if we need
to, but in theory timeslicing itself and future more advanced schedulers
should give us the interactivity boost we seek.
Testcase: igt/gem_exec_schedule/lateslice
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200507152338.7452-3-chris@chris-wilson.co.uk
The older arches did not convert MI_STORE_DATA_IMM to using the GTT, but
left them writing to a physical address. The notes suggest that the
primary reason would be so that the writes were cache coherent, as the
CPU cache uses physical tagging. As such we did not implement the
legacy variant of MI_STORE_DATA_IMM and so left all the relocations
synchronous -- but with a small function to convert from the vma address
into the physical address, we can implement asynchronous relocs on these
older arches, fixing up a few tests that require them.
In order to be able to test the legacy paths, refactor the gpu
relocations so that we can hook them up to a selftest.
v2: Use an array of offsets not enum labels for the selftest
v3: Refactor the common igt_hexdump()
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/757
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200504140629.28240-1-chris@chris-wilson.co.uk
If at first we don't succeed, try try again.
Not all engines may support the MI ops we need to perform asynchronous
relocation patching, and so we end up falling back to a synchronous
operation that has a liability of blocking. However, Tvrtko pointed out
we don't need to use the same engine to perform the relocations as we
are planning to execute the execbuf on, and so if we switch over to a
working engine, we can perform the relocation asynchronously. The user
execbuf will be queued after the relocations by virtue of fencing.
This patch creates a new context per execbuf requiring asynchronous
relocations on an unusable engines. This is perhaps a bit excessive and
can be ameliorated by a small context cache, but for the moment we only
need it for working around a little used engine on Sandybridge, and only
if relocations are actually required to an active batch buffer.
Now we just need to teach the relocation code to handle physical
addressing for gen2/3, and we should then have universal support!
Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Testcase: igt/gem_exec_reloc/basic-spin # snb
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200501192945.22215-3-chris@chris-wilson.co.uk
The ring is a precious resource: we anticipate to only use a few hundred
bytes for a request, and only try to reserve that before we start. If we
go beyond our guess in building the request, then instead of waiting at
the start of execbuf before we hold any locks or other resources, we
may trigger a wait inside a critical region. One example is in using gpu
relocations, where currently we emit a new MI_BB_START from the ring
every time we overflow a page of relocation entries. However, instead of
insert the command into the precious ring, we can chain the next page of
relocation entries as MI_BB_START from the end of the previous.
v2: Delay the emit_bb_start until after all the chained vma
synchronisation is complete. Since the buffer pool batches are idle, this
_should_ be a no-op, but one day we may some fancy async GPU bindings
for new vma!
v3: Use pool/batch consitently, once we start thinking in terms of the
batch vma, use batch->obj.
v4: Explain the magic number 4.
Tvrtko spotted that we lose propagation of the error for failing to
submit the relocation request; that's easier to fix up in the next
patch.
Testcase: igt/gem_exec_reloc/basic-many-active
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200501192945.22215-1-chris@chris-wilson.co.uk
Since the introduction of 'soft-rc6', we aim to park the device quickly
and that results in frequent idling of the whole device. Currently upon
idling we free the batch buffer pool, and so this renders the cache
ineffective for many workloads. If we want to have an effective cache of
recently allocated buffers available for reuse, we need to decouple that
cache from the engine powermanagement and make it timer based. As there
is no reason then to keep it within the engine (where it once made
retirement order easier to track), we can move it up the hierarchy to the
owner of the memory allocations.
v2: Hook up to debugfs/drop_caches to clear the cache on demand.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200430111819.10262-2-chris@chris-wilson.co.uk
The history of i915_vma_close() is confusing, as is its use. As the
lifetime of the i915_vma is currently bounded by the object it is
attached to, we needed a means of identify when a vma was no longer in
use by userspace (via the user's fd). This is further complicated by
that only ppgtt vma should be closed at the user's behest, as the ggtt
were always shared.
Now that we attach the vma to a lut on the user's context, the open
count does indicate how many unique and open context/vm are referencing
this vma from the user. As such, we can and should just use the
open_count to track when the vma is still in use by userspace.
It's a poor man's replacement for reference counting.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/1193
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Andi Shyti <andi.shyti@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200422190558.30509-1-chris@chris-wilson.co.uk
A recent commit in clang added -Wtautological-compare to -Wall, which is
enabled for i915 after -Wtautological-compare is disabled for the rest
of the kernel so we see the following warning on x86_64:
../drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1433:22: warning:
result of comparison of constant 576460752303423487 with expression of
type 'unsigned int' is always false
[-Wtautological-constant-out-of-range-compare]
if (unlikely(remain > N_RELOC(ULONG_MAX)))
~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
../include/linux/compiler.h:78:42: note: expanded from macro 'unlikely'
# define unlikely(x) __builtin_expect(!!(x), 0)
^
1 warning generated.
It is not wrong in the case where ULONG_MAX > UINT_MAX but it does not
account for the case where this file is built for 32-bit x86, where
ULONG_MAX == UINT_MAX and this check is still relevant.
Cast remain to unsigned long, which keeps the generated code the same
(verified with clang-11 on x86_64 and GCC 9.2.0 on x86 and x86_64) and
the warning is silenced so we can catch more potential issues in the
future.
Closes: https://github.com/ClangBuiltLinux/linux/issues/778
Suggested-by: Michel Dänzer <michel@daenzer.net>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200214054706.33870-1-natechancellor@gmail.com
If the caller allows and we do not have to wait for any signals,
immediately execute the work within the caller's process. By doing so we
avoid the overhead of scheduling a new task, and the latency in
executing it, at the cost of pulling that work back into the immediate
context. (Sometimes we still prefer to offload the task to another cpu,
especially if we plan on executing many such tasks in parallel for this
client.)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200325120227.8044-2-chris@chris-wilson.co.uk
As we store the handle lookup inside a radix tree, we do not need the
gem_context->mutex except until we need to insert our lookup into the
common radix tree. This takes a small bit of rearranging to ensure that
the lut we insert into the tree is ready prior to actually inserting it
(as soon as it is exposed via the radixtree, it is visible to any other
submission).
v2: For brownie points, remove the goto spaghetti.
v3: Tighten up the closed-handle checks.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200323092841.22240-8-chris@chris-wilson.co.uk
For our convenience, and to avoid frequent allocations, we placed some
lists we use for execbuf inside the common i915_vma struct. As we look
to parallelise execbuf, such fields guarded by the struct_mutex BKL must
be pulled under local control. Instead of using the i915_vma as our
primary means of tracking the user's list of objects and their virtual
mappings, we use a local eb_vma with the same lists as before (just now
local not global).
This should allow us to only perform the lookup of vma used for
execution once during the execbuf ioctl, as currently we need to remove
our secrets from inside i915_vma everytime we drop the struct_mutex as
another execbuf may use the shared locations.
Once potential user visible consequence is that we can remove the
requirement that the execobj[] be unique, and only require that they do
not conflict (i.e. you cannot softpin the same object into two locations.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200303204345.1859734-2-chris@chris-wilson.co.uk
With the goal of removing the serialisation from around execbuf, we will
no longer have the privilege of there being a single execbuf in flight
at any time and so will only be able to inspect the user's flags within
the carefully controlled execbuf context. i915_gem_evict_for_node() is
the only user outside of execbuf that currently peeks at the flag to
convert an overlapping softpinned request from ENOSPC to EINVAL. Retract
this nicety and only report ENOSPC if the location is in current use,
either due to this execbuf or another.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200303204345.1859734-1-chris@chris-wilson.co.uk
Since commit e5dadff4b0 ("drm/i915: Protect request retirement with
timeline->mutex"), the request retirement can happen outside of the
struct_mutex serialised only by the timeline->mutex. We drop the
timeline->mutex on submitting the request (i915_request_add) so after
that point, it is liable to be freed. Make sure our local reference is
kept alive until we have finished attaching it to the signalers. (Note
that this erodes the argument that i915_request_add should consume the
reference, but that is a slightly larger patch!)
Fixes: e5dadff4b0 ("drm/i915: Protect request retirement with timeline->mutex")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191217134729.3297818-1-chris@chris-wilson.co.uk
Execute the cmdparser asynchronously as part of the submission pipeline.
Using our dma-fences, we can schedule execution after an asynchronous
piece of work, so we move the cmdparser out from under the struct_mutex
inside execbuf as run it as part of the submission pipeline. The same
security rules apply, we copy the user batch before validation and
userspace cannot touch the validation shadow. The only caveat is that we
will do request construction before we complete cmdparsing and so we
cannot know the outcome of the validation step until later -- so the
execbuf ioctl does not report -EINVAL directly, but we must cancel
execution of the request and flag the error on the out-fence.
Closes: https://gitlab.freedesktop.org/drm/intel/issues/611
Closes: https://gitlab.freedesktop.org/drm/intel/issues/412
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191211230858.599030-2-chris@chris-wilson.co.uk
The gen7 cmdparser is primarily a promotion-based system to allow access
to additional registers beyond the HW validation, and allows fallback to
normal execution of the user batch buffer if valid and requires
chaining. In the next patch, we will do the cmdparser validation in the
pipeline asynchronously and so at the point of request construction we
will not know if we want to execute the privileged and validated batch,
or the original user batch. The solution employed here is to execute
both batches, one with raised privileges and one as normal. This is
because the gen7 MI_BATCH_BUFFER_START command cannot change privilege
level within a batch and must strictly use the current privilege level
(or undefined behaviour kills the GPU). So in order to execute the
original batch, we need a second non-priviledged batch buffer chain from
the ring, i.e. we need to emit two batches for each user batch. Inside
the two batches we determine which one should actually execute, we
provide a conditional trampoline to call the original batch.
Implementation-wise, we create a single buffer and write the shadow and
the trampoline inside it at different offsets; and bind the buffer into
both the kernel GGTT for the privileged execution of the shadow and into
the user ppGTT for the non-privileged execution of the trampoline and
original batch. One buffer, two batches and two vma.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191211230858.599030-1-chris@chris-wilson.co.uk