We were attempting to use a per-ring spinlock whilst modifying global
IRQ flags. A recipe for rare missed interrupts.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Ben Widawsky <ben@bwidawsk.net>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
We only ever enable/disable one interrupt (namely user_interrupts and
pipe_notify), so we don't need to track the interrupt masking state.
Also rename irq_enable to irq_enable_mask, now that it won't collide -
beforehand both a irq_mask and irq_enable_mask would have looked a bit
strange.
Reviewed-by: Eric Anholt <eric@anholt.net>
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
We can now open-code the get/put irq functions as they were just
abstracting single register definitions.
It would be nice to merge this in with the IRQ handling code... but that
is too much work for me at present. In addition I could probably
collapse this in to a lot of the Ironlake stuff, but I don't think it's
worth the potential regressions.
This patch itself should not effect functionality.
CC: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ben Widawsky <benjamin.widawsky@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
By recording the location of every request in the ringbuffer, we know
that in order to retire the request the GPU must have finished reading
it and so the GPU head is now beyond the tail of the request. We can
therefore provide a conservative estimate of where the GPU is reading
from in order to avoid having to read back the ring buffer registers
when polling for space upon starting a new write into the ringbuffer.
A secondary effect is that this allows us to convert
intel_ring_buffer_wait() to use i915_wait_request() and so consolidate
upon the single function to handle the complicated task of waiting upon
the GPU. A necessary precaution is that we need to make that wait
uninterruptible to match the existing conditions as all the callers of
intel_ring_begin() have not been audited to handle ERESTARTSYS
correctly.
By using a conservative estimate for the head, and always processing all
outstanding requests first, we prevent a race condition between using
the estimate and direct reads of I915_RING_HEAD which could result in
the value of the head going backwards, and the tail overflowing once
again. We are also careful to mark any request that we skip over in
order to free space in ring as consumed which provides a
self-consistency check.
Given sufficient abuse, such as a set of unthrottled GPU bound
cairo-traces, avoiding the use of I915_RING_HEAD gives a 10-20% boost on
Sandy Bridge (i5-2520m):
firefox-paintball 18927ms -> 15646ms: 1.21x speedup
firefox-fishtank 12563ms -> 11278ms: 1.11x speedup
which is a mild consolation for the performance those traces achieved from
exploiting the buggy autoreported head.
v2: Add a few more comments and make request->tail a conservative
estimate as suggested by Daniel Vetter.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
[danvet: resolve conflicts with retirement defering and the lack of
the autoreport head removal (that will go in through -fixes).]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
... and add a helpr function for the places where we want a flag.
This way we can use ring->id to index into arrays.
v2: Resurrect the missing beautification-space Chris Wilson noted.
I'm moving this space around because I'll reuse ring_str in the next
patch.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
Reviewed-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
While I think the previous code is correct, it was hard to follow and
hard to debug. Since we already have a ring abstraction, might as well
use it to handle the semaphore updates and compares.
I don't expect this code to make semaphores better or worse, but you
never know...
v2:
Remove magic per Keith's suggestions.
Ran Daniel's gem_ring_sync_loop test on this.
v3:
Ignored one of Keith's suggestions.
v4:
Removed some bloat per Daniel's recommendation.
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
Signed-off-by: Keith Packard <keithp@keithp.com>
Various issues involved with the space character were generating
warnings in the checkpatch.pl file. This patch removes most of those
warnings.
Signed-off-by: Akshay Joshi <me@akshayjoshi.com>
Signed-off-by: Keith Packard <keithp@keithp.com>
...which is measured by the size and not the amount of space remaining.
Waiting upon size-8, did one of two things. In the common case with more
than 8 bytes available to write into the ring, it would return
immediately. Otherwise, it would timeout given the impossible condition
of waiting for more space than is available in the ring, leading to
warnings such as:
[drm:intel_cleanup_ring_buffer] *ERROR* failed to quiesce render ring
whilst cleaning up: -16
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Keith Packard <keithp@keithp.com>
Moved the macros around to properly do reads and writes for the given
GPU. This is to address special requirements for gen6 (SNB) reads and
writes.
Registers in the range 0-0x40000 on gen6 platforms require special
handling. Instead of relying on the callers to pick the registers
correctly, move the logic into the read and write functions.
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Added a new function which waits for the ringbuffer space to be equal to
(total - 8). This is the empty condition of the ringbuffer, and
equivalent to head==tail.
Also modified two users of this functionality elsewhere in the code.
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Whilst the GT is powered down (rc6), writes to MMADDR are placed in a
FIFO by the System Agent. This is a limited resource, only 64 entries, of
which 20 are reserved for Display and PCH writes, and so we must take
care not to queue up too many writes. To avoid this, there is counter
which we can poll to ensure there are sufficient free entries in the
fifo.
"Issuing a write to a full FIFO is not supported; at worst it could
result in corruption or a system hang."
Reported-and-Tested-by: Matt Turner <mattst88@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34056
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
A lot of minor tweaks to fix the tracepoints, improve the outputting for
ftrace, and to generally make the tracepoints useful again. It is a start
and enough to begin identifying performance issues and gaps in our
coverage.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
During suspend, Linus found that his machine would hang for 3 seconds,
and identified that intel_ring_buffer_wait() was the culprit:
"Because from looking at the code, I get the notion that
"intel_read_status_page()" may not be exact. But what happens if that
inexact value matches our cached ring->actual_head, so we never even
try to read the exact case? Does it _stay_ inexact for arbitrarily
long times? If so, we might wait for the ring to empty forever (well,
until the timeout - the behavior I see), even though the ring really
_is_ empty."
As the reported HEAD position is only updated every time it crosses a
64k boundary, whilst draining the ring it is indeed likely to remain one
value. If that value matches the last known HEAD position, we never read
the true value from the register and so trigger a timeout.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Move code around and invoke iomem annotation in a few more places in
order to silence sparse. Still a few more iomem annotations to go...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As the IMR for the USER interrupts are not modified elsewhere, we can
separate the spinlock used for these from that of hpd and pipestats.
Those two IMR are manipulated under an IRQ and so need heavier locking.
Reported-and-tested-by: Alexey Fisher <bug-track@fisher-privat.net>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
On i830 if the tail pointer is set to within 2 cachelines of the end of
the buffer, the chip may hang. So instead if the tail were to land in
that location, we pad the end of the buffer with NOPs, and start again
at the beginning.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
In order to enforce the correct memory barriers for irq get/put, we need
to perform the actual counting using atomic operations.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If the tail advances beyond the autoreport HEAD value, then we need to
fallback to an uncached read of the HEAD register in order to ascertain
the correct amount of remaining space in the ringbuffer.
Reported-by: Fang, Xun <xunx.fang@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=32259
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The bulk of the change is to convert the growing list of rings into an
array so that the relationship between the rings and the semaphore sync
registers can be easily computed.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This makes the various rings more consistent by removing the anomalous
handing of the rendering ring execbuffer dispatch.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Before reading ring register, set FORCE_WAKE bit to prevent GT core
power down to low power state, otherwise we may read stale values.
Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
[ickle: added a udelay which seemed to do the trick on my SNB]
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
On some stepping of SNB cpu, the first command to be parsed in BLT
command streamer should be MI_BATCHBUFFER_START otherwise the GPU
may hang.
(cherry picked from commit 8d19215be8)
Conflicts:
drivers/gpu/drm/i915/intel_ringbuffer.c
drivers/gpu/drm/i915/intel_ringbuffer.h
Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Cc: stable@kernel.org
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
On some stepping of SNB cpu, the first command to be parsed in BLT
command streamer should be MI_BATCHBUFFER_START otherwise the GPU
may hang.
Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
[ickle: rebased for -next]
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Preparing the ringbuffer for adding new commands can fail (a timeout
whilst waiting for the GPU to catch up and free some space). So check
for any potential error before overwriting HEAD with new commands, and
propagate that error back to the user where possible.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The ringbuffer keeps a pointer to the parent device, so we can use that
instead of passing around the pointer on the stack.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
... to prevent flush processing of an idle (or even absent) ring.
This fixes a regression during suspend from 87acb0a5.
Reported-and-tested-by: Alexey Fisher <bug-track@fisher-privat.net>
Tested-by: Peter Clifton <pcjc2@cam.ac.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This should fix the error along the reset path were we tried to clear the
tail register by setting it to 0, but were in fact setting it to the
current value and complaining when it did not reset to 0.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Based on an original patch by Zhenyu Wang, this initializes the BLT ring for
SandyBridge and enables support for user execbuffers.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
... before someone tries to use it. The code both calls
intel_ring_begin/advance() and open-codes the bookkeeping performed by
those two functions.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
With multiple rings generating requests independently, the outstanding
requests must also be track independently.
Reported-by: Wang Jinjin <jinjin.wang@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30380
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
It's the same code, essentially, so kill all copies safe one unified
version.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
All functions are extremely similar, so fold them into one generic
implementation.
This function isn't used anyway, because there's not yet a bsd ring
error state dumper.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Avoid cause latencies in other clients by not taking the global struct
mutex and moving the per-client request manipulation a local per-client
mutex. For example, this allows a compositor to schedule a page-flip
(through X) whilst an OpenGL application is monopolising the GPU.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we presume space is signed when computing and looking for wrap along,
make it so.
Reported-by: Owain G. Ainsworth <zerooa@googlemail.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Always PAGE_SIZE and only complicates the code.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>