Returns a snapshot of the GPU clock counter. Needed
for certain OpenGL extensions.
v2: agd5f
- address Jerome's comments
- add function documentation
Signed-off-by: Marek Olšák <maraeo@gmail.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Virtual address need to be fenced to know when we can safely remove it.
This patch also properly clear the pagetable. Previously it was
serouisly broken.
Kernel 3.5/3.4 need a similar patch but adapted for difference in mutex locking.
v2: For to update pagetable when unbinding bo (don't bailout if
bo_va->valid is true).
v3: Add kernel 3.5/3.4 comment.
v4: Fix compilation warnings.
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Add support for using memory buffers rather than
scratch registers. Some rings may not be able to
write to scratch registers.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Just store the index in the ring structure.
Idea taken from one of Jerome's wip rptr patches.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Const IBs are executed on the CE not the CP, so we can't
fence them in the normal way.
So submit them directly before the IB instead, just as
the documentation says.
v2: keep the extra documentation
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Otherwise we can encounter out of memory situations under extreme load.
v2: add documentation for the new function
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Try to save whatever is on the rings when
we encounter an lockup.
v2: Fix spelling error. Free saved ring data if reset fails.
Add documentation for the new functions.
v3: Some more spelling fixes
v4: It doesn't make sense to save anything if all fences
are signaled
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Before emitting any indirect buffer, emit the offset of the next
valid ring content if any. This allow code that want to resume
ring to resume ring right after ib that caused GPU lockup.
v2: use scratch registers instead of storing it into memory
v3: skip over the surface sync for ni and si as well
v4: use SET_CONFIG_REG instead of PACKET0
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Just restore the page table instead. Addressing three
problem with this change:
1. Calling vm_manager_suspend in the suspend path is
problematic cause it wants to wait for the VM use
to end, which in case of a lockup never happens.
2. In case of a locked up memory controller
unbinding the VM seems to make it even more
unstable, creating an unrecoverable lockup
in the end.
3. If we want to backup/restore the leftover ring
content we must not unbind VMs in between.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Just reinitialize the shader content on resume instead.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
The IB pool is in gart memory, so it is completely
superfluous to unpin / repin it on suspend / resume.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
GPU reset need to be exclusive, one happening at a time. For this
add a rw semaphore so that any path that trigger GPU activities
have to take the semaphore as a reader thus allowing concurency.
The GPU reset path take the semaphore as a writer ensuring that
no concurrent reset take place.
v2: init rw semaphore
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Instead of returning the error handle it directly
and while at it fix the comments about the ring lock.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Try to remove or replace the cs_mutex with a
vm_mutex where it is still needed.
v2: fix locking order
v3: rebased on drm-next
Signed-off-by: Christian König <deathsimple@vodafone.de>
So we can skip the locking. Also renames sw_int to
ring_int, cause that better matches its purpose.
Signed-off-by: Christian Koenig <christian.koenig@amd.com>
1. It is really dangerous to have more than one
spinlock protecting the same information.
2. radeon_irq_set sometimes wasn't called with lock
protection, so it can happen that more than one
CPU would tamper with the irq regs at the same
time.
3. The pm.gui_idle variable was assuming that the 3D
engine wasn't becoming idle between testing the
register and setting the variable. So just remove
it and test the register directly.
v2: Also handle the hpd irq code the same way.
v3: Rename hpd parameter for clarification.
Signed-off-by: Christian Koenig <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
The spinlock was actually there to protect the
rptr, but rptr was read outside of the locked area.
Also we don't really need a spinlock here, an
atomic should to quite fine since we only need to
prevent it from being reentrant.
v2: Keep the spinlock....
v3: Back to an atomic again after finding & fixing the real bug.
Signed-off-by: Christian Koenig <christian.koenig@amd.com>
It is a rw_semaphore now and only write locked
while changing the clock. Also the lock is renamed
to better reflect what it is protecting.
v2: Keep the ttm_vm_ops on IGPs
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Move inter ring syncing with semaphores into the
existing ring allocations, with that we need to
lock the ring mutex only once.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
It is completely unnecessary to create fences
before they are emitted, so remove it and a bunch
of checks if fences are emitted or not.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
- Properly set up the RBs
- Properly set up the SPI
- Properly set up gb_addr_config
This should fix rendering issues on certain cards.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Pull drm fixes from Dave Airlie:
"A bunch of fixes:
- vmware memory corruption
- ttm spinlock balance
- cirrus/mgag200 work in the presence of efifb
and finally Alex and Jerome managed to track down a magic set of bits
that on certain rv740 and evergreen cards allow the correct use of the
complete set of render backends, this makes the cards operate
correctly in a number of scenarios we had issues in before, it also
manages to boost speed on benchmarks my large amounts on these
specific gpus."
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/edid: Make the header fixup threshold tunable
drm/radeon: fix regression in UMS CS ioctl
drm/vmwgfx: Fix nasty write past alloced memory area
drm/ttm: Fix spinlock imbalance
drm/radeon: fixup tiling group size and backendmap on r6xx-r9xx (v4)
drm/radeon: fix HD6790, HD6570 backend programming
drm/radeon: properly program gart on rv740, juniper, cypress, barts, hemlock
drm/radeon: fix bank information in tiling config
drm/mgag200: kick off conflicting framebuffers earlier.
drm/cirrus: kick out conflicting framebuffers earlier
cirrus: avoid crash if driver fails to load
Tiling group size is always 256bits on r6xx/r7xx/r8xx/9xx. Also fix and
simplify render backend map. This now properly sets up the backend map
on r6xx-9xx which should improve 3D performance.
Vadim benchmarked also:
Some benchmarks on juniper (5750), fullscreen 1920x1080,
first result - kernel 3.4.0+ (fb21affa), second - with these patches:
Lightsmark: 91 fps => 123 fps +35%
Doom3: 74 fps => 101 fps +36%
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Just move its only caller into the same file as it and make it static.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
No need to malloc it any more.
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
If we don't store local data into global variables
it isn't necessary to lock anything.
v2: rebased on new SA interface
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
It never really belonged there in the first place.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
It isn't necessary any more and the suballocator seems to perform
even better.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Directly use the suballocator to get small chunks of memory.
It's equally fast and doesn't crash when we encounter a GPU reset.
v2: rebased on new SA interface.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
A startover with a new idea for a multiple ring allocator.
Should perform as well as a normal ring allocator as long
as only one ring does somthing, but falls back to a more
complex algorithm if more complex things start to happen.
We store the last allocated bo in last, we always try to allocate
after the last allocated bo. Principle is that in a linear GPU ring
progression was is after last is the oldest bo we allocated and thus
the first one that should no longer be in use by the GPU.
If it's not the case we skip over the bo after last to the closest
done bo if such one exist. If none exist and we are not asked to
block we report failure to allocate.
If we are asked to block we wait on all the oldest fence of all
rings. We just wait for any of those fence to complete.
v2: We need to be able to let hole point to the list_head, otherwise
try free will never free the first allocation of the list. Also
stop calling radeon_fence_signalled more than necessary.
v3: Don't free allocations without considering them as a hole,
otherwise we might lose holes. Also return ENOMEM instead of ENOENT
when running out of fences to wait for. Limit the number of holes
we try for each ring to 3.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Use one wait queue for all rings. When one ring progress, other
likely does to and we are not expecting to have a lot of waiter
anyway.
Also add a fence_wait_any that will wait until the first fence
in the fence array (one fence per ring) is signaled. This allow
to wait on all rings.
v2: some minor cleanups and improvements.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Define the interface without modifying the allocation
algorithm in any way.
v2: rebase on top of fence new uint64 patch
v3: add ring to debugfs output
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Instead of offset + size keep start and end offset directly.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Make the suballocator self containing to locking.
v2: split the bugfix into a seperate patch.
v3: remove some unreleated changes.
Sig-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Some callers illegal called fence_wait_next/empty
while holding the ring emission mutex. So don't
relock the mutex in that cases, and move the actual
locking into the fence code.
v2: Don't try to unlock the mutex if it isn't locked.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Using 64bits fence sequence we can directly compare sequence
number to know if a fence is signaled or not. Thus the fence
list became useless, so does the fence lock that mainly
protected the fence list.
Things like ring.ready are no longer behind a lock, this should
be ok as ring.ready is initialized once and will only change
when facing lockup. Worst case is that we return an -EBUSY just
after a successfull GPU reset, or we go into wait state instead
of returning -EBUSY (thus delaying reporting -EBUSY to fence
wait caller).
v2: Remove left over comment, force using writeback on cayman and
newer, thus not having to suffer from possibly scratch reg
exhaustion
v3: Rebase on top of change to uint64 fence patch
v4: Change DCE5 test to force write back on cayman and newer but
also any APU such as PALM or SUMO family
v5: Rebase on top of new uint64 fence patch
v6: Just break if seq doesn't change any more. Use radeon_fence
prefix for all function names. Even if it's now highly optimized,
try avoiding polling to often.
v7: We should never poll the last_seq from the hardware without
waking the sleeping threads, otherwise we might lose events.
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
This convert fence to use uint64_t sequence number intention is
to use the fact that uin64_t is big enough that we don't need to
care about wrap around.
Tested with and without writeback using 0xFFFFF000 as initial
fence sequence and thus allowing to test the wrap around from
32bits to 64bits.
v2: Add comment about possible race btw CPU & GPU, add comment
stressing that we need 2 dword aligned for R600_WB_EVENT_OFFSET
Read fence sequenc in reverse order of GPU write them so we
mitigate the race btw CPU and GPU.
v3: Drop the need for ring to emit the 64bits fence, and just have
each ring emit the lower 32bits of the fence sequence. We
handle the wrap over 32bits in fence_process.
v4: Just a small optimization: Don't reread the last_seq value
if loop restarts, since we already know its value anyway.
Also start at zero not one for seq value and use pre instead
of post increment in emmit, otherwise wait_empty will deadlock.
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
A single global mutex for ring submissions seems sufficient.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Nothing chipset or ring specific with it,
so also move it to radon_ring.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Don't hard code the 10 seconds timeout. Compute jobs
can run much longer.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
It isn't chipset specific, so it makes no sense
to have that inside r100.c.
Signed-off-by: Christian König <deathsimple@vodafone.de>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>