Try to get the interconnect path for the GPU and vote for the maximum
bandwidth to support all frequencies. This is needed for performance.
Later we will want to scale the bandwidth based on the frequency to
also optimize for power but that will require some device tree
infrastructure that does not yet exist.
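Roughly, the resulting vote looks like the sketch below (illustrative only: the MAX_BUS_BW value, helper name and error handling are placeholders, not taken from this patch):

  #include <linux/interconnect.h>

  #define MAX_BUS_BW 7216000  /* kBps, placeholder value */

  /* Get the default (unnamed) interconnect path for the GPU device and
   * vote for enough bandwidth to support all GPU frequencies. */
  static int gpu_icc_vote_max(struct device *dev, struct icc_path **path)
  {
          *path = of_icc_get(dev, NULL);
          if (IS_ERR(*path))
                  return PTR_ERR(*path);

          /* avg/peak bandwidth in kBps */
          return icc_set_bw(*path, 0, MAX_BUS_BW);
  }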
v6: use icc_set_bw() instead of icc_set()
v5: Remove hardcoded interconnect name and just use the default
v4: Don't use a port string at all to skip the need for names in the DT
v3: Use macros and change port string per Georgi Djakov
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Acked-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Every GPU core has only one interrupt, so there isn't any
value in looking up the interrupt by name. Remove the name (which
is legacy anyway) and use platform_get_irq() instead.
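The lookup then reduces to the usual indexed form (sketch, not the exact diff; the legacy name is shown only for contrast):

  /* Before: platform_get_irq_byname(pdev, "kgsl_3d0_irq") or similar */
  int irq = platform_get_irq(pdev, 0);

  if (irq < 0)
          return irq;
  /* ... pass 'irq' to devm_request_irq() as before ... */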
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
A2XX has its own very simple MMU.
Add a msm_use_mmu() function because we can't rely on iommu_present()
to decide whether or not to use an MMU.
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Derived from the a3xx driver and tested on the following hardware:
imx51-zii-rdu1 (a200 with 128kb gmem)
imx53-qsrb (a200)
msm8060-tenderloin (a220)
Signed-off-by: Jonathan Marek <jonathan@marek.ca>
Reviewed-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
When userspace tries to read the crashstate dump, the read-side
implementation in the driver currently ascii85-encodes all the binary
buffers, and it does this on every read() system call.
A userspace tool like cat typically reads a page at a time, so the
number of read calls depends on the size of the data captured by the
driver. This is certainly not desirable and does not scale well with
large captures.
This patch encodes the buffers only once in the read path. With this there
is an immediate >10X speed improvement in crashstate save time.
Signed-off-by: Sharat Masetty <smasetty@codeaurora.org>
Reviewed-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
For reasons that I'm sure made perfect sense at the time, we were
opting to defer the ringbuffer iova alloc/pin until HW init time,
so when we moved to iova reference counting we ended up adding a
reference count every time the hardware started.
Not that it mattered (because the ring is always around), but it
did make the debug output look odd. Allocate and pin the iova at
create time instead.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add a new function to get and pin the iova memory in one
step (basically renaming the old msm_gem_get_iova function)
and switch msm_gem_get_iova() to only allocate an iova but
not map it in the IOMMU. This is currently only used by
msm_ioctl_gem_info() since all other users of the iova
expect that the memory be immediately available.
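In caller terms the split looks roughly like this (signatures paraphrased from the description above; obj/aspace/iova are assumed caller variables):

  /* Allocate an iova and map the pages into the IOMMU in one step;
   * used everywhere the memory must be immediately available. */
  ret = msm_gem_get_and_pin_iova(obj, aspace, &iova);

  /* Only reserve an iova, without mapping it in the IOMMU; currently
   * only msm_ioctl_gem_info() wants this. */
  ret = msm_gem_get_iova(obj, aspace, &iova);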
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
If the GPU target doesn't define a list of registers then gracefully skip
capturing and/or printing them. This is used by more complex targets like
6xx that have other means of capturing register values.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Use DRM_DEV_INFO/ERROR/WARN instead of dev_info/dev_err/dev_warn to
generate DRM-formatted, device-specific log messages so that they are
easy to differentiate when multiple instances of the driver are present.
Signed-off-by: Mamta Shukla <mamtashukla555@gmail.com>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Use the new of_get_compatible_child() helper to lookup the legacy
pwrlevels child node instead of using of_find_compatible_node(), which
searches the entire tree from a given start node and thus can return an
unrelated (i.e. non-child) node.
This also addresses a potential use-after-free (e.g. after probe
deferral) as the tree-wide helper drops a reference to its first
argument (i.e. the probed device's node).
While at it, also fix the related child-node reference leak.
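The fix follows the usual child-lookup pattern (sketch; the legacy compatible string is assumed, not quoted from the patch):

  struct device_node *child;

  /* Only search the direct children of the GPU node; the tree-wide
   * of_find_compatible_node() could return an unrelated node and also
   * dropped a reference on the starting (probed device's) node. */
  child = of_get_compatible_child(dev->of_node, "qcom,gpu-pwrlevels");
  if (child) {
          /* ... parse the legacy pwrlevels ... */
          of_node_put(child);     /* fix the child-node reference leak */
  }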
Fixes: e2af8b6b0c ("drm/msm: gpu: Use OPP tables if we can")
Cc: stable <stable@vger.kernel.org> # 4.12
Cc: Jordan Crouse <jcrouse@codeaurora.org>
Cc: Rob Clark <robdclark@gmail.com>
Cc: David Airlie <airlied@linux.ie>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Rob Herring <robh@kernel.org>
Failure to load firmware is the primary reason for adreno_load_gpu() to
fail. Try to load the firmware first, before going into the hardware
initialization code that would otherwise need to be unwound. This is
important for a6xx because the GMU gets loaded from the runtime power
code and it is more costly to fail in that path because of missing
firmware.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
All users of do_gettimeofday() have been removed, but this one recently
crept in, along with an incorrect printing of the microseconds portion.
This converts it to using ktime_get_real_timespec64() as a direct
replacement, and adds the leading zeroes. I considered using monotonic
times (ktime_get()) instead, but as this timestamp appears to only
be used for humans rather than compared with other timestamps, the
real time domain is probably good enough.
Fixes: e43b045e2c82 ("drm/msm/gpu: Capture the state of the GPU")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rob Clark <robdclark@gmail.com>
For hangs, copy out the contents of the buffer objects attached to the
guilty submission and print them in the crash dump report.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
HLSQ, SP and TP registers are only accessible from a special
aperture and to make matters worse the aperture is blocked from
the CPU on targets that can support secure rendering. Luckily the
GPU hardware has its own purpose built register dumper that can
access the registers from the aperture. Add a5xx specific code
to program the crashdumper and retrieve the wayward registers
and dump them for the crash state.
Also, remove a block of registers from the regular CPU-accessible
list that aren't useful for debug, which helps reduce the size
of the crash state file by a goodly amount.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add the contents of each ringbuffer to the GPU state and dump the
data in the crash file encoded with ascii85. To save space only
the used portions of the ringbuffer are dumped.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Convert the format of the 'show' debugfs file and the crash
dump to a format resembling YAML. This should be easier to
parse and be more flexible for future changes and expansions.
v2: Use a standard .rst for the msm crashdump documentation
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Capture the GPU state on a GPU hang and store it for later playback
via the devcoredump facility. Only one crash state is stored at a
time on the assumption that the first hang is usually the most
interesting. The existing crash state can be cleared after capturing
it and then a new one will be captured on the next hang.
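Handing the captured state to devcoredump ends up looking roughly like this (sketch; the read/free callbacks are the driver's own and their names here are illustrative):

  #include <linux/devcoredump.h>

  /* Publish the state; user space reads it back via
   * /sys/class/devcoredump/devcdN/data and writes to that node to
   * discard it, allowing a new capture on the next hang. */
  dev_coredumpm(gpu->dev->dev, THIS_MODULE, state, 0, GFP_KERNEL,
                msm_gpu_devcoredump_read, msm_gpu_devcoredump_free);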
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Convert the existing GPU show function to use the GPU state to
dump the information rather than reading it directly from the hardware.
This will require an additional step to capture the state before
dumping it for the existing nodes but it will greatly facilitate reusing
the same code for dumping a previously captured state from a GPU hang.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add the infrastructure to capture the current state of the GPU and
store it in memory so that it can be dumped later.
For now grab the same basic ringbuffer information and registers
that are provided by the debugfs 'gpu' node but obviously this should
be extended to capture a much larger set of GPU information.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Experimentation shows that resuming power quickly after suspending
ends up forcing a system hang for unknown reasons on 5xx targets.
To avoid cycling the power too much (especially during init)
turn up the autosuspend time for a5xx to 250ms and use
pm_runtime_put_autosuspend() when applicable.
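The runtime-PM side of that is the standard autosuspend setup (sketch; 'dev' is the GPU's struct device):

  #include <linux/pm_runtime.h>

  /* At init: don't actually suspend until 250ms of inactivity, so we
   * don't cycle power rapidly on 5xx. */
  pm_runtime_set_autosuspend_delay(dev, 250);
  pm_runtime_use_autosuspend(dev);

  /* Where applicable, instead of pm_runtime_put(): */
  pm_runtime_mark_last_busy(dev);
  pm_runtime_put_autosuspend(dev);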
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
An interrupt command causes the CP to trigger an interrupt as the command
is processed, regardless of whether the GPU is done processing previous
commands. This is seen by the interrupt being delivered before the
fence is written on 8974, and is likely the cause of the additional
CP_WAIT_FOR_IDLE workaround found for a306, which made the CP
wait for the GPU to go idle before triggering the interrupt.
Instead we can set the (undocumented) BIT(31) of the CACHE_FLUSH_TS event,
which will cause a special CACHE_FLUSH_TS interrupt to be triggered from
the GPU as the write event is processed.
Add CACHE_FLUSH_TS to the IRQ masks of A3xx and A4xx and remove the
workaround for A306.
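The fence write then looks roughly like this in the submit path (a sketch from memory using the driver's OUT_PKT3()/OUT_RING() helpers, not the literal diff):

  /* Write the fence value and, via the undocumented BIT(31), ask for a
   * CACHE_FLUSH_TS interrupt once the GPU has processed the write. */
  OUT_PKT3(ring, CP_EVENT_WRITE, 3);
  OUT_RING(ring, CACHE_FLUSH_TS | BIT(31));
  OUT_RING(ring, rbmemptr(ring, fence));
  OUT_RING(ring, submit->seqno);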
Suggested-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Move a5xx specific code to load firmware into a buffer object to
the generic Adreno code. This will come in useful for future targets.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
The number and type of firmware files required differs for each
target. Instead of using a fixed struct member for each possible
firmware file use a generic list of files that should be loaded
on boot. Use some semi-target specific enums to help each target
find the appropriate firmware(s) that it needs to load.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add support for devfreq to dynamically control the GPU frequency.
By default try to use the 'simple_ondemand' governor which can
adjust the frequency based on GPU load.
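Wiring that up is roughly the following (sketch; the profile callback names are illustrative placeholders for the driver's own target/get_dev_status hooks):

  #include <linux/devfreq.h>

  gpu->devfreq.profile.polling_ms = 10;   /* sample GPU load every 10ms */
  gpu->devfreq.profile.target = msm_devfreq_target;
  gpu->devfreq.profile.get_dev_status = msm_devfreq_get_dev_status;

  /* 'simple_ondemand' scales the clock based on the reported busy time */
  gpu->devfreq.devfreq = devm_devfreq_add_device(&pdev->dev,
                  &gpu->devfreq.profile, "simple_ondemand", NULL);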
v2: Fix __aeabi_uldivmod issue from the 0 day bot and use
devfreq_recommended_opp() as suggested by Rob.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Move the clock parsing to adreno_gpu_init() to allow for target
specific probing and manipulation of the clock tables.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Remove the downstream bus scaling code. It isn't needed for
compatibility with a downstream or vendor kernel. Get it out of the
way to clear space for devfreq support.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Trivial fix to spelling mistake in DRM_DEV_ERROR error message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Implement preemption for A5XX targets - this allows multiple
ringbuffers for different priorities with automatic preemption
of a lower priority ringbuffer if a higher one is ready.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
We use a global ringbuffer size and block size for all targets, and
at least for 5XX preemption we need to know the value of RB_CNTL
in several locations, so it makes sense to calculate it once and use
it everywhere.
The only monkey wrench is that we need to disable the RPTR shadow
for A430 targets, but that only needs to be done once and doesn't
affect A5XX, so we can OR in the value at init time.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add a shadow pointer to track the current command being written into
the ring. Don't commit it as 'cur' until the command is submitted.
Because 'cur' is used to construct the software copy of the wptr, this
ensures that somebody peeking in on the ring doesn't assume that a
command is in flight while it is still being written. This isn't a huge
deal with a single ring (though technically the hangcheck could assume
the system is prematurely busy when it isn't), but it will be rather
important for preemption, where the decision to preempt is based
on a non-empty ringbuffer. Without a shadow, an aggressive preemption
scheme could assume that the ringbuffer is non-empty and switch to it
before the CPU is done writing the command, and boom.
Even though preemption won't be supported for all targets, because of
the way the code is organized it is simpler to make this generic for
all targets. The extra load for non-preemption targets should be
minimal.
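The pattern, in rough pseudo-kernel-C (field names are illustrative, not the exact struct members):

  /* While building a command, write at ring->next and leave ring->cur
   * alone, so hangcheck/preemption still see the last fully submitted
   * command as the end of the ring. */
  OUT_RING(ring, data);           /* advances ring->next only */

  /* Once the whole command is written and we update the hardware wptr: */
  ring->cur = ring->next;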
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
In order to manage ringbuffer priority to its fullest userspace
should know how many ringbuffers it has to work with. Add a
parameter to return the number of active rings.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add the infrastructure to support the idea of multiple ringbuffers.
Assign each ringbuffer an id and use that as an index for the various
ring specific operations.
The biggest delta is to support legacy fences. Each fence gets its own
sequence number but the legacy functions expect to use a unique integer.
To handle this we return a unique identifier for each submission but
map it to a specific ring/sequence under the covers. Newer users use
a dma_fence pointer anyway so they don't care about the actual sequence
ID or ring.
The actual mechanics for multiple ringbuffers are very target specific
so this code just allows for the possibility but still only defines
one ringbuffer for each target family.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
When we move to multiple ringbuffers we're going to store the data
in the memptrs on a per-ring basis. In order to prepare for that,
move the current memptrs from the adreno namespace into msm_gpu.
This is way cleaner and immediately lets us kill off some sub
functions so there is much less cost later when we do move to
per-ring structs.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
When firmware was added to linux-firmware, it was put in a qcom sub-
directory, unlike what we'd been using before. For a300_pfp.fw and
a300_pm4.fw symlinks were created, but we'd prefer not to have to do
this in the future. So add support to look in both places when
loading firmware.
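The lookup then tries both locations, roughly like this (sketch; the helper name and error handling are paraphrased):

  #include <linux/firmware.h>

  static const struct firmware *adreno_request_fw(struct device *dev,
                                                  const char *fwname)
  {
          const struct firmware *fw = NULL;
          char *newname;
          int ret;

          /* Prefer the linux-firmware location, e.g. qcom/a300_pfp.fw */
          newname = kasprintf(GFP_KERNEL, "qcom/%s", fwname);
          if (!newname)
                  return ERR_PTR(-ENOMEM);

          ret = request_firmware(&fw, newname, dev);
          kfree(newname);
          if (!ret)
                  return fw;

          /* ... then fall back to the legacy flat name, e.g. a300_pfp.fw */
          ret = request_firmware(&fw, fwname, dev);
          return ret ? ERR_PTR(ret) : fw;
  }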
Signed-off-by: Rob Clark <robdclark@gmail.com>
Previously, in an effort to defer initializing the gpu until firmware
was available (ie. rootfs mounted), the gpu was not loaded when the
subdevice was bound. This meant that clks/etc were requested in a
place where devm couldn't really help unwind if something failed.
Instead move request_firmware() to gpu->hw_init() and construct the gpu
earlier in adreno_bind(). To avoid the rest of the driver needing to
be aware of a gpu that hasn't managed to load firmware and hw_init()
yet, stash the gpu ptr in the adreno device's drvdata, and don't set
priv->gpu until hw_init() succeeds.
Signed-off-by: Rob Clark <robdclark@gmail.com>
Nearly all of the buffer allocations for the kernel allocate a buffer
object, virtual address and GPU iova at the same time. Make a helper
function to handle the details.
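A caller of the helper then looks roughly like this (signature paraphrased; the flag and variable names are illustrative):

  struct drm_gem_object *bo;
  uint64_t iova;
  void *vaddr;

  /* One call: allocate the buffer object, map it into the kernel, and
   * pin a GPU iova in the given address space. */
  vaddr = msm_gem_kernel_new(drm, size, MSM_BO_WC, aspace, &bo, &iova);
  if (IS_ERR(vaddr))
          return PTR_ERR(vaddr);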
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
[dropped msm_fbdev conversion to new helper, since it interferes with
display-handover work, where we want to separate allocation and mapping]
Signed-off-by: Rob Clark <robdclark@gmail.com>
Currently the GPU MMU is attached in the adreno_gpu code but as
more and more of the GPU initialization moves to the generic
GPU path we have a need to map and use GPU memory earlier and
earlier. There isn't any reason to defer attaching the MMU
until later so attach it right after the address space is
created so it can be used immediately.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
msm_gpu's get_timestamp() op (called by the MSM_GET_PARAM ioctl) can
result in register accesses. We need our power domain and clocks to
be active for that. Make sure they are enabled here.
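That amounts to the usual runtime-PM bracket around the register access (sketch; the callback signature is paraphrased):

  pm_runtime_get_sync(&gpu->pdev->dev);   /* power domain + clocks on */
  ret = gpu->funcs->get_timestamp(gpu, value);
  pm_runtime_put_autosuspend(&gpu->pdev->dev);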
Signed-off-by: Archit Taneja <architt@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Buffer object specific resources like pages, domains, sg list
need not be protected with struct_mutex. They can be protected
with a buffer object level lock. This simplifies locking and
makes it easier to avoid potential recursive locking scenarios
for SVM involving mmap_sem and struct_mutex. This also removes
unnecessary serialization when creating buffer objects, and also
between buffer object creation and GPU command submission.
Signed-off-by: Sushmita Susheelendra <ssusheel@codeaurora.org>
[robclark: squash in handling new locking for shrinker]
Signed-off-by: Rob Clark <robdclark@gmail.com>
No functional change, that will come later. But this will make it
easier to deal with dynamically created address spaces (ie. per-
process pagetables for gpu).
Signed-off-by: Rob Clark <robdclark@gmail.com>
Most, but not all, paths were calling this with struct_mutex held. The
fast path in msm_gem_get_iova() (plus some sub-code-paths that only run
the first time) was masking this issue.
So let's just always hold struct_mutex for hw_init(). And sprinkle some
WARN_ON()'s and might_lock() to avoid this sort of problem in the
future.
Signed-off-by: Rob Clark <robdclark@gmail.com>
memptrs->wptr seems to be unused. Remove it to avoid
confusing the upcoming preemption code.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
The amount of information that we need to pass into msm_gpu_init()
is steadily increasing, so add a new struct to stabilize the function
call and make it easier to add new configuration down the line.
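Something along these lines (the struct and field names are illustrative, not the exact layout from the patch):

  /* Bundle the msm_gpu_init() arguments so adding a new knob doesn't
   * mean touching every caller. */
  struct msm_gpu_config {
          const char   *ioname;     /* register region name */
          const char   *irqname;    /* interrupt name */
          uint64_t     va_start;    /* GPU virtual address range */
          uint64_t     va_end;
          unsigned int nr_rings;    /* number of ringbuffers */
  };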
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Some A3XX and A4XX GPU targets required that the GPU clock be
programmed to a non-zero value when it was disabled, so
27MHz was chosen as the "invalid" frequency.
Even though newer targets do not have the same clock restrictions,
we still write 27MHz on clock disable and expect the clock subsystem
to round down to zero.
For unknown reasons, even though the slow clock speed is always
27MHz and it isn't actually a functional level, the legacy device tree
frequency tables always defined it and then did gymnastics to work
around it.
Instead of playing the same silly games, just hard code the "slow" clock
speed in the code as 27MHz and save ourselves a bit of infrastructure.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
User space needs to know where the GMEM region starts so that it
can set up the addressing correctly.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
There are reasons for a memory object to outlive the file descriptor
that created it and so the address space that a buffer object is
attached to must also outlive the file descriptor. Reference count
the address space so that it can remain viable until all the objects
have released their addresses.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
We should be detaching the MMU before destroying the address
space. To do this cleanly, the detach has to happen in
adreno_gpu_cleanup() because it needs access to structs
in adreno_gpu.c. Plus it is better symmetry to have
the attach and detach at the same code level.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>