Commit Graph

523 Commits

Author SHA1 Message Date
Ofir Bitton
0811b39146 habanalabs: add CS completion and timeout properties
In order to support the staged submission feature, we need to
distinguish which command submissions should receive a timeout
and which should receive a completion.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:51 +02:00
Ofir Bitton
d00697fbe1 habanalabs: add new mem ioctl op for mapping hw blocks
To support future ASICs, the driver allows the user to map certain regions
in the device's configuration space for direct access from userspace.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:51 +02:00
farah kassabri
89473a1fc3 habanalabs: fix MMU debugfs related nodes
In the MMU debugfs node, show unscrambled physical addresses.
When reading or writing through the data nodes, unscramble the
physical address before using it for a PCI transaction.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
e1fa724dd1 habanalabs: add user available interrupt to hw_ip
In order to support completions that arrive directly to the user,
the driver needs to supply the user with the first available MSI-X
interrupt.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
farah kassabri
8d79ce162e habanalabs: always try to use the hint address
Currently the hint address is ignored in case the VA block page size
is not a power of 2. We need to support the user hint address in this
case as well, but only if the hint address is aligned to the page size.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
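The alignment rule described in 8d79ce162e can be sketched as follows. This is only an illustration: `hint_addr_is_usable` is a hypothetical helper, not a function from the driver, and treating a zero hint as "no hint" is an assumption. The point it shows is that a bitmask test is valid only for power-of-2 sizes, so a modulo test is needed here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the driver): decide whether a
 * user-supplied hint address may be used for a VA block whose page
 * size is not necessarily a power of 2.
 */
static bool hint_addr_is_usable(uint64_t hint_addr, uint64_t page_size)
{
	/* Assumption: a zero hint means "no hint was given". */
	if (hint_addr == 0 || page_size == 0)
		return false;

	/*
	 * (addr & (size - 1)) only works for power-of-2 sizes,
	 * so the alignment is checked with a modulo instead.
	 */
	return hint_addr % page_size == 0;
}
```

With a 48 MiB page size, for example, a hint of three pages passes the check while the same hint plus 4 KiB does not.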
Ofir Bitton
d2b980f329 habanalabs: add security violations dump to debugfs
In order to improve driver security debuggability, we add
security violations dump to debugfs.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
eea4c2557c habanalabs: ignore F/W BMC errors in case no BMC present
In order to support operation mode in which BMC is not active,
driver must not take BMC errors into consideration.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
f8bc7f091c habanalabs/gaudi: print sync manager SEI interrupt info
Driver must print sync manager SEI information upon receiving
interrupt from FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Christophe JAILLET
825b30c4f3 habanalabs: Use 'dma_set_mask_and_coherent()'
Axe 'hl_pci_set_dma_mask()' and replace it with an equivalent
'dma_set_mask_and_coherent()' call.

This makes the code a bit less verbose.

It also removes an erroneous comment, because 'hl_pci_set_dma_mask()'
does not try to use a fall-back value.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
423815bf02 habanalabs/gaudi: remove PCI access to SM block
Due to a HW limitation we must remove all direct access to SM
registers. In order to do that, we will access SM registers using
the HW QMANS.
When possible and no user context is present, we can directly access
the HW QMANS. Whenever there is an active user, driver will
prepare a pending command buffer list which will be sent upon
user submissions.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
d3f139c462 habanalabs: add driver support for internal cb scheduling
In order to support scenarios in which the driver needs access to
HW components but cannot access them directly, we add support for
scheduling command buffers internally.
These command buffers will be transmitted upon next user command
submission context.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
1e3f2536a8 habanalabs: increment ctx ref from within a cs allocation
A CS must increment the relevant context reference count.
We want to increment the reference inside the CS allocation function,
as opposed to today, where we increment it outside.
This is logical since we want to avoid explicitly incrementing
the context every time we call the CS allocate function.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
8563e19159 habanalabs: separate common code to dedicated folders
We separate some of the common code source files into different
folders for better maintainability and testability.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
edb07cb69c habanalabs: read device boot errors after cpucp is up
The boot CPU can report errors in various boot stages.
The current implementation does not take into consideration errors
reported in late stages, hence we will check for errors at the
latest stage, when fetching cpucp information.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:50 +02:00
Ofir Bitton
6769cea8de habanalabs: report correct dram size in info ioctl
In case MMU is enabled, we must take MMU page size into
consideration when reporting dram size to the user.
This is because the MMU page size can be a value which is NOT
a power-of-2 value. As a result, the total DRAM size (which is always
a power-of-2 value) needs to be rounded down.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
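The rounding described in 6769cea8de amounts to truncating the total size to a whole number of MMU pages. A minimal sketch, assuming a hypothetical helper name and plain 64-bit integers (the driver's actual fields and functions are not shown in this log):

```c
#include <stdint.h>

/*
 * Hypothetical helper: round the total DRAM size down to a whole
 * number of MMU pages. The page size may be a non-power-of-2 value,
 * so integer division is used rather than a bitmask.
 */
static uint64_t usable_dram_size(uint64_t dram_size, uint64_t mmu_page_size)
{
	/* Assumption: a page size of 0 means the MMU is not in use. */
	if (mmu_page_size == 0)
		return dram_size;

	return (dram_size / mmu_page_size) * mmu_page_size;
}
```

For instance, a 32 GiB DRAM with a 48 MiB page size holds 682 full pages, so the reported size would be 682 * 48 MiB rather than the full 32 GiB.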
Moti Haimovski
b19dc67aa8 habanalabs: support non power-of-2 DRAM phys page sizes
DRAM physical page sizes depend on the number of HBMs available in
the device. This number is device-dependent and may also be subject
to binning when one or more of the DRAM controllers are found
to be faulty. Such a configuration may lead to partitioning the DRAM
into non-power-of-2 pages.

To support this feature we also need to add infrastructure for address
scrambling.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Ofir Bitton
a1f8533269 habanalabs: remove access to kernel memory using debugfs
Accessing kernel allocated memory through debugfs should not
be allowed as it introduces a security vulnerability.
We remove the option to read/write kernel memory for all asics.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Ofir Bitton
266cdfa2b7 habanalabs/gaudi: set uninitialized symbol
Initialize local variable that is returned by the function, in
case it is never assigned.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Alon Mizrahi
9402a33624 habanalabs: return dram virtual address in info ioctl
When working with DRAM MMU, we should supply the userspace with the
virtual start address of the DRAM instead of the physical one. This
is because the physical one has no meaning for the user as he only
knows the virtual address range.

Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Oded Gabbay
3abe1040ba habanalabs: update to latest hl_boot_if.h
Update to the latest version of this file that the F/W exports.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Oded Gabbay
1530d46817 habanalabs: add ASIC property of functional HBMs
The number of functional HBMs in the same ASIC can be different due
to malfunctioning HBM banks.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Ofir Bitton
2e36856008 habanalabs/gaudi: add debug prints for security status
In order to have more information while debugging boot issues,
we should print the firmware security status at every boot stage.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Omer Shpigelman
f19040ce41 habanalabs: modify memory functions signatures
For consistency, modify all memory ioctl functions to get the ioctl
arguments structure rather than the arguments themselves.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Omer Shpigelman
3b762f55aa habanalabs: kernel doc format in memory functions
Change all memory functions documentation according to kernel doc
format.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Alon Mizrahi
75d9a2a0aa habanalabs: replace WARN/WARN_ON with dev_crit in driver
Often WARN is defined in data-centers as BUG and we would like to
avoid hanging the entire server on some internal error of the driver
(important as it might be).

Therefore, use dev_crit instead.

Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Moti Haimovski
0eda23d77e habanalabs: report dram_page_size in hw_ip_info ioctl
Instead of having it hard-coded as a define, pass it to the user
in runtime.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Ohad Sharabi
e1b85dbaf0 habanalabs/goya: move mmu_prepare to context init
Currently mmu_prepare is located at context switch.
Since we support a single context, no reason to reconfigure
the MMU registers every context switch.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:49 +02:00
Ofir Bitton
f8b0f2ecc5 habanalabs/gaudi: remove duplicated gaudi packets masks
As all packets use the same CTL register masks, we remove duplicated
masks and use common masks instead.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Ofir Bitton
c209e74214 habanalabs: allow user to pass a staged submission seq
In order to support the staged submission feature, user must be
allowed to use the same CS sequence for all submissions in the
same staged submission.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Ofir Bitton
ac6fdbfe2e habanalabs/gaudi: support CS with no completion
As part of the staged submission feature, we need Gaudi to support
command submissions that will never get a completion.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Ofir Bitton
8e39e75a13 habanalabs: Init the VM module for kernel context
In order to reserve VA ranges for kernel memory, we need
to allow the VM module to be initialized with the kernel context.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Ohad Sharabi
cb6ef0ee6d habanalabs: refactor MMU locks code
Remove mmu_cache_lock as it protects a section which is already
protected by mmu_lock.

In addition, wrap mmu cache invalidate calls in hl_vm_ctx_fini with
mmu_lock.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Oded Gabbay
4c998836d4 habanalabs: update firmware boot interface
Update to latest firmware hl_boot_if.h file.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-27 21:03:48 +02:00
Oded Gabbay
2dc4a6d791 habanalabs: disable FW events on device removal
When device is removed, we need to make sure the F/W won't send us
any more events because during the remove process we disable the
interrupts.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-21 20:30:22 +02:00
Oded Gabbay
f8abaf379b habanalabs: fix backward compatibility of idle check
Need to take the lower 32 bits of the driver's 64-bit idle mask and put
it in the legacy 32-bit variable that the userspace reads to know the
idle mask.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-21 20:30:22 +02:00
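The compatibility fix in f8abaf379b comes down to a truncating copy. A small sketch, using a hypothetical function name rather than the driver's real structure fields:

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive the legacy 32-bit idle mask that older
 * userspace reads from the driver's 64-bit idle mask by keeping only
 * the lower 32 bits.
 */
static uint32_t legacy_idle_mask(uint64_t idle_mask64)
{
	return (uint32_t)(idle_mask64 & 0xFFFFFFFFULL);
}
```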
Ofir Bitton
9354f1b421 habanalabs: zero pci counters packet before submit to FW
Driver does not zero some pci counters packets before sending
to FW. This causes an out of sync PI/CI between driver and FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-21 20:30:22 +02:00
Oded Gabbay
9488307a55 habanalabs: prevent soft lockup during unmap
When using a deep learning framework such as TensorFlow or PyTorch, there
are tens of thousands of host memory mappings. When the user frees
all those mappings at the same time, the process of unmapping and
unpinning them can take a long time, which may cause a soft lockup
bug.

To prevent this, we need to free the core to do other things during
the unmapping process. For now, we chose to do it every 32K unmappings
(each unmap is a single 4K page).

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-12 15:00:10 +02:00
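The approach in 9488307a55 can be illustrated with a userspace sketch that yields after every 32K unmappings. Everything here is a stand-in: the callbacks, the stub functions, and the returned yield count are assumptions made for illustration. In kernel code the yield point would typically be a call such as `cond_resched()`, but whether the commit uses exactly that call is not stated in the message.

```c
#include <stddef.h>

/* "Every 32K unmappings", as stated in the commit message. */
#define UNMAPS_PER_YIELD (32u * 1024u)

typedef void (*unmap_fn)(size_t index);
typedef void (*yield_fn)(void);

/* Stand-in callbacks so the sketch is self-contained. */
static void stub_unmap(size_t index) { (void)index; }
static void stub_yield(void) { }

/*
 * Unmap `count` entries, giving up the CPU after every
 * UNMAPS_PER_YIELD unmappings. Returns how many times it yielded.
 */
static size_t unmap_all(size_t count, unmap_fn unmap_one, yield_fn yield_cpu)
{
	size_t yields = 0;

	for (size_t i = 0; i < count; i++) {
		unmap_one(i);

		if ((i + 1) % UNMAPS_PER_YIELD == 0) {
			yield_cpu();
			yields++;
		}
	}

	return yields;
}
```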
Oded Gabbay
aa6df6533b habanalabs: fix reset process in case of failures
There are some points in the reset process where if the code fails
for some reason, and the system admin tries to initiate the reset
process again we will get a kernel panic.

This is because there aren't any protections in different fini
functions that are called during the reset process.

The protections that are added in this patch make sure that if the fini
functions are called multiple times, without calling init functions
between them, there won't be double release of already released
resources.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-12 14:59:52 +02:00
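The kind of protection described in aa6df6533b is commonly implemented as a guard that makes a fini function idempotent. The sketch below shows that general pattern with a hypothetical `demo_module` type; the actual checks added by the patch are not visible in this log and may differ.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical module state used only for this illustration. */
struct demo_module {
	bool initialized;
	void *buf;
};

static int demo_init(struct demo_module *m)
{
	m->buf = malloc(64);
	if (!m->buf)
		return -1;

	m->initialized = true;
	return 0;
}

/*
 * Safe to call repeatedly without an init in between: the second
 * call sees initialized == false and releases nothing.
 */
static void demo_fini(struct demo_module *m)
{
	if (!m->initialized)
		return;

	free(m->buf);
	m->buf = NULL;
	m->initialized = false;
}
```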
Oded Gabbay
a9d4ef6434 habanalabs: fix dma_addr passed to dma_mmap_coherent
When doing dma_alloc_coherent in the driver, we add a certain hard-coded
offset to the DMA address before returning to the caller function. This
offset is needed when our device uses this DMA address to perform
outbound transactions to the host.

However, if we want to map the DMA'able memory to the user via
dma_mmap_coherent(), we need to pass the original dma address, without
this offset. Otherwise, we will get an erroneous mapping.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-01-12 14:59:36 +02:00
Dinghao Liu
b000700d6d habanalabs: Fix memleak in hl_device_reset
When kzalloc() fails, we should execute hl_mmu_fini()
to release the MMU module. It's the same when
hl_ctx_init() fails.

Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-29 23:23:12 +02:00
Oded Gabbay
097c62b6f0 habanalabs: fix order of status check
When the device is in reset or needs to be reset, the disabled property
is don't-care.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:39 +02:00
Oded Gabbay
fcaebc7354 habanalabs: register to pci shutdown callback
We need to make sure our device is idle when rebooting a virtual
machine. This is done in the driver level.

The firmware will later handle FLR but we want to be extra safe and
stop the devices until the FLR is handled.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:39 +02:00
Alon Mizrahi
a3fd283063 habanalabs: add validation cs counter, fix misplaced counters
Up until now validation errors were counted in the parsing field
of the cs_counters struct, so we added a new counter and increased
it when needed.

In addition, there were some locations where only one of the counters
was updated (ctx or aggregate) so add the second one to be updated
as well.

Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:39 +02:00
Oded Gabbay
98e8781f00 habanalabs/gaudi: retry loading TPC f/w on -EINTR
If loading the firmware file for the TPC f/w was interrupted, try
to do it again, up to 5 times.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:39 +02:00
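The retry policy in 98e8781f00 can be sketched as a bounded loop. This is an illustration under stated assumptions: "up to 5 times" is read here as five attempts in total, and `load_with_retry`, `load_fn`, and `flaky_load` are hypothetical names, not driver functions.

```c
#include <errno.h>

/* Assumption: "up to 5 times" means at most 5 attempts in total. */
#define MAX_LOAD_ATTEMPTS 5

typedef int (*load_fn)(void *ctx);

/*
 * Call `load` until it returns something other than -EINTR or the
 * attempt budget is exhausted. Returns the last result.
 */
static int load_with_retry(load_fn load, void *ctx)
{
	int rc = -EINTR;

	for (int attempt = 0;
	     attempt < MAX_LOAD_ATTEMPTS && rc == -EINTR;
	     attempt++)
		rc = load(ctx);

	return rc;
}

/*
 * Demo loader: reports -EINTR while the counter behind `ctx` is
 * positive, decrementing it each time, then succeeds.
 */
static int flaky_load(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EINTR;
	}

	return 0;
}
```

Any error other than -EINTR is returned immediately, so only interrupted loads are retried.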
Oded Gabbay
377182a3cc habanalabs: adjust pci controller init to new firmware
When the firmware security is enabled, the pcie_aux_dbi_reg_addr
register in the PCI controller is blocked. Therefore, ignore
the result of writing to this register and assume it worked. Also
remove the prints on errors in the internal ELBI write function.

If the security is enabled, the firmware is responsible for setting
this register correctly so we won't have any problem.

If the security is disabled, the write will work (unless something
is totally broken at the PCI level and then the whole sequence
will fail).

In addition, remove a write to register pcie_aux_dbi_reg_addr+4,
which was never actually needed.

Moreover, PCIE_DBI registers are blocked from host access when
firmware security is enabled. Use a different register to flush the
writes.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:39 +02:00
Oded Gabbay
90ffe170a3 habanalabs: update comment in hl_boot_if.h
Hard-reset flag is updated in many stages of the boot sequence of the
firmware.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:38 +02:00
Oded Gabbay
13d0ee10b5 habanalabs/gaudi: enhance reset message
Print the initiator who performs the hard-reset for easier debugging.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:38 +02:00
Ofir Bitton
6bbb77b9e6 habanalabs: full FW hard reset support
Driver must fetch FW hard reset capability at every FW boot stage:
preboot, CPU boot, CPU application.
If hard reset is triggered, driver will take into consideration
only the last capability received.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:38 +02:00
Oded Gabbay
0024c09485 habanalabs/gaudi: disable CGM at HW initialization
In case the clock gating was enabled in preboot we need to disable it
at the H/W initialization stage before touching the MME/TPC registers.
Otherwise, the ASIC can get stuck. If the security is enabled in
the firmware level, the CGM is always disabled and the driver can't
enable it.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:38 +02:00
Tomer Tayar
7a585dfc32 habanalabs: Revise comment to align with mirror list name
hw_queues_mirror was renamed to cs_mirror, so revise a comment
that refers to this list accordingly.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-12-28 08:47:38 +02:00