We traditionally have linux/ before asm/
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The XIVE interrupt controller is the new interrupt controller
found in POWER9. It supports advanced virtualization capabilities
among other things.
Currently we use a set of firmware calls that simulate the old
"XICS" interrupt controller but this is fairly inefficient.
This adds the framework for using XIVE along with a native
backend which OPAL for configuration. Later, a backend allowing
the use in a KVM or PowerVM guest will also be provided.
This disables some fast path for interrupts in KVM when XIVE is
enabled as these rely on the firmware emulation code which is no
longer available when the XIVE is used natively by Linux.
A latter patch will make KVM also directly exploit the XIVE, thus
recovering the lost performance (and more).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fixup pr_xxx("XIVE:"...), don't split pr_xxx() strings,
tweak Kconfig so XIVE_NATIVE selects XIVE and depends on POWERNV,
fix build errors when SMP=n, fold in fixes from Ben:
Don't call cpu_online() on an invalid CPU number
Fix irq target selection returning out of bounds cpu#
Extra sanity checks on cpu numbers
]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some powerpc platforms use this to move IRQs away from a CPU being
unplugged. This function has several bugs such as not taking the right
locks or failing to NULL check pointers.
There's a new generic function doing exactly the same thing without all
the bugs, so let's use it instead.
mpe: The obvious place for the select of GENERIC_IRQ_MIGRATION is on
HOTPLUG_CPU, but that doesn't work. On some configs PM_SLEEP_SMP will
select HOTPLUG_CPU even though its dependencies are not met, which means
the select of GENERIC_IRQ_MIGRATION doesn't happen. That leads to the
build breaking. Fix it by moving the select of GENERIC_IRQ_MIGRATION to
SMP.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some platforms (will) need to perform allocations before bringing
a new CPU online. Doing it from smp_ops->setup_cpu is the wrong
thing to do:
- It has no useful failure path (too late)
- Calling any allocator will enable interrupts prematurely
causing problems with large decrementer among others
Instead, add a new callback that is called from __cpu_up (so from
the context trying to online the new CPU) at a point where we
can safely allocate and handle failures.
This will be used by XIVE support.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
New versions of OPAL have a device node /ibm,opal/firmware/exports, each
property of which describes a range of memory in OPAL that Linux might
want to export to userspace for debugging.
This patch adds a sysfs file under 'opal/exports' for each property
found there, and makes it read-only by root.
Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
[mpe: Drop counting of props, rename to attr, free on sysfs error, c'log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When booting very large systems with a large initrd, we run out of
space early in boot for either RTAS or the flattened device tree (FDT).
Boot fails with messages like:
Could not allocate memory for RTAS
or
No memory for flatten_device_tree (no room)
Increasing the minimum RMA size to 512MB fixes the problem. This
should not have an impact on smaller LPARs (with 256MB memory),
as the firmware will cap the RMA to the memory assigned to the LPAR.
Fix is based on input/discussions with Michael Ellerman. Thanks to
Praveen K. Pandey for testing on a large system.
Reported-by: Praveen K. Pandey <preveen.pandey@in.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nvlink2 supports address translation services (ATS) allowing devices
to request address translations from an mmu known as the nest MMU
which is setup to walk the CPU page tables.
To access this functionality certain firmware calls are required to
setup and manage hardware context tables in the nvlink processing unit
(NPU). The NPU also manages forwarding of TLB invalidates (known as
address translation shootdowns/ATSDs) to attached devices.
This patch exports several methods to allow device drivers to register
a process id (PASID/PID) in the hardware tables and to receive
notification of when a device should stop issuing address translation
requests (ATRs). It also adds a fault handler to allow device drivers
to demand fault pages in.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
[mpe: Fix up comment formatting, use flush_tlb_mm()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The pnv_pci_get_{gpu|npu}_dev functions are used to find associations
between nvlink PCIe devices and standard PCIe devices. However they
lacked basic sanity checking which results in NULL pointer
dereferencing if they are incorrect called can be harder to spot than
an explicit WARN_ON.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is of_property_read_u32_index but no u64 variant. This patch
adds one similar to the u32 version for u64.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The code to fix the problem it describes was removed in commit
c40785ad30 ("powerpc/dart: Use a cachable DART"), and it uses the
stupid comment style. Away it goooooooooooooes!
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Early on in do_page_fault() we call store_updates_sp(), regardless of
the type of exception. For an instruction miss this doesn't make
sense, because we only use this information to detect if a data miss
is the result of a stack expansion instruction or not.
Worse still, it results in a data miss within every userspace
instruction miss handler, because we try and load the very instruction
we are about to install a pte for!
A simple exec microbenchmark runs 6% faster on POWER8 with this fix:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
unsigned long left = atol(argv[1]);
char leftstr[16];
if (left-- == 0)
return 0;
sprintf(leftstr, "%ld", left);
execlp(argv[0], argv[0], leftstr, NULL);
perror("exec failed\n");
return 0;
}
Pass the number of iterations on the command line (eg 10000) and time
how long it takes to execute.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For an MCE (Machine Check Exception) that hits while in user mode
MSR(PR=1), print the task info to the console MCE error log. This may
help to identify an application that triggered the MCE.
After this patch the MCE console looks like:
Severe Machine check interrupt [Recovered]
NIP: [0000000010039778] PID: 762 Comm: ebizzy
Initiator: CPU
Error type: SLB [Multihit]
Effective address: 0000000010039778
Severe Machine check interrupt [Not recovered]
NIP: [0000000010039778] PID: 763 Comm: ebizzy
Initiator: CPU
Error type: UE [Page table walk ifetch]
Effective address: 0000000010039778
ebizzy[763]: unhandled signal 7 at 0000000010039778 nip 0000000010039778 lr 0000000010001b44 code 30004
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For D-side errors we print the load/store address that caused the
machine check as 'Effective address'. But the instruction that may have
caused the machine check can also be helpful, so in addition to printing
the NIP, also print the kernel function name as well.
After this patch the MCE console log would look like:
Severe Machine check interrupt [Recovered]
NIP [d00000001bc70194]: init_module+0x194/0x2b0 [bork_kernel]
Initiator: CPU
Error type: SLB [Parity]
Effective address: d000000026de0000
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Not all user space application is ready to handle wide addresses. It's
known that at least some JIT compilers use higher bits in pointers to
encode their information. It collides with valid pointers with 512TB
addresses and leads to crashes.
To mitigate this, we are not going to allocate virtual address space
above 128TB by default.
But userspace can ask for allocation from full address space by
specifying hint address (with or without MAP_FIXED) above 128TB.
If hint address set above 128TB, but MAP_FIXED is not specified, we try
to look for unmapped area by specified address. If it's already
occupied, we look for unmapped area in *full* address space, rather than
from 128TB window.
This approach helps to easily make application's memory allocator aware
about large address space without manually tracking allocated virtual
address space.
This is going to be a per mmap decision. ie, we can have some mmaps with
larger addresses and other that do not.
A sample memory layout looks like:
10000000-10010000 r-xp 00000000 fc:00 9057045 /home/max_addr_512TB
10010000-10020000 r--p 00000000 fc:00 9057045 /home/max_addr_512TB
10020000-10030000 rw-p 00010000 fc:00 9057045 /home/max_addr_512TB
10029630000-10029660000 rw-p 00000000 00:00 0 [heap]
7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0
7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so
7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so
7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so
7fff83690000-7fff836a0000 rw-p 00000000 00:00 0
7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0 [vdso]
7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so
7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so
7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so
7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0 [stack]
1000000000000-1000000010000 rw-p 00000000 00:00 0
1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that we use all the available virtual address range, we need to make
sure we don't generate VSID such that it overlaps with the reserved vsid
range. Reserved vsid range include the virtual address range used by the
adjunct partition and also the VRMA virtual segment. We find the context
value that can result in generating such a VSID and reserve it early in
boot.
We don't look at the adjunct range, because for now we disable the
adjunct usage in a Linux LPAR via CAS interface.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rewrite hash__reserve_context_id(), move the rest into pseries]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We optmize the slice page size array copy to paca by copying only the
range based on addr_limit. This will require us to not look at page size
array beyond addr_limit in PACA on slb fault. To enable that copy task
size to paca which will be used during slb fault.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Rename from task_size to addr_limit, consolidate #ifdefs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the followup patch, we will increase the slice array size to handle
512TB range, but will limit the max addr to 128TB. Avoid doing
unnecessary computation and avoid doing slice mask related operation
above address limit.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We update the hash linux page table layout such that we can support
512TB. But we limit the TASK_SIZE to 128TB. We can switch to 128TB by
default without conditional because that is the max virtual address
supported by other architectures. We will later add a mechanism to
on-demand increase the application's effective address range to 512TB.
Having the page table layout changed to accommodate 512TB makes testing
large memory configuration easier with less code changes to kernel
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This doesn't have any functional change. But helps in avoiding mistakes
in case the shift bit changes
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Inorder to support large effective address range (512TB), we want to
increase the virtual address bits to 68. But we do have platforms like
p4 and p5 that can only do 65 bit VA. We support those platforms by
limiting context bits on them to 16.
The protovsid -> vsid conversion is verified to work with both 65 and 68
bit va values. I also documented the restrictions in a table format as
part of code comments.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
get_kernel_vsid() has a very stern comment saying that it's only valid
for kernel addresses, but there's nothing in the code to enforce that.
Rather than hoping our callers are well behaved, add a check and return
a VSID of 0 (invalid).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we use the top 4 context ids (0x7fffc-0x7ffff) for the kernel.
Kernel VSIDs are built using these top context values and effective the
segement ID. In subsequent patches we want to increase the max effective
address to 512TB. We will achieve that by increasing the effective
segment IDs there by increasing virtual address range.
We will be switching to a 68bit virtual address in the following patch.
But platforms like Power4 and Power5 only support a 65 bit virtual
address. We will handle that by limiting the context bits to 16 instead
of 19 on those platforms. That means the max context id will have a
different value on different platforms.
So that we don't have to deal with the kernel context ids changing
between different platforms, move the kernel context ids down to use
context ids 1-4.
We can't use segment 0 of context-id 0, because that maps to VSID 0,
which we want to keep as invalid, so we avoid context-id 0 entirely.
Similarly we can't use the last segment of the maximum context, so we
avoid it too.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Switch from 0-3 to 1-4 so VSID=0 remains invalid]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Complete the split of the radix vs hash mm context initialisation.
This is mostly code movement, with the exception that we now limit the
context allocation to PRTB_ENTRIES - 1 on radix.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The min and max context id values used in alloc_context_id() are
currently the right values for use on hash, and happen to also be safe
for use on radix.
But we need to change that in a subsequent patch, so make the min/max
ids parameters and pull the hash values into hsah__alloc_context_id().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
KVM wants to be able to allocate an MMU context id, which it does
currently by calling __init_new_context().
We're about to rework that code, so provide a wrapper for KVM so it
can not worry about the details.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We now get output like below which is much better.
[ 0.935306] good_mask low_slice: 0-15
[ 0.935360] good_mask high_slice: 0-511
Compared to
[ 0.953414] good_mask:1111111111111111 - 1111111111111.........
I also fixed an error with slice_dbg printing.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This structure definition need not be in a header since this is used only by
slice.c file. So move it to slice.c. This also allow us to use SLICE_NUM_HIGH
instead of 64.
I also switch the low_slices type to u64 from u16. This doesn't have an impact
on size of struct due to padding added with u16 type. This helps in using
bitmap printing function for printing slice mask.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove the checks that TASK_SIZE_USER64 is smaller than H_PGTABLE_RANGE
and USER_VSID_RANGE.
In a following patch we will deliberately add support for a TASK_SIZE
smaller than both ranges, so this will no longer be an error condition.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Keep the check in pgtable_64.c that we don't exceed USER_VSID_RANGE]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We also update the function arg to struct mm_struct. Move this so that function
finds the definition of struct mm_struct. No functional change in this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This avoid copying the slice_mask struct as function return value
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In followup patch we want to increase the va range which will result
in us requiring high_slices to have more than 64 bits. To enable this
convert high_slices to bitmap. We keep the number bits same in this patch
and later change that to higher value
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Fold in fix to use bitmap_empty()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We don't support the full 57 bits of physical address and hence can
overload the top bits of RPN as hash specific pte bits.
Add a BUILD_BUG_ON() to enforce the relationship between H_PAGE_F_SECOND
and H_PAGE_F_GIX.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Move the BUILD_BUG_ON() into hash_utils_64.c and comment it]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Max value supported by hardware is 51 bits address. Radix page table define
a slot of 57 bits for future expansion. We restrict the value supported in
linux kernel 53 bits, so that we can use the bits between 57-53 for storing
hash linux page table bits. This is done in the next patch.
This will free up the software page table bits to be used for features
that are needed for both hash and radix. The current hash linux page table
format doesn't have any free software bits. Moving hash linux page table
specific bits to top of RPN field free up the software bits for other purpose.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Conditional PTE bit definition is confusing and results in coding error.
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Without this if firmware reports 1MB page size support we will crash
trying to use 1MB as hugetlb page size.
echo 300 > /sys/kernel/mm/hugepages/hugepages-1024kB/nr_hugepages
kernel BUG at ./arch/powerpc/include/asm/hugetlb.h:19!
.....
....
[c0000000e2c27b30] c00000000029dae8 .hugetlb_fault+0x638/0xda0
[c0000000e2c27c30] c00000000026fb64 .handle_mm_fault+0x844/0x1d70
[c0000000e2c27d70] c00000000004805c .do_page_fault+0x3dc/0x7c0
[c0000000e2c27e30] c00000000000ac98 handle_page_fault+0x10/0x30
With fix, we don't enable 1MB as hugepage size.
bash-4.2# cd /sys/kernel/mm/hugepages/
bash-4.2# ls
hugepages-16384kB hugepages-16777216kB
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With this we have on powernv and pseries /proc/cpuinfo reporting
timebase : 512000000
platform : PowerNV
model : 8247-22L
machine : PowerNV 8247-22L
firmware : OPAL
MMU : Hash
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This bit is only used by radix and it is nice to follow the naming style of having
bit name start with H_/R_ depending on which translation mode they are used.
No functional change in this patch.
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Define everything based on bits present in pgtable.h. This will help in easily
identifying overlapping bits between hash/radix.
No functional change with this patch.
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For low slice, max addr should be less than 4G. Without limiting this correctly
we will end up with a low slice mask which has 17th bit set. This is not
a problem with the current code because our low slice mask is of type u16. But
in later patch I am switching low slice mask to u64 type and having the 17bit
set result in wrong slice mask which in turn results in mmap failures.
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
BOOKE code is dead code as per the Kconfig details. So make it simpler
by enabling MM_SLICE only for book3s_64. The changes w.r.t nohash is just
removing deadcode. W.r.t ppc64, 4k without hugetlb will now enable MM_SLICE.
But that is good, because we reduce one extra variant which probably is not
getting tested much.
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
sam440ep_setup_rtc() is just called by machine_device_initcall() so make
it __init.
Signed-off-by: Yang Shi <yang.shi@windriver.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With the unnecessary restriction to reserve memory for fadump at the
top of RAM forgone, update the documentation accordingly.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, the area to preserve boot memory is reserved at the top of
RAM. This leaves fadump vulnerable to memory hot-remove operations. As
memory for fadump has to be reserved early in the boot process, fadump
can't be registered after a memory hot-remove operation. Though this
problem can't be eleminated completely, the impact can be minimized by
reserving memory at an offset closer to bottom of the RAM. The offset
for fadump memory reservation can be any value greater than fadump boot
memory size.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
OPAL returns OPAL_WRONG_STATE upon failing to provide sensor data due to
core sleeping/offline. Add a check in opal_get_sensor_data() for sensor
read failure with OPAL_WRONG_STATE return code and return -EIO.
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
So far iommu_table obejcts were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.
This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_tce_table_put() and makes
iommu_free_table() static. iommu_tce_table_get() is not used in this patch
but it will be in the following patch.
Since this touches prototypes, this also removes @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment iommu_table can be disposed by either calling
iommu_table_free() directly or it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_table_free() anyway.
As we are going to have reference counting on tables, we need an unified
way of disposing tables.
This moves it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.
As from now on the iommu_free_table() calls it_ops->free(), we need
to have it_ops initialized before calling iommu_free_table() so this
moves this initialization in pnv_pci_ioda2_create_table().
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>