We worked around some nasty KVM magic page hcall breakages:
1) NX bit not honored, so ignore NX when we detect it
2) LE guests swizzle hypercall instruction
Without these fixes in place, there's no way it would make sense to expose kvm
hypercalls to a guest. Chances are immensely high it would trip over and break.
So add a new CAP that gives user space a hint that we have workarounds for the
bugs above in place. It can use those as hint to disable PV hypercalls when
the guest CPU is anything POWER7 or higher and the host does not have fixes
in place.
Signed-off-by: Alexander Graf <agraf@suse.de>
When we reset the in-kernel MPIC controller, we forget to reset some hidden
state such as destmask and output. This state is usually set when the guest
writes to the IDR register for a specific IRQ line.
To make sure we stay in sync and don't forget hidden state, treat reset of
the IDR register as a simple write of the IDR register. That automatically
updates all the hidden state as well.
Reported-by: Paul Janzen <pcj@pauljanzen.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
There are LE Linux guests out there that don't handle hypercalls correctly.
Instead of interpreting the instruction stream from device tree as big endian
they assume it's a little endian instruction stream and fail.
When we see an illegal instruction from such a byte reversed instruction stream,
bail out graciously and just declare every hcall as error.
Signed-off-by: Alexander Graf <agraf@suse.de>
We get an array of instructions from the hypervisor via device tree that
we write into a buffer that gets executed whenever we want to make an
ePAPR compliant hypercall.
However, the hypervisor passes us these instructions in BE order which
we have to manually convert to LE when we want to run them in LE mode.
With this fixup in place, I can successfully run LE kernels with KVM
PV enabled on PR KVM.
Signed-off-by: Alexander Graf <agraf@suse.de>
Use make_dsisr instead of open coding it. This also have
the added benefit of handling alignment interrupt on additional
instructions.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Although it's optional, IBM POWER cpus always had DAR value set on
alignment interrupt. So don't try to compute these values.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Because old kernels enable the magic page and then choke on NXed trampoline
code we have to disable NX by default in KVM when we use the magic page.
However, since commit b18db0b8 we have successfully fixed that and can now
leave NX enabled, so tell the hypervisor about this.
Signed-off-by: Alexander Graf <agraf@suse.de>
Old guests try to use the magic page, but map their trampoline code inside
of an NX region.
Since we can't fix those old kernels, try to detect whether the guest is sane
or not. If not, just disable NX functionality in KVM so that old guests at
least work at all. For newer guests, add a bit that we can set to keep NX
functionality available.
Signed-off-by: Alexander Graf <agraf@suse.de>
On recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size. Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page. This
capability is called mixed page-size segment (MPSS). With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry. [ Note that the actual page size is
always >= base page size ].
We use "ibm,segment-page-sizes" device tree node to advertise
the MPSS support to PAPR guest. The penc encoding indicates whether
we support a specific combination of base page size and actual
page size in the same segment. We also use the penc value in the
LP encoding of HPTE entry.
This patch exposes MPSS support to KVM guest by advertising the
feature via "ibm,segment-page-sizes". It also adds the necessary changes
to decode the base page size and the actual page size correctly from the
HPTE entry.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Today when KVM tries to reserve memory for the hash page table it
allocates from the normal page allocator first. If that fails it
falls back to CMA's reserved region. One of the side effects of
this is that we could end up exhausting the page allocator and
get linux into OOM conditions while we still have plenty of space
available in CMA.
This patch addresses this issue by first trying hash page table
allocation from CMA's reserved region before falling back to the normal
page allocator. So if we run out of memory, we really are out of memory.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER8 introduces transactional memory which brings along a number of new
registers and MSR bits.
Implementing all of those is a pretty big headache, so for now let's at least
emulate enough to make Linux's context switching code happy.
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER8 introduces a new facility called the "Event Based Branch" facility.
It contains of a few registers that indicate where a guest should branch to
when a defined event occurs and it's in PR mode.
We don't want to really enable EBB as it will create a big mess with !PR guest
mode while hardware is in PR and we don't really emulate the PMU anyway.
So instead, let's just leave it at emulation of all its registers.
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER8 implements a new register called TAR. This register has to be
enabled in FSCR and then from KVM's point of view is mere storage.
This patch enables the guest to use TAR.
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
which contains its status message in a new register called FSCR.
Handle these exits and try to emulate instructions for unhandled facilities.
Follow-on patches enable KVM to expose specific facilities into the guest.
Signed-off-by: Alexander Graf <agraf@suse.de>
In parallel to the Processor ID Register (PIR) threaded POWER8 also adds a
Thread ID Register (TIR). Since PR KVM doesn't emulate more than one thread
per core, we can just always expose 0 here.
Signed-off-by: Alexander Graf <agraf@suse.de>
When we expose a POWER8 CPU into the guest, it will start accessing PMU SPRs
that we don't emulate. Just ignore accesses to them.
Signed-off-by: Alexander Graf <agraf@suse.de>
With the previous patches applied, we can now successfully use PR KVM on
little endian hosts which means we can now allow users to select it.
However, HV KVM still needs some work, so let's keep the kconfig conflict
on that one.
Signed-off-by: Alexander Graf <agraf@suse.de>
When the host CPU we're running on doesn't support dcbz32 itself, but the
guest wants to have dcbz only clear 32 bytes of data, we loop through every
executable mapped page to search for dcbz instructions and patch them with
a special privileged instruction that we emulate as dcbz32.
The only guests that want to see dcbz act as 32byte are book3s_32 guests, so
we don't have to worry about little endian instruction ordering. So let's
just always search for big endian dcbz instructions, also when we're on a
little endian host.
Signed-off-by: Alexander Graf <agraf@suse.de>
The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.
When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.
Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.
For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.
For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.
Signed-off-by: Alexander Graf <agraf@suse.de>
We expose a blob of hypercall instructions to user space that it gives to
the guest via device tree again. That blob should contain a stream of
instructions necessary to do a hypercall in big endian, as it just gets
passed into the guest and old guests use them straight away.
Signed-off-by: Alexander Graf <agraf@suse.de>
When the guest does an RTAS hypercall it keeps all RTAS variables inside a
big endian data structure.
To make sure we don't have to bother about endianness inside the actual RTAS
handlers, let's just convert the whole structure to host endian before we
call our RTAS handlers and back to big endian when we return to the guest.
Signed-off-by: Alexander Graf <agraf@suse.de>
The HTAB on PPC is always in big endian. When we access it via hypercalls
on behalf of the guest and we're running on a little endian host, we need
to make sure we swap the bits accordingly.
Signed-off-by: Alexander Graf <agraf@suse.de>
The default MSR when user space does not define anything should be identical
on little and big endian hosts, so remove MSR_LE from it.
Signed-off-by: Alexander Graf <agraf@suse.de>
The "shadow SLB" in the PACA is shared with the hypervisor, so it has to
be big endian. We access the shadow SLB during world switch, so let's make
sure we access it in big endian even when we're on a little endian host.
Signed-off-by: Alexander Graf <agraf@suse.de>
The HTAB is always big endian. We access the guest's HTAB using
copy_from/to_user, but don't yet take care of the fact that we might
be running on an LE host.
Wrap all accesses to the guest HTAB with big endian accessors.
Signed-off-by: Alexander Graf <agraf@suse.de>
The HTAB is always big endian. We access the guest's HTAB using
copy_from/to_user, but don't yet take care of the fact that we might
be running on an LE host.
Wrap all accesses to the guest HTAB with big endian accessors.
Signed-off-by: Alexander Graf <agraf@suse.de>
Commit 9308ab8e2d made C/R HTAB updates go byte-wise into the target HTAB.
However, it didn't update the guest's copy of the HTAB, but instead the
host local copy of it.
Write to the guest's HTAB instead.
Signed-off-by: Alexander Graf <agraf@suse.de>
CC: Paul Mackerras <paulus@samba.org>
Acked-by: Paul Mackerras <paulus@samba.org>
This patch make sure we inherit the LE bit correctly in different case
so that we can run Little Endian distro in PR mode
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
The dcbtls instruction is able to lock data inside the L1 cache.
We don't want to give the guest actual access to hardware cache locks,
as that could influence other VMs on the same system. But we can tell
the guest that its locking attempt failed.
By implementing the instruction we at least don't give the guest a
program exception which it definitely does not expect.
Signed-off-by: Alexander Graf <agraf@suse.de>
The L1 instruction cache control register contains bits that indicate
that we're still handling a request. Mask those out when we set the SPR
so that a read doesn't assume we're still doing something.
Signed-off-by: Alexander Graf <agraf@suse.de>
It hangs the hardware.
Signed-off-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Cc: stable@vger.kernel.org
ACPI Battery device receives notifications from firmware frequently,
and most of these notifications are some general events, like battery
remaining capacity change, etc, which should not wake the system up
if the system is in suspend/hibernate state.
This causes a problem that the system wakes up from suspend to freeze
shortly, because there is an ACPI battery notification every 10 seconds.
Fix the problem in this patch by registering ACPI battery device'
own wakeup source, and waking up the system only when the battery remaining
capacity is critical low, or lower than the alarm capacity set via _BTP.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=76221
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, all the power supply devices are registered with wakeup source,
this results in that every power_supply_changed() invocation brings
the system out of suspend-to-freeze state.
This is overkill as some device drivers, e.g. ACPI battery driver,
have the ability to check the device status and wake up the system
from sleeping only when necessary.
Thus introduce a new API which allows device to be registered
w/o wakeup source.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
ACPI battery device receives notifications when
1. the remaining battery capacity becomes critical low
2. the trip point set by the _BTP (Design capacity of Warning by default)
is reached or crossed.
So it is able to support POWER_SUPPLY_PROP_CAPACITY_LEVEL to report
POWER_SUPPLY_CAPACITY_LEVEL_CRITICAL,
POWER_SUPPLY_CAPACITY_LEVEL_LOW,
POWER_SUPPLY_CAPACITY_LEVEL_NORMAL,
POWER_SUPPLY_CAPACITY_LEVEL_FULL,
capacity levels to power supply core and user space.
Introduce support for POWER_SUPPLY_PROP_CAPACITY_LEVEL in this patch.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When enabling a device' wakeup capability, a wakeup source
is created for the device automatically. But the wakeup source
is not unregistered when disabling the device' wakeup capability.
This results in zombie wakeup sources, after devices/drivers are unregistered.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The kfree() function already NULL checks the parameter so remove the
redundant NULL checks before kfree() calls in arch/mips/kvm/.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Sanjay Lal <sanjayl@kymasys.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The logging from MIPS KVM is fairly noisy with kvm_info() in places
where it shouldn't be, such as on VM creation and migration to a
different CPU. Replace these kvm_info() calls with kvm_debug().
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Sanjay Lal <sanjayl@kymasys.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_debug() uses pr_debug() which is already compiled out in the absence
of a DEBUG define, so remove the unnecessary ifdef DEBUG lines around
kvm_debug() calls which are littered around arch/mips/kvm/.
As well as generally cleaning up, this prevents future bit-rot due to
DEBUG not being commonly used.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Sanjay Lal <sanjayl@kymasys.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix build errors when DEBUG is defined in arch/mips/kvm/.
- The DEBUG code in kvm_mips_handle_tlbmod() was missing some variables.
- The DEBUG code in kvm_mips_host_tlb_write() was conditional on an
undefined "debug" variable.
- The DEBUG code in kvm_mips_host_tlb_inv() accessed asid_map directly
rather than using kvm_mips_get_user_asid(). Also fixed brace
placement.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Sanjay Lal <sanjayl@kymasys.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm_mips_comparecount_func() and kvm_mips_comparecount_wakeup()
functions are only used within arch/mips/kvm/kvm_mips.c, so make them
static.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: Sanjay Lal <sanjayl@kymasys.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose the KVM guest CP0_Count frequency to userland via a new
KVM_REG_MIPS_COUNT_HZ register accessible with the KVM_{GET,SET}_ONE_REG
ioctls.
When the frequency is altered the bias is adjusted such that the guest
CP0_Count doesn't jump discontinuously or lose any timer interrupts.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: David Daney <david.daney@cavium.com>
Cc: Sanjay Lal <sanjayl@kymasys.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose two new virtual registers to userland via the
KVM_{GET,SET}_ONE_REG ioctls.
KVM_REG_MIPS_COUNT_CTL is for timer configuration fields and just
contains a master disable count bit. This can be used by userland to
freeze the timer in order to read a consistent state from the timer
count value and timer interrupt pending bit. This cannot be done with
the CP0_Cause.DC bit because the timer interrupt pending bit (TI) is
also in CP0_Cause so it would be impossible to stop the timer without
also risking a race with an hrtimer interrupt and having to explicitly
check whether an interrupt should have occurred.
When the timer is re-enabled it resumes without losing time, i.e. the
CP0_Count value jumps to what it would have been had the timer not been
disabled, which would also be impossible to do from userland with
CP0_Cause.DC. The timer interrupt also cannot be lost, i.e. if a timer
interrupt would have occurred had the timer not been disabled it is
queued when the timer is re-enabled.
This works by storing the nanosecond monotonic time when the master
disable is set, and using it for various operations instead of the
current monotonic time (e.g. when recalculating the bias when the
CP0_Count is set), until the master disable is cleared again, i.e. the
timer state is read/written as it would have been at that time. This
state is exposed to userland via the read-only KVM_REG_MIPS_COUNT_RESUME
virtual register so that userland can determine the exact time the
master disable took effect.
This should allow userland to atomically save the state of the timer,
and later restore it.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: David Daney <david.daney@cavium.com>
Cc: Sanjay Lal <sanjayl@kymasys.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM_HOST_FREQ Kconfig symbol was used by KVM guest kernels to
override the timer frequency calculation to a value based on the host
frequency. Now that the KVM timer emulation is implemented independent
of the host timer frequency and defaults to 100MHz, adjust the working
of CONFIG_KVM_HOST_FREQ to match.
The Kconfig symbol now specifies the guest timer frequency directly, and
has been renamed accordingly to KVM_GUEST_TIMER_FREQ. It now defaults to
100MHz too and the help text is updated to make it clear that a zero
value will allow the normal timer frequency calculation to take place
(based on the emulated RTC).
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: Sanjay Lal <sanjayl@kymasys.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>