Dynamically assign host PIDs to guest PIDs, splitting each guest PID into
multiple host (shadow) PIDs based on kernel/user and MSR[IS/DS]. Use
both PID0 and PID1 so that the shadow PIDs for the right mode can be
selected, that correspond both to guest TID = zero and guest TID = guest
PID.
This allows us to significantly reduce the frequency of needing to
invalidate the entire TLB. When the guest mode or PID changes, we just
update the host PID0/PID1. And since the allocation of shadow PIDs is
global, multiple guests can share the TLB without conflict.
Note that KVM does not yet support the guest setting PID1 or PID2 to
a value other than zero. This will need to be fixed for nested KVM
to work. Until then, we enforce the requirement for guest PID1/PID2
to stay zero by failing the emulation if the guest tries to set them
to something else.
Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Instead of a fully separate set of TLB entries, keep just the
pfn and dirty status.
Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This is a shared page used for paravirtualization. It is always present
in the guest kernel's effective address space at the address indicated
by the hypercall that enables it.
The physical address specified by the hypercall is not used, as
e500 does not have real mode.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This allows large pages to be used on guest mappings backed by things like
/dev/mem, resulting in a significant speedup when guest memory
is mapped this way (it's useful for directly-assigned MMIO, too).
This is not a substitute for hugetlbfs integration, but is useful for
configurations where devices are directly assigned on chips without an
IOMMU -- in these cases, we need guest physical and true physical to
match, and be contiguous, so static reservation and mapping via /dev/mem
is the most straightforward way to set things up.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This is in line with what other architectures do, and will allow us to
map things other than ordinary, unreserved kernel pages -- such as
dedicated devices, or large contiguous reserved regions.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This avoids races. It also means that we use the shadow TLB way,
rather than the hardware hint -- if this is a problem, we could do
a tlbsx before inserting a TLB0 entry.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Since TLB1 loading doesn't check the shadow TLB before allocating another
entry, you can get duplicates.
Once shadow PIDs are enabled in a later patch, we won't need to
invalidate the TLB on every switch, so this optimization won't be
needed anyway.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This is done lazily. The SPE save will be done only if the guest has
used SPE since the last preemption or heavyweight exit. Restore will be
done only on demand, when enabling MSR_SPE in the shadow MSR, in response
to an SPE fault or mtmsr emulation.
For SPEFSCR, Linux already switches it on context switch (non-lazily), so
the only remaining bit is to save it between qemu and the guest.
Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Keep the guest MSR and the guest-mode true MSR separate, rather than
modifying the guest MSR on each guest entry to produce a true MSR.
Any bits which should be modified based on guest MSR must be explicitly
propagated from vcpu->arch.shared->msr to vcpu->arch.shadow_msr in
kvmppc_set_msr().
While we're modifying the guest entry code, reorder a few instructions
to bury some load latencies.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Up until now, Book3S KVM had variables stored in the kernel that a kernel module
or the kvm code in the kernel could read from to figure out where some real mode
helper functions are located.
This is all unnecessary. The high bits of the EA get ignore in real mode, so we
can just use the pointer as is. Also, it's a lot easier on relocations when we
use the normal way of resolving the address to a function, instead of jumping
through hoops.
This patch fixes compilation with CONFIG_RELOCATABLE=y.
Signed-off-by: Alexander Graf <agraf@suse.de>
When http://www.spinics.net/lists/kvm-ppc/msg02664.html
was applied to produce commit b51e7aa7ed6d8d134d02df78300ab0f91cfff4d2,
the removal of the conversion in add_exit_timing was left out.
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
* 'kvm-updates/2.6.40' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (131 commits)
KVM: MMU: Use ptep_user for cmpxchg_gpte()
KVM: Fix kvm mmu_notifier initialization order
KVM: Add documentation for KVM_CAP_NR_VCPUS
KVM: make guest mode entry to be rcu quiescent state
KVM: x86 emulator: Make jmp far emulation into a separate function
KVM: x86 emulator: Rename emulate_grpX() to em_grpX()
KVM: x86 emulator: Remove unused arg from emulate_pop()
KVM: x86 emulator: Remove unused arg from writeback()
KVM: x86 emulator: Remove unused arg from read_descriptor()
KVM: x86 emulator: Remove unused arg from seg_override()
KVM: Validate userspace_addr of memslot when registered
KVM: MMU: Clean up gpte reading with copy_from_user()
KVM: PPC: booke: add sregs support
KVM: PPC: booke: save/restore VRSAVE (a.k.a. USPRG0)
KVM: PPC: use ticks, not usecs, for exit timing
KVM: PPC: fix exit accounting for SPRs, tlbwe, tlbsx
KVM: PPC: e500: emulate SVR
KVM: VMX: Cache vmcs segment fields
KVM: x86 emulator: consolidate segment accessors
KVM: VMX: Avoid reading %rip unnecessarily when handling exceptions
...
Linux doesn't use USPRG0 (now renamed VRSAVE in the architecture, even
when Altivec isn't involved), but a guest might.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Convert to microseconds when displaying
(with fix from Bharat Bhushan <Bharat.Bhushan@freescale.com>).
This reduces rounding error with large quantities of short exits.
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
The exit type setting for mfspr/mtspr is moved from 44x to toplevel SPR
emulation. This enables it on e500, and makes sure that all SPRs
are covered.
Exit accounting for tlbwe and tlbsx is added to e500.
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Return the actual host SVR for now, as we already do for PVR. Eventually
we may support Qemu overriding PVR/SVR if the situation is appropriate,
once we implement KVM_SET_SREGS on e500.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Commits a5d4f3ad3a ("powerpc: Base support for exceptions using
HSRR0/1") and 673b189a2e ("powerpc: Always use SPRN_SPRG_HSCRATCH0
when running in HV mode") cause compile and link errors for 32-bit
classic Book 3S processors when KVM is enabled. This fixes these
errors.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The vcpu->arch.pending_exceptions field is a bitfield indexed by
interrupt priority number as returned by kvmppc_book3s_vec2irqprio.
However, kvmppc_core_pending_dec was using an interrupt vector shifted
by 7 as the bit index. Fix it to use the irqprio value for the
decrementer interrupt instead. This problem was found by code
inspection.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Following dump is observed on host when clearing the exit timing counters
[root@p1021mds kvm]# echo -n 'c' > vm1200_vcpu0_timing
INFO: task echo:1276 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
echo D 0ff5bf94 0 1276 1190 0x00000000
Call Trace:
[c2157e40] [c0007908] __switch_to+0x9c/0xc4
[c2157e50] [c040293c] schedule+0x1b4/0x3bc
[c2157e90] [c04032dc] __mutex_lock_slowpath+0x74/0xc0
[c2157ec0] [c00369e4] kvmppc_init_timing_stats+0x20/0xb8
[c2157ed0] [c0036b00] kvmppc_exit_timing_write+0x84/0x98
[c2157ef0] [c00b9f90] vfs_write+0xc0/0x16c
[c2157f10] [c00ba284] sys_write+0x4c/0x90
[c2157f40] [c000e320] ret_from_syscall+0x0/0x3c
The vcpu->mutex is used by kvm_ioctl_* (KVM_RUN etc) and same was
used when clearing the stats (in kvmppc_init_timing_stats()). What happens
is that when the guest is idle then it held the vcpu->mutx. While the
exiting timing process waits for guest to release the vcpu->mutex and
a hang state is reached.
Now using seprate lock for exit timing stats.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
This uses feature sections to arrange that we always use HSPRG1
as the scratch register in the interrupt entry code rather than
SPRG2 when we're running in hypervisor mode on POWER7. This will
ensure that we don't trash the guest's SPRG2 when we are running
KVM guests. To simplify the code, we define GET_SCRATCH0() and
SET_SCRATCH0() macros like the GET_PACA/SET_PACA macros.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pass the register type to the prolog, also provides alternate "HV"
version of hardware interrupt (0x500) and adjust LPES accordingly
We tag those interrupts by setting bit 0x2 in the trap number
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When running in Hypervisor mode (arch 2.06 or later), we store the PACA
in HSPRG0 instead of SPRG1. The architecture specifies that SPRGs may be
lost during a "nap" power management operation (though they aren't
currently on POWER7) and this enables use of SPRG1 by KVM guests.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Previously SPRGs 4-7 were improperly read and written in
kvm_arch_vcpu_ioctl_get_regs() and kvm_arch_vcpu_ioctl_set_regs();
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Peter Tyser <ptyser@xes-inc.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
IA64 support forces us to abstract the allocation of the kvm structure.
But instead of mixing this up with arch-specific initialization and
doing the same on destruction, split both steps. This allows to move
generic destruction calls into generic code.
It also fixes error clean-up on failures of kvm_create_vm for IA64.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This was preventing the guest from setting any bits in the
hardware MSR which aren't forced on, such as MSR[SPE].
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
It is not legal to call mutex_lock() with interrupts disabled.
This will assert with debug checks enabled.
If there's a real need to disable interrupts here, it could be done
after the mutex is acquired -- but I don't see why it's needed at all.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
The VCPU uninit calls some TLB functions, and the TLB uninit function
frees the memory used by them.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Acked-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Structure kvm_ppc_pvinfo is copied to userland with flags and
pad fields unitialized. It leads to leaking of contents of
kernel stack memory.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (321 commits)
KVM: Drop CONFIG_DMAR dependency around kvm_iommu_map_pages
KVM: Fix signature of kvm_iommu_map_pages stub
KVM: MCE: Send SRAR SIGBUS directly
KVM: MCE: Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED
KVM: fix typo in copyright notice
KVM: Disable interrupts around get_kernel_ns()
KVM: MMU: Avoid sign extension in mmu_alloc_direct_roots() pae root address
KVM: MMU: move access code parsing to FNAME(walk_addr) function
KVM: MMU: audit: check whether have unsync sps after root sync
KVM: MMU: audit: introduce audit_printk to cleanup audit code
KVM: MMU: audit: unregister audit tracepoints before module unloaded
KVM: MMU: audit: fix vcpu's spte walking
KVM: MMU: set access bit for direct mapping
KVM: MMU: cleanup for error mask set while walk guest page table
KVM: MMU: update 'root_hpa' out of loop in PAE shadow path
KVM: x86 emulator: Eliminate compilation warning in x86_decode_insn()
KVM: x86: Fix constant type in kvm_get_time_scale
KVM: VMX: Add AX to list of registers clobbered by guest switch
KVM guest: Move a printk that's using the clock before it's ready
KVM: x86: TSC catchup mode
...
The e500_tlb.c file didn't compile for me due to the following error:
arch/powerpc/kvm/e500_tlb.c: In function ‘kvmppc_e500_shadow_map’:
arch/powerpc/kvm/e500_tlb.c:300: error: format ‘%lx’ expects type ‘long unsigned int’, but argument 2 has type ‘gfn_t’
So let's explicitly cast the argument to make printk happy.
Signed-off-by: Alexander Graf <agraf@suse.de>
The kvmppc_e500_stlbe_invalidate() function was trying to pass too many
parameters to trace_kvm_stlb_inval(). This appears to be a bad
copy-paste from a call to trace_kvm_stlb_write().
Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
BookE also wants to support level based interrupts, so let's implement
all the necessary logic there. We need to trick a bit here because the
irqprios are 1:1 assigned to architecture defined values. But since there
is some space left there, we can just pick a random one and move it later
on - it's internal anyways.
Signed-off-by: Alexander Graf <agraf@suse.de>
Now that we have all the level interrupt magic in place, let's
expose the capability to user space, so it can make use of it!
Signed-off-by: Alexander Graf <agraf@suse.de>
The current interrupt logic is just completely broken. We get a notification
from user space, telling us that an interrupt is there. But then user space
expects us that we just acknowledge an interrupt once we deliver it to the
guest.
This is not how real hardware works though. On real hardware, the interrupt
controller pulls the external interrupt line until it gets notified that the
interrupt was received.
So in reality we have two events: pulling and letting go of the interrupt line.
To maintain backwards compatibility, I added a new request for the pulling
part. The letting go part was implemented earlier already.
With this in place, we can now finally start guests that do not randomly stall
and stop to work at random times.
This patch implements above logic for Book3S.
Signed-off-by: Alexander Graf <agraf@suse.de>
Match only the first part of cur_cpu_spec->platform.
440GP (the first 440 processor) is identified by the string "ppc440gp", while
all later 440 processors use simply "ppc440".
Signed-off-by: Hollis Blanchard <hollis_blanchard@mentor.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Developers can now tell at a glace the exact type of the premature interrupt,
instead of just knowing that there was some premature interrupt.
Signed-off-by: Hollis Blanchard <hollis_blanchard@mentor.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
On Book3S a mtmsr with the MSR_POW bit set indicates that the OS is in
idle and only needs to be waked up on the next interrupt.
Now, unfortunately we let that bit slip into the stored MSR value which
is not what the real CPU does, so that we ended up executing code like
this:
r = mfmsr();
/* r containts MSR_POW */
mtmsr(r | MSR_EE);
This obviously breaks, as we're going into idle mode in code sections that
don't expect to be idling.
This patch masks MSR_POW out of the stored MSR value on wakeup, making
guests happy again.
Signed-off-by: Alexander Graf <agraf@suse.de>
Up until now we were doing segment mappings wrong on Book3s_32. For Book3s_64
we were using a trick where we know that a single mmu_context gives us 16 bits
of context ids.
The mm system on Book3s_32 instead uses a clever algorithm to distribute VSIDs
across the available range, so a context id really only gives us 16 available
VSIDs.
To keep at least a few guest processes in the SID shadow, let's map a number of
contexts that we can use as VSID pool. This makes the code be actually correct
and shouldn't hurt performance too much.
Signed-off-by: Alexander Graf <agraf@suse.de>
When having a decrementor interrupt pending, the dequeuing happens manually
through an mtdec instruction. This instruction simply calls dequeue on that
interrupt, so the int_pending hint doesn't get updated.
This patch enables updating the int_pending hint also on dequeue, thus
correctly enabling guests to stay in guest contexts more often.
Signed-off-by: Alexander Graf <agraf@suse.de>
Now that the actual mtsr doesn't do anything anymore, we can move the sr
contents over to the shared page, so a guest can directly read and write
its sr contents from guest context.
Signed-off-by: Alexander Graf <agraf@suse.de>
Right now we're examining the contents of Book3s_32's segment registers when
the register is written and put the interpreted contents into a struct.
There are two reasons this is bad. For starters, the struct has worse real-time
performance, as it occupies more ram. But the more important part is that with
segment registers being interpreted from their raw values, we can put them in
the shared page, allowing guests to mess with them directly.
This patch makes the internal representation of SRs be u32s.
Signed-off-by: Alexander Graf <agraf@suse.de>
The current approach duplicates the spr->bat finding logic and makes it harder
to reuse the actually used variables. So let's move everything down to the spr
handler.
Signed-off-by: Alexander Graf <agraf@suse.de>
We will soon add SR PV support to the shared page, so we need some
infrastructure that allows the guest to query for features KVM exports.
This patch adds a second return value to the magic mapping that
indicated to the guest which features are available.
Signed-off-by: Alexander Graf <agraf@suse.de>
It turns out the in-kernel hash function is sub-optimal for our subtle
hash inputs where every bit is significant. So let's revert to the original
hash functions.
This reverts commit 05340ab4f9a6626f7a2e8f9fe5397c61d494f445.
Signed-off-by: Alexander Graf <agraf@suse.de>
There is a race condition in the pte invalidation code path where we can't
be sure if a pte was invalidated already. So let's move the spin lock around
to get rid of the race.
Signed-off-by: Alexander Graf <agraf@suse.de>