This patch provides VIRT_CPU_ACCOUTING to PPC32 architecture.
PPC32 doesn't have the PACA structure, so we use the task_info
structure to store the accounting data.
In order to reuse on PPC32 the PPC64 functions, all u64 data has
been replaced by 'unsigned long' so that it is u32 on PPC32 and
u64 on PPC64
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
THREAD_DSCR:
Added in efcac6589a "powerpc: Per process DSCR + some fixes (try#4)"
Last usage removed in 152d523e63 "powerpc: Create context switch helpers save_sprs() and restore_sprs()"
THREAD_DSCR_INHERIT:
Added in 714332858b "powerpc: Restore correct DSCR in context switch"
Last usage removed in 152d523e63 "powerpc: Create context switch helpers save_sprs() and restore_sprs()"
THREAD_TAR:
Added in 2468dcf641 "powerpc: Add support for context switching the TAR register"
Last usage removed in 152d523e63 "powerpc: Create context switch helpers save_sprs() and restore_sprs()"
THREAD_BESCR, THREAD_EBBHR and THREAD_EBBRR:
Added in 9353374b8e "powerpc: Context switch the new EBB SPRs"
Last usage removed in 152d523e63 "powerpc: Create context switch helpers save_sprs() and restore_sprs()"
THREAD_SIAR, THREAD_SDAR, THREAD_SIER, THREAD_MMCR0, and THREAD_MMCR2:
Added in 59affcd3e4 "powerpc: Context switch more PMU related SPRs"
Last usage removed in b11ae95100 "powerpc: Partial revert of "Context switch more PMU related SPRs""
PACA_LOCK_TOKEN:
Added in 9e368f2915 "KVM: PPC: book3s_hv: Add support for PPC970-family processors"
Last usage removed in c17b98cf60 "KVM: PPC: Book3S HV: Remove code for PPC970 processors"
HCALL_STAT_SIZE, HCALL_STAT_CALLS, HCALL_STAT_TB and HCALL_STAT_PURR:
Added in 57852a853b "[POWERPC] powerpc: Instrument Hypervisor Calls"
Last usage removed in c8cd093a6e "powerpc: tracing: Add hypervisor call tracepoints"
VCPU_EPLC:
Added in d30f6e4800 "KVM: PPC: booke: category E.HV (GS-mode) support"
Never used.
CPU_DOWN_FLUSH:
Added in e7affb1dba "powerpc/cache: add cache flush operation for various e500"
Never used.
CFG_STAMP_XSEC:
Added in 14cf11af6c "powerpc: Merge enough to start building in arch/powerpc."
Last usage removed in 0e469db8f7 "powerpc: Rework VDSO gettimeofday to prevent time going backwards"
KVM_LPCR:
Added in aa04b4cc5b "KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests"
Last usage removed in a0144e2a6b "KVM: PPC: Book3S HV: Store LPCR value for each virtual core"
GPR15, GPR16, GPR17, GPR18, GPR19, GPR20, GPR21, GPR22, GPR23, GPR24,
GPR25, GPR26, GPR27, GPR28, GPR29, GPR30 and GPR31:
Added in 14cf11af6c "powerpc: Merge enough to start building in arch/powerpc."
Never used.
VCPU_SHADOW_FSCR:
Added in 616dff8602 "KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR"
Never used.
VCPU_SHADOW_SRR1:
Added in a2d56020d1 "KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu"
Never used.
KVM_SPLIT_SIZE:
Added in b4deba5c41 "KVM: PPC: Book3S HV: Implement dynamicmicro-threading on POWER8"
Never used.
VCPU_VCPUID:
Added in de56a948b9 "KVM: PPC: Add support for Book3S processors in hypervisor mode"
Last usage removed 1b400ba0cd "KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations"
_MQ:
Added in 14cf11af6c "powerpc: Merge enough to start building in arch/powerpc."
Never used.
AUDITCONTEXT:
Added in 14cf11af6c "powerpc: Merge enough to start building in arch/powerpc."
Last usage removed in 401d1f029b "[PATCH] syscall entry/exit revamp"
CLONE_VM:
Added in 14cf11af6c "powerpc: Merge enough to start building in arch/powerpc."
Currently unused.
CLONE_UNTRACED:
Added in 14cf11af6c "powerpc: Merge enough to start building in arch/powerpc."
Currently unused.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com>
[mpe: Munge change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Radix and hash MMU models support different page table sizes. Make
the #defines a variable so that existing code can work with variable
sizes.
Slice related code is only used by hash, so use hash constants there. We
will replicate some of the boundary conditions with resepct to TASK_SIZE
using radix values too. Right now we do boundary condition check using
hash constants.
Swapper pgdir size is initialized in asm code. We select the max pgd
size to keep it simple. For now we select hash pgdir. When adding radix
we will switch that to radix pgdir which is 64K.
BUILD_BUG_ON check which is removed is already done in hugepage_init()
using MAYBE_BUILD_BUG_ON().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add the kconfig logic & assembly support for handling live patched
functions. This depends on DYNAMIC_FTRACE_WITH_REGS, which in turn
depends on the new -mprofile-kernel ftrace ABI, which is only supported
currently on ppc64le.
Live patching is handled by a special ftrace handler. This means it runs
from ftrace_caller(). The live patch handler modifies the NIP so as to
redirect the return from ftrace_caller() to the new patched function.
However there is one particularly tricky case we need to handle.
If a function A calls another function B, and it is known at link time
that they share the same TOC, then A will not save or restore its TOC,
and will call the local entry point of B.
When we live patch B, we replace it with a new function C, which may
not have the same TOC as A. At live patch time it's too late to modify A
to do the TOC save/restore, so the live patching code must interpose
itself between A and C, and do the TOC save/restore that A omitted.
An additionaly complication is that the livepatch code can not create a
stack frame in order to save the TOC. That is because if C takes > 8
arguments, or is varargs, A will have written the arguments for C in
A's stack frame.
To solve this, we introduce a "livepatch stack" which grows upward from
the base of the regular stack, and is used to store the TOC & LR when
calling a live patched function.
When the patched function returns, we retrieve the real LR & TOC from
the livepatch stack, restore them, and pop the livepatch "stack frame".
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Various e500 core have different cache architecture, so they
need different cache flush operations. Therefore, add a callback
function cpu_flush_caches to the struct cpu_spec. The cache flush
operation for the specific kind of e500 is selected at init time.
The callback function will flush all caches inside the current cpu.
Signed-off-by: Chenhui Zhao <chenhui.zhao@freescale.com>
Signed-off-by: Tang Yuantian <Yuantian.Tang@feescale.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Currently the FPU, VEC and VSX facilities are lazily loaded. This is not
a problem unless a process is using these facilities.
Modern versions of GCC are very good at automatically vectorising code,
new and modernised workloads make use of floating point and vector
facilities, even the kernel makes use of vectorised memcpy.
All this combined greatly increases the cost of a syscall since the
kernel uses the facilities sometimes even in syscall fast-path making it
increasingly common for a thread to take an *_unavailable exception soon
after a syscall, not to mention potentially taking all three.
The obvious overcompensation to this problem is to simply always load
all the facilities on every exit to userspace. Loading up all FPU, VEC
and VSX registers every time can be expensive and if a workload does
avoid using them, it should not be forced to incur this penalty.
An 8bit counter is used to detect if the registers have been used in the
past and the registers are always loaded until the value wraps to back
to zero.
Several versions of the assembly in entry_64.S were tested:
1. Always calling C.
2. Performing a common case check and then calling C.
3. A complex check in asm.
After some benchmarking it was determined that avoiding C in the common
case is a performance benefit (option 2). The full check in asm (option
3) greatly complicated that codepath for a negligible performance gain
and the trade-off was deemed not worth it.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Move load_vec in the struct to fill an existing hole, reword change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
fixup
Currently we copy the whole mm_context_t to the paca but only access a
few bits of it. This is wasteful of space paca and also takes quite
some time in the hot path of context switching.
This patch pulls in only the required bits from the mm_context_t to
the paca and on context switch, copies only those.
Benchmarking this (On top of Anton's recent MSR context switching
changes [1]) using processes and yield shows an improvement of almost
3% on POWER8:
http://ozlabs.org/~anton/junkcode/context_switch2.c
./context_switch2 --test=yield --process 0 0
1. https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-October/135700.html
Signed-off-by: Michael Neuling <mikey@neuling.org>
[mpe: Rename paca fields to be mm_ctx_foo rather than context_foo]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a function to copy the mm->context to the paca. This is
only a basic conversion for now but will be used more extensively in
the next patch.
This also adds #ifdef CONFIG_PPC_BOOK3S around this code since it's
not used elsewhere.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip. Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core. With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.
Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.
To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread. In addition we extend the core_info struct to
have information on each subcore. When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.
Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode. It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.
This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap. In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered. The default is 6.
With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch. These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.
It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest. In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path. In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused. This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met. This applies on POWER7, and on POWER8 to guests
with one thread per virtual core. (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)
The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution. If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.
After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.
This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on. In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Since we moved the "lock" to be the first element of
struct tlb_core_data in commit 82d86de25b ("powerpc/e6500: Make TLB
lock recursive"), this macro is not used by any code. Just delete it.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
PACA_DSCR offset macro tracks dscr_default element in the paca
structure. Better change the name of this macro to match that of the
data element it tracks. Makes the code more readable.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This uses msgsnd where possible for signalling other threads within
the same core on POWER8 systems, rather than IPIs through the XICS
interrupt controller. This includes waking secondary threads to run
the guest, the interrupts generated by the virtual XICS, and the
interrupts to bring the other threads out of the guest when exiting.
Aggregated statistics from debugfs across vcpus for a guest with 32
vcpus, 8 threads/vcore, running on a POWER8, show this before the
change:
rm_entry: 3387.6ns (228 - 86600, 1008969 samples)
rm_exit: 4561.5ns (12 - 3477452, 1009402 samples)
rm_intr: 1660.0ns (12 - 553050, 3600051 samples)
and this after the change:
rm_entry: 3060.1ns (212 - 65138, 953873 samples)
rm_exit: 4244.1ns (12 - 9693408, 954331 samples)
rm_intr: 1342.3ns (12 - 1104718, 3405326 samples)
for a test of booting Fedora 20 big-endian to the login prompt.
The time taken for a H_PROD hcall (which is handled in the host
kernel) went down from about 35 microseconds to about 16 microseconds
with this change.
The noinline added to kvmppc_run_core turned out to be necessary for
good performance, at least with gcc 4.9.2 as packaged with Fedora 21
and a little-endian POWER8 host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently, the entry_exit_count field in the kvmppc_vcore struct
contains two 8-bit counts, one of the threads that have started entering
the guest, and one of the threads that have started exiting the guest.
This changes it to an entry_exit_map field which contains two bitmaps
of 8 bits each. The advantage of doing this is that it gives us a
bitmap of which threads need to be signalled when exiting the guest.
That means that we no longer need to use the trick of setting the
HDEC to 0 to pull the other threads out of the guest, which led in
some cases to a spurious HDEC interrupt on the next guest entry.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
We can tell when a secondary thread has finished running a guest by
the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there
is no real need for the nap_count field in the kvmppc_vcore struct.
This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu
pointers of the secondary threads rather than polling vc->nap_count.
Besides reducing the size of the kvmppc_vcore struct by 8 bytes,
this also means that we can tell which secondary threads have got
stuck and thus print a more informative error message.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
* Remove unused kvmppc_vcore::n_busy field.
* Remove setting of RMOR, since it was only used on PPC970 and the
PPC970 KVM support has been removed.
* Don't use r1 or r2 in setting the runlatch since they are
conventionally reserved for other things; use r0 instead.
* Streamline the code a little and remove the ext_interrupt_to_host
label.
* Add some comments about register usage.
* hcall_try_real_mode doesn't need to be global, and can't be
called from C code anyway.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This reads the timebase at various points in the real-mode guest
entry/exit code and uses that to accumulate total, minimum and
maximum time spent in those parts of the code. Currently these
times are accumulated per vcpu in 5 parts of the code:
* rm_entry - time taken from the start of kvmppc_hv_entry() until
just before entering the guest.
* rm_intr - time from when we take a hypervisor interrupt in the
guest until we either re-enter the guest or decide to exit to the
host. This includes time spent handling hcalls in real mode.
* rm_exit - time from when we decide to exit the guest until the
return from kvmppc_hv_entry().
* guest - time spend in the guest
* cede - time spent napping in real mode due to an H_CEDE hcall
while other threads in the same vcore are active.
These times are exposed in debugfs in a directory per vcpu that
contains a file called "timings". This file contains one line for
each of the 5 timings above, with the name followed by a colon and
4 numbers, which are the count (number of times the code has been
executed), the total time, the minimum time, and the maximum time,
all in nanoseconds.
The overhead of the extra code amounts to about 30ns for an hcall that
is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since
production environments may not wish to incur this overhead, the new
code is conditional on a new config symbol,
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
We have two arrays in kvm_host_state that contain register values for
the PMU. Currently we only create an asm-offsets symbol for the base of
the arrays, and do the array offset in the assembly code.
Creating an asm-offsets symbol for each field individually makes the
code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and
might have helped us notice the recent double restore bug we had in this
code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Alexander Graf <agraf@suse.de>
The highlight is the series that reworks the idle management on powernv, which
allows us to use deeper idle states on those machines.
There's the fix from Anton for the "BUG at kernel/smpboot.c:134!" problem.
An i2c driver for powernv. This is acked by Wolfram Sang, and he asked that we
take it through the powerpc tree.
A fix for audit from rgb at Red Hat, acked by Paul Moore who is one of the audit
maintainers.
A patch from Ben to export the symbol map of our OPAL firmware as a sysfs file,
so that tools can use it.
Also some CXL fixes, a couple of powerpc perf fixes, a fix for smt-enabled, and
the patch to add __force to get_user() so we can use bitwise types.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUk+oCAAoJEFHr6jzI4aWADBAP/i/CJ+cu6o4mzNDdfs8bnxqn
RGZCSV+SrkTZPcoLbLiM9iaqq34ORVIn7hwkhkTz2/koluMVfTsqtVulMoFf+hVd
GTVt81MjMFzA3hM3bXEV58KRT79+64K54dLCe0F7OaD6f4AikKR4LLz/PY0EBMiZ
2h13uQlfglaMeYTsaD9eeUpIIKs7+PwsNqUknmN9We07WWfxWqnRpiTR4TYTMXx4
3lQPvCnnHokwDqjuKgwiqDVSaCfCl8laS1i+BPk0G0aRV1AnPDvR3MhgVb2IpNxX
Joxy2D1HSawwDhqHOsId8dkGZXOM4vzo+Y658qnC1XfThqE0MhA+kCfa5/b6xlOR
K7nDO5A41B6nXB3mMOQh/szTXSIa8KJRTR3ibbJJrMdF6F0TN0JLLQNUcmM4j/5D
vvgZEzvFNZhWX98ktlQLde2E4ClWJg6mWESCGSgJeVjIXaxe/6GneIa8vLKm5QMu
OoykNsASMDGqddYMGoYeX/mSsvjPjK0PDO2q19sPbkP8xpyDLx6J8xo+5hO4l8xc
0Cdb38ECfeno+w5oKAnjidHnz0KYBsuYFLeS+rV0b8sUSWAzfdEjSn2AVIQ8gLOv
IOCAqwZ5tL9EcUs+AKru5EHtBEV+2XB54xPRxfdFS/k+vYRE7MpS3ipxveIynN2l
eRxf9hsSO7ASNDd0b3ID
=GXdK
-----END PGP SIGNATURE-----
Merge tag 'powerpc-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull second batch of powerpc updates from Michael Ellerman:
"The highlight is the series that reworks the idle management on
powernv, which allows us to use deeper idle states on those machines.
There's the fix from Anton for the "BUG at kernel/smpboot.c:134!"
problem.
An i2c driver for powernv. This is acked by Wolfram Sang, and he
asked that we take it through the powerpc tree.
A fix for audit from rgb at Red Hat, acked by Paul Moore who is one of
the audit maintainers.
A patch from Ben to export the symbol map of our OPAL firmware as a
sysfs file, so that tools can use it.
Also some CXL fixes, a couple of powerpc perf fixes, a fix for
smt-enabled, and the patch to add __force to get_user() so we can use
bitwise types"
* tag 'powerpc-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc/powernv: Ignore smt-enabled on Power8 and later
powerpc/uaccess: Allow get_user() with bitwise types
powerpc/powernv: Expose OPAL firmware symbol map
powernv/powerpc: Add winkle support for offline cpus
powernv/cpuidle: Redesign idle states management
powerpc/powernv: Enable Offline CPUs to enter deep idle states
powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode
i2c: Driver to expose PowerNV platform i2c busses
powerpc: add little endian flag to syscall_get_arch()
power/perf/hv-24x7: Use kmem_cache_free() instead of kfree
powerpc/perf/hv-24x7: Use per-cpu page buffer
cxl: Unmap MMIO regions when detaching a context
cxl: Add timeout to process element commands
cxl: Change contexts_lock to a mutex to fix sleep while atomic bug
powerpc: Secondary CPUs must set cpu_callin_map after setting active and online
- spring cleaning: removed support for IA64, and for hardware-assisted
virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken because
the (non-KVM) XSAVES patches inadvertently changed the KVM userspace
ABI whenever XSAVES was enabled; hence, this part is going to stable.
Guest support is just a matter of exposing the feature and CPUID leaves
support.
Right now KVM is broken for PPC BookE in your tree (doesn't compile).
I'll reply to the pull request with a patch, please apply it either
before the pull request or in the merge commit, in order to preserve
bisectability somewhat.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJUkpg+AAoJEL/70l94x66DUmoH/jzXYkptSW9NGgm79KqxGJlD
lzLnLBkitVvx++Mz5YBhdJEhKKLUlCtifFT1zPJQ/pthQhIRSaaAwZyNGgUs5w5x
yMGKHiPQFyZRbmQtZhCInW0BftJoYHHciO3nUfHCZnp34My9MP2D55W7/z+fYFfQ
DuqBSE9ThyZJtZ4zh8NRA9fCOeuqwVYRyoBs820Wbsh4cpIBoIK63Dg7k+CLE+ZV
MZa/mRL6bAfsn9W5bnOUAgHJ3SPznnWbO3/g0aV+roL/5pffblprJx9lKNR08xUM
6hDFLop2gDehDJesDkY/o8Ckp1hEouvfsVpSShry4vcgtn0hgh2O5/6Orbmj6vE=
=Zwq1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"3.19 changes for KVM:
- spring cleaning: removed support for IA64, and for hardware-
assisted virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken
because the (non-KVM) XSAVES patches inadvertently changed the KVM
userspace ABI whenever XSAVES was enabled; hence, this part is
going to stable. Guest support is just a matter of exposing the
feature and CPUID leaves support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits)
KVM: move APIC types to arch/x86/
KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
KVM: PPC: Book3S HV: Improve H_CONFER implementation
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
KVM: PPC: Book3S HV: Remove code for PPC970 processors
KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
arch: powerpc: kvm: book3s_pr.c: Remove unused function
arch: powerpc: kvm: book3s.c: Remove some unused functions
arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
KVM: PPC: Book3S HV: ptes are big endian
KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
KVM: PPC: Book3S HV: Fix KSM memory corruption
KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
KVM: PPC: Book3S HV: Fix computation of tlbie operand
KVM: PPC: Book3S HV: Add missing HPTE unlock
KVM: PPC: BookE: Improve irq inject tracepoint
arm/arm64: KVM: Require in-kernel vgic for the arch timers
...
There are two ways in which a guest instruction can be obtained from
the guest in the guest exit code in book3s_hv_rmhandlers.S. If the
exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
instruction), the offending instruction is in the HEIR register
(Hypervisor Emulation Instruction Register). If the exit was caused
by a load or store to an emulated MMIO device, we load the instruction
from the guest by turning data relocation on and loading the instruction
with an lwz instruction.
Unfortunately, in the case where the guest has opposite endianness to
the host, these two methods give results of different endianness, but
both get put into vcpu->arch.last_inst. The HEIR value has been loaded
using guest endianness, whereas the lwz will load the instruction using
host endianness. The rest of the code that uses vcpu->arch.last_inst
assumes it was loaded using host endianness.
To fix this, we define a new vcpu field to store the HEIR value. Then,
in kvmppc_handle_exit_hv(), we transfer the value from this new field to
vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
differ.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This removes the code that was added to enable HV KVM to work
on PPC970 processors. The PPC970 is an old CPU that doesn't
support virtualizing guest memory. Removing PPC970 support also
lets us remove the code for allocating and managing contiguous
real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
case, the code for pinning pages of guest memory when first
accessed and keeping track of which pages have been pinned, and
the code for handling H_ENTER hypercalls in virtual mode.
Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
The KVM_CAP_PPC_RMA capability now always returns 0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Winkle is a deep idle state supported in power8 chips. A core enters
winkle when all the threads of the core enter winkle. In this state
power supply to the entire chiplet i.e core, private L2 and private L3
is turned off. As a result it gives higher powersavings compared to
sleep.
But entering winkle results in a total hypervisor state loss. Hence the
hypervisor context has to be preserved before entering winkle and
restored upon wake up.
Power-on Reset Engine (PORE) is a dedicated engine which is responsible
for powering on the chiplet during wake up. It can be programmed to
restore the register contests of a few specific registers. This patch
uses PORE to restore register state wherever possible and uses stack to
save and restore rest of the necessary registers.
With hypervisor state restore things fall under three categories-
per-core state, per-subcore state and per-thread state. To manage this,
extend the infrastructure introduced for sleep. Mainly we add a paca
variable subcore_sibling_mask. Using this and the core_idle_state we can
distingush first thread in core and subcore.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Deep idle states like sleep and winkle are per core idle states. A core
enters these states only when all the threads enter either the
particular idle state or a deeper one. There are tasks like fastsleep
hardware bug workaround and hypervisor core state save which have to be
done only by the last thread of the core entering deep idle state and
similarly tasks like timebase resync, hypervisor core register restore
that have to be done only by the first thread waking up from these
state.
The current idle state management does not have a way to distinguish the
first/last thread of the core waking/entering idle states. Tasks like
timebase resync are done for all the threads. This is not only is
suboptimal, but can cause functionality issues when subcores and kvm is
involved.
This patch adds the necessary infrastructure to track idle states of
threads in a per-core structure. It uses this info to perform tasks like
fastsleep workaround and timebase resync only once per core.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Originally-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cleanup OpalMCE_* definitions/declarations and other related code which
is not used anymore.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Benjamin Herrrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
they had small conflicts (respectively within KVM documentation,
and with 3.16-rc changes). Since they were all within the subsystem,
I took care of them.
Stephen Rothwell reported some snags in PPC builds, but they are all
fixed now; the latest linux-next report was clean.
New features for ARM include:
- KVM VGIC v2 emulation on GICv3 hardware
- Big-Endian support for arm/arm64 (guest and host)
- Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)
And for PPC:
- Book3S: Good number of LE host fixes, enable HV on LE
- Book3S HV: Add in-guest debug support
This release drops support for KVM on the PPC440. As a result, the
PPC merge removes more lines than it adds. :)
I also included an x86 change, since Davidlohr tied it to an independent
bug report and the reporter quickly provided a Tested-by; there was no
reason to wait for -rc2.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJT4iIJAAoJEBvWZb6bTYbyZqoP/3Wxy8NWPFJ8HGt81NHlGnDS
a9UbL7EibcOEG+aaKqmtBglTD5YDiGBDNCxxiSJaDHt+grLN4fsWIliJob1nJFoO
90f89EWN2XjeCrJXA5nUoeg5tpc5OoYKsiP6pTgzIwkP8vvs/H1+zpcTS/UmYsr/
qipVMMsM+zZeHWZcSbqjW88z7YqIn1sr5282wJ85cbyv4KGizb/G4dyPuDqLb6np
hkAD8Ah6VV2suQ2FSy7G2fg20R0vglUi60hkEHLoCBPVqJCl7SmC8MvxNbjBnP8S
J36R0R0u1wHYKzAGooLJGVOZ/o/gSiVqKX+++L2EvJBN+kuA6u/7fxLyBT+LwDAE
IF/Aln5rpg1fe+eywvhz86WljTVEQ8bO1zVsIQUPY+/ZOPedZHMwyvXft8ogbjSp
2m9OJ/3e8Aggh0OeHpCDoeow+QDUXvX0YdCw+2Yh0p+7VMXqkyp0QEiBu38jrusC
rB3VNifJbDSWLKdG9LfCAPHnxZD2XYEwv2WFBo6KQOGMGHfx0GXpCOL/jQihrhA6
HtEG5Bs3lvnHQemdpUZ58xojiABbMaUPdcnPXQQEp23WhZzrfLMLzqVG0VYnhSsC
9pi7MJj8c31rqx5WU2oRM28i/BvNxN0NCtkDpineO5s3f89Ws1xnwxqlm38AKP0J
irJQTYFEqec+GM9JK1rG
=hyQP
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second round of KVM changes from Paolo Bonzini:
"Here are the PPC and ARM changes for KVM, which I separated because
they had small conflicts (respectively within KVM documentation, and
with 3.16-rc changes). Since they were all within the subsystem, I
took care of them.
Stephen Rothwell reported some snags in PPC builds, but they are all
fixed now; the latest linux-next report was clean.
New features for ARM include:
- KVM VGIC v2 emulation on GICv3 hardware
- Big-Endian support for arm/arm64 (guest and host)
- Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)
And for PPC:
- Book3S: Good number of LE host fixes, enable HV on LE
- Book3S HV: Add in-guest debug support
This release drops support for KVM on the PPC440. As a result, the
PPC merge removes more lines than it adds. :)
I also included an x86 change, since Davidlohr tied it to an
independent bug report and the reporter quickly provided a Tested-by;
there was no reason to wait for -rc2"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (122 commits)
KVM: Move more code under CONFIG_HAVE_KVM_IRQFD
KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use
KVM: nVMX: Fix nested vmexit ack intr before load vmcs01
KVM: PPC: Enable IRQFD support for the XICS interrupt controller
KVM: Give IRQFD its own separate enabling Kconfig option
KVM: Move irq notifier implementation into eventfd.c
KVM: Move all accesses to kvm::irq_routing into irqchip.c
KVM: irqchip: Provide and use accessors for irq routing table
KVM: Don't keep reference to irq routing table in irqfd struct
KVM: PPC: drop duplicate tracepoint
arm64: KVM: fix 64bit CP15 VM access for 32bit guests
KVM: arm64: GICv3: mandate page-aligned GICV region
arm64: KVM: GICv3: move system register access to msr_s/mrs_s
KVM: PPC: PR: Handle FSCR feature deselects
KVM: PPC: HV: Remove generic instruction emulation
KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
KVM: PPC: Remove DCR handling
KVM: PPC: Expose helper functions for data/inst faults
KVM: PPC: Separate loadstore emulation from priv emulation
KVM: PPC: Handle magic page in kvmppc_ld/st
...
SPRN_SPRG is used by debug interrupt handler, so this is required for
debug support.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This provides a way for userspace controls which sPAPR hcalls get
handled in the kernel. Each hcall can be individually enabled or
disabled for in-kernel handling, except for H_RTAS. The exception
for H_RTAS is because userspace can already control whether
individual RTAS functions are handled in-kernel or not via the
KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
H_RTAS is out of the normal sequence of hcall numbers.
Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the
KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM.
The args field of the struct kvm_enable_cap specifies the hcall number
in args[0] and the enable/disable flag in args[1]; 0 means disable
in-kernel handling (so that the hcall will always cause an exit to
userspace) and 1 means enable. Enabling or disabling in-kernel
handling of an hcall is effective across the whole VM.
The ability for KVM_ENABLE_CAP to be used on a VM file descriptor
on PowerPC is new, added by this commit. The KVM_CAP_ENABLE_CAP_VM
capability advertises that this ability exists.
When a VM is created, an initial set of hcalls are enabled for
in-kernel handling. The set that is enabled is the set that have
an in-kernel implementation at this point. Any new hcall
implementations from this point onwards should not be added to the
default set without a good reason.
No distinction is made between real-mode and virtual-mode hcall
implementations; the one setting controls them both.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Old cpus didn't have a Segment Lookaside Buffer (SLB), instead they had
a Segment Table (STAB). Now that we've dropped support for those cpus,
we can remove the STAB support entirely.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc updates from Ben Herrenschmidt:
"Here is the bulk of the powerpc changes for this merge window. It got
a bit delayed in part because I wasn't paying attention, and in part
because I discovered I had a core PCI change without a PCI maintainer
ack in it. Bjorn eventually agreed it was ok to merge it though we'll
probably improve it later and I didn't want to rebase to add his ack.
There is going to be a bit more next week, essentially fixes that I
still want to sort through and test.
The biggest item this time is the support to build the ppc64 LE kernel
with our new v2 ABI. We previously supported v2 userspace but the
kernel itself was a tougher nut to crack. This is now sorted mostly
thanks to Anton and Rusty.
We also have a fairly big series from Cedric that add support for
64-bit LE zImage boot wrapper. This was made harder by the fact that
traditionally our zImage wrapper was always 32-bit, but our new LE
toolchains don't really support 32-bit anymore (it's somewhat there
but not really "supported") so we didn't want to rely on it. This
meant more churn that just endian fixes.
This brings some more LE bits as well, such as the ability to run in
LE mode without a hypervisor (ie. under OPAL firmware) by doing the
right OPAL call to reinitialize the CPU to take HV interrupts in the
right mode and the usual pile of endian fixes.
There's another series from Gavin adding EEH improvements (one day we
*will* have a release with less than 20 EEH patches, I promise!).
Another highlight is the support for the "Split core" functionality on
P8 by Michael. This allows a P8 core to be split into "sub cores" of
4 threads which allows the subcores to run different guests under KVM
(the HW still doesn't support a partition per thread).
And then the usual misc bits and fixes ..."
[ Further delayed by gmail deciding that BenH is a dirty spammer.
Google knows. ]
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (155 commits)
powerpc/powernv: Add missing include to LPC code
selftests/powerpc: Test the THP bug we fixed in the previous commit
powerpc/mm: Check paca psize is up to date for huge mappings
powerpc/powernv: Pass buffer size to OPAL validate flash call
powerpc/pseries: hcall functions are exported to modules, need _GLOBAL_TOC()
powerpc: Exported functions __clear_user and copy_page use r2 so need _GLOBAL_TOC()
powerpc/powernv: Set memory_block_size_bytes to 256MB
powerpc: Allow ppc_md platform hook to override memory_block_size_bytes
powerpc/powernv: Fix endian issues in memory error handling code
powerpc/eeh: Skip eeh sysfs when eeh is disabled
powerpc: 64bit sendfile is capped at 2GB
powerpc/powernv: Provide debugfs access to the LPC bus via OPAL
powerpc/serial: Use saner flags when creating legacy ports
powerpc: Add cpu family documentation
powerpc/xmon: Fix up xmon format strings
powerpc/powernv: Add calls to support little endian host
powerpc: Document sysfs DSCR interface
powerpc: Fix regression of per-CPU DSCR setting
powerpc: Split __SYSFS_SPRSETUP macro
arch: powerpc/fadump: Cleaning up inconsistent NULL checks
...
POWER8 implements a new register called TAR. This register has to be
enabled in FSCR and then from KVM's point of view is mere storage.
This patch enables the guest to use TAR.
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
which contains its status message in a new register called FSCR.
Handle these exits and try to emulate instructions for unhandled facilities.
Follow-on patches enable KVM to expose specific facilities into the guest.
Signed-off-by: Alexander Graf <agraf@suse.de>
The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.
When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.
Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.
For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.
For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.
Signed-off-by: Alexander Graf <agraf@suse.de>
This patch make sure we inherit the LE bit correctly in different case
so that we can run Little Endian distro in PR mode
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Since commit "efcac65 powerpc: Per process DSCR + some fixes (try#4)"
it is no longer possible to set the DSCR on a per-CPU basis.
The old behaviour was to minipulate the DSCR SPR directly but this is no
longer sufficient: the value is quickly overwritten by context switching.
This patch stores the per-CPU DSCR value in a kernel variable rather than
directly in the SPR and it is used whenever a process has not set the DSCR
itself. The sysfs interface (/sys/devices/system/cpu/cpuN/dscr) is unchanged.
Writes to the old global default (/sys/devices/system/cpu/dscr_default)
now set all of the per-CPU values and reads return the last written value.
The new per-CPU default is added to the paca_struct and is used everywhere
outside of sysfs.c instead of the old global default.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Previously SPRG3 was marked for use by both VDSO and critical
interrupts (though critical interrupts were not fully implemented).
In commit 8b64a9dfb0 ("powerpc/booke64:
Use SPRG0/3 scratch for bolted TLB miss & crit int"), Mihai Caraman
made an attempt to resolve this conflict by restoring the VDSO value
early in the critical interrupt, but this has some issues:
- It's incompatible with EXCEPTION_COMMON which restores r13 from the
by-then-overwritten scratch (this cost me some debugging time).
- It forces critical exceptions to be a special case handled
differently from even machine check and debug level exceptions.
- It didn't occur to me that it was possible to make this work at all
(by doing a final "ld r13, PACA_EXCRIT+EX_R13(r13)") until after
I made (most of) this patch. :-)
It might be worth investigating using a load rather than SPRG on return
from all exceptions (except TLB misses where the scratch never leaves
the SPRG) -- it could save a few cycles. Until then, let's stick with
SPRG for all exceptions.
Since we cannot use SPRG4-7 for scratch without corrupting the state of
a KVM guest, move VDSO to SPRG7 on book3e. Since neither SPRG4-7 nor
critical interrupts exist on book3s, SPRG3 is still used for VDSO
there.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: kvm-ppc@vger.kernel.org
two s390 guest features that need some handling in the host,
and all the PPC changes. The PPC changes include support for
little-endian guests and enablement for new POWER8 features.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJS6UF5AAoJEBvWZb6bTYby55kP/AgTJnyu7avN653/2aSHkjkx
KgYSMYhZPIFoY5LyZuNetXaoXFRvCykux1VYSZ6V6s35h2PZ+hdJNbHGjFYKPGTq
FQ92xQVNuWCAPxmFCjDNuDV/0BauG5y08/Orh/jpjz+GAfH43LruUQGbtXUuyJ8u
vf+yTHniU5gguqsAmodqjHUgbf+GoPJ1j7hmRoWwt8IWm7Ns3v/IK4l0p6G0h26a
RjE6aK+Tm208Yr5hD/dRAqeTbBNt3c4xub+QPsKoiEMaZBSuAOiux7D3Kx+If1gp
WsmqEQxoymihVtkZhUFO9ONLJepvmG2QwJVVyMSUW9iqxX9rraXsvVyVMwcQAhog
JuOAYxBftH07xu6Fs4eym5KvCFghM+EaJvxxt+kgnvdD4htK1+eK5trntc2zygSi
/qGiIrkqjXpkskW8kujLayF0eAU3CrZvFWveEPBfFgYiOGX/2wzJCtSm/bt9Jo0M
v60qgNFK3LNqAyeEfnm9VtlwGr6ZgsAB6DHNPX4fM5s2IBjL+qloXk/e/+aVKkW0
I3yeRdy/ExhLAab6w81JtMeR7G3YS0UNuAEVvcoxzNb5wIBY8qnpfUzTKyMxQR94
64EVpxWEYO1s55eCCyMujWrSvc+YAwhJcWHGKgC4K7mxxLD3FVyQXX6YZvgRozMX
HjQju+DToj9CskyrFlRL
=yd0Z
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
"Second batch of KVM updates. Some minor x86 fixes, two s390 guest
features that need some handling in the host, and all the PPC changes.
The PPC changes include support for little-endian guests and
enablement for new POWER8 features"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (45 commits)
x86, kvm: correctly access the KVM_CPUID_FEATURES leaf at 0x40000101
x86, kvm: cache the base of the KVM cpuid leaves
kvm: x86: move KVM_CAP_HYPERV_TIME outside #ifdef
KVM: PPC: Book3S PR: Cope with doorbell interrupts
KVM: PPC: Book3S HV: Add software abort codes for transactional memory
KVM: PPC: Book3S HV: Add new state for transactional memory
powerpc/Kconfig: Make TM select VSX and VMX
KVM: PPC: Book3S HV: Basic little-endian guest support
KVM: PPC: Book3S HV: Add support for DABRX register on POWER7
KVM: PPC: Book3S HV: Prepare for host using hypervisor doorbells
KVM: PPC: Book3S HV: Handle new LPCR bits on POWER8
KVM: PPC: Book3S HV: Handle guest using doorbells for IPIs
KVM: PPC: Book3S HV: Consolidate code that checks reason for wake from nap
KVM: PPC: Book3S HV: Implement architecture compatibility modes for POWER8
KVM: PPC: Book3S HV: Add handler for HV facility unavailable
KVM: PPC: Book3S HV: Flush the correct number of TLB sets on POWER8
KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs
KVM: PPC: Book3S HV: Align physical and virtual CPU thread numbers
KVM: PPC: Book3S HV: Don't set DABR on POWER8
kvm/ppc: IRQ disabling cleanup
...
Pull powerpc updates from Ben Herrenschmidt:
"So here's my next branch for powerpc. A bit late as I was on vacation
last week. It's mostly the same stuff that was in next already, I
just added two patches today which are the wiring up of lockref for
powerpc, which for some reason fell through the cracks last time and
is trivial.
The highlights are, in addition to a bunch of bug fixes:
- Reworked Machine Check handling on kernels running without a
hypervisor (or acting as a hypervisor). Provides hooks to handle
some errors in real mode such as TLB errors, handle SLB errors,
etc...
- Support for retrieving memory error information from the service
processor on IBM servers running without a hypervisor and routing
them to the memory poison infrastructure.
- _PAGE_NUMA support on server processors
- 32-bit BookE relocatable kernel support
- FSL e6500 hardware tablewalk support
- A bunch of new/revived board support
- FSL e6500 deeper idle states and altivec powerdown support
You'll notice a generic mm change here, it has been acked by the
relevant authorities and is a pre-req for our _PAGE_NUMA support"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
powerpc: Add support for the optimised lockref implementation
powerpc/powernv: Call OPAL sync before kexec'ing
powerpc/eeh: Escalate error on non-existing PE
powerpc/eeh: Handle multiple EEH errors
powerpc: Fix transactional FP/VMX/VSX unavailable handlers
powerpc: Don't corrupt transactional state when using FP/VMX in kernel
powerpc: Reclaim two unused thread_info flag bits
powerpc: Fix races with irq_work
Move precessing of MCE queued event out from syscall exit path.
pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
powerpc: Make add_system_ram_resources() __init
powerpc: add SATA_MV to ppc64_defconfig
powerpc/powernv: Increase candidate fw image size
powerpc: Add debug checks to catch invalid cpu-to-node mappings
powerpc: Fix the setup of CPU-to-Node mappings during CPU online
powerpc/iommu: Don't detach device without IOMMU group
powerpc/eeh: Hotplug improvement
powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
powerpc/eeh: Add restore_config operation
...
Add new state for transactional memory (TM) to kvm_vcpu_arch. Also add
asm-offset bits that are going to be required.
This also moves the existing TFHAR, TFIAR and TEXASR SPRs into a
CONFIG_PPC_TRANSACTIONAL_MEM section. This requires some code changes to
ensure we still compile with CONFIG_PPC_TRANSACTIONAL_MEM=N. Much of the added
the added #ifdefs are removed in a later patch when the bulk of the TM code is
added.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix merge conflict]
Signed-off-by: Alexander Graf <agraf@suse.de>
We create a guest MSR from scratch when delivering exceptions in
a few places. Instead of extracting LPCR[ILE] and inserting it
into MSR_LE each time, we simply create a new variable intr_msr which
contains the entire MSR to use. For a little-endian guest, userspace
needs to set the ILE (interrupt little-endian) bit in the LPCR for
each vcpu (or at least one vcpu in each virtual core).
[paulus@samba.org - removed H_SET_MODE implementation from original
version of the patch, and made kvmppc_set_lpcr update vcpu->arch.intr_msr.]
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
The DABRX (DABR extension) register on POWER7 processors provides finer
control over which accesses cause a data breakpoint interrupt. It
contains 3 bits which indicate whether to enable accesses in user,
kernel and hypervisor modes respectively to cause data breakpoint
interrupts, plus one bit that enables both real mode and virtual mode
accesses to cause interrupts. Currently, KVM sets DABRX to allow
both kernel and user accesses to cause interrupts while in the guest.
This adds support for the guest to specify other values for DABRX.
PAPR defines a H_SET_XDABR hcall to allow the guest to set both DABR
and DABRX with one call. This adds a real-mode implementation of
H_SET_XDABR, which shares most of its code with the existing H_SET_DABR
implementation. To support this, we add a per-vcpu field to store the
DABRX value plus code to get and set it via the ONE_REG interface.
For Linux guests to use this new hcall, userspace needs to add
"hcall-xdabr" to the set of strings in the /chosen/hypertas-functions
property in the device tree. If userspace does this and then migrates
the guest to a host where the kernel doesn't include this patch, then
userspace will need to implement H_SET_XDABR by writing the specified
DABR value to the DABR using the ONE_REG interface. In that case, the
old kernel will set DABRX to DABRX_USER | DABRX_KERNEL. That should
still work correctly, at least for Linux guests, since Linux guests
cope with getting data breakpoint interrupts in modes that weren't
requested by just ignoring the interrupt, and Linux guests never set
DABRX_BTI.
The other thing this does is to make H_SET_DABR and H_SET_XDABR work
on POWER8, which has the DAWR and DAWRX instead of DABR/X. Guests that
know about POWER8 should use H_SET_MODE rather than H_SET_[X]DABR, but
guests running in POWER7 compatibility mode will still use H_SET_[X]DABR.
For them, this adds the logic to convert DABR/X values into DAWR/X values
on POWER8.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds fields to the struct kvm_vcpu_arch to store the new
guest-accessible SPRs on POWER8, adds code to the get/set_one_reg
functions to allow userspace to access this state, and adds code to
the guest entry and exit to context-switch these SPRs between host
and guest.
Note that DPDES (Directed Privileged Doorbell Exception State) is
shared between threads on a core; hence we store it in struct
kvmppc_vcore and have the master thread save and restore it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
On a threaded processor such as POWER7, we group VCPUs into virtual
cores and arrange that the VCPUs in a virtual core run on the same
physical core. Currently we don't enforce any correspondence between
virtual thread numbers within a virtual core and physical thread
numbers. Physical threads are allocated starting at 0 on a first-come
first-served basis to runnable virtual threads (VCPUs).
POWER8 implements a new "msgsndp" instruction which guest kernels can
use to interrupt other threads in the same core or sub-core. Since
the instruction takes the destination physical thread ID as a parameter,
it becomes necessary to align the physical thread IDs with the virtual
thread IDs, that is, to make sure virtual thread N within a virtual
core always runs on physical thread N.
This means that it's possible that thread 0, which is where we call
__kvmppc_vcore_entry, may end up running some other vcpu than the
one whose task called kvmppc_run_core(), or it may end up running
no vcpu at all, if for example thread 0 of the virtual core is
currently executing in userspace. However, we do need thread 0
to be responsible for switching the MMU -- a previous version of
this patch that had other threads switching the MMU was found to
be responsible for occasional memory corruption and machine check
interrupts in the guest on POWER7 machines.
To accommodate this, we no longer pass the vcpu pointer to
__kvmppc_vcore_entry, but instead let the assembly code load it from
the PACA. Since the assembly code will need to know the kvm pointer
and the thread ID for threads which don't have a vcpu, we move the
thread ID into the PACA and we add a kvm pointer to the virtual core
structure.
In the case where thread 0 has no vcpu to run, it still calls into
kvmppc_hv_entry in order to do the MMU switch, and then naps until
either its vcpu is ready to run in the guest, or some other thread
needs to exit the guest. In the latter case, thread 0 jumps to the
code that switches the MMU back to the host. This control flow means
that now we switch the MMU before loading any guest vcpu state.
Similarly, on guest exit we now save all the guest vcpu state before
switching the MMU back to the host. This has required substantial
code movement, making the diff rather large.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
There are a few things that make the existing hw tablewalk handlers
unsuitable for e6500:
- Indirect entries go in TLB1 (though the resulting direct entries go in
TLB0).
- It has threads, but no "tlbsrx." -- so we need a spinlock and
a normal "tlbsx". Because we need this lock, hardware tablewalk
is mandatory on e6500 unless we want to add spinlock+tlbsx to
the normal bolted TLB miss handler.
- TLB1 has no HES (nor next-victim hint) so we need software round robin
(TODO: integrate this round robin data with hugetlb/KVM)
- The existing tablewalk handlers map half of a page table at a time,
because IBM hardware has a fixed 1MiB indirect page size. e6500
has variable size indirect entries, with a minimum of 2MiB.
So we can't do the half-page indirect mapping, and even if we
could it would be less efficient than mapping the full page.
- Like on e5500, the linear mapping is bolted, so we don't need the
overhead of supporting nested tlb misses.
Note that hardware tablewalk does not work in rev1 of e6500.
We do not expect to support e6500 rev1 in mainline Linux.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
This modifies kvmppc_load_fp and kvmppc_save_fp to use the generic
FP/VSX and VMX load/store functions instead of open-coding the
FP/VSX/VMX load/store instructions. Since kvmppc_load/save_fp don't
follow C calling conventions, we make them private symbols within
book3s_hv_rmhandlers.S.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This uses struct thread_fp_state and struct thread_vr_state to store
the floating-point, VMX/Altivec and VSX state, rather than flat arrays.
This makes transferring the state to/from the thread_struct simpler
and allows us to unify the get/set_one_reg implementations for the
VSX registers.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
We don't use PACATOC for PR. Avoid updating HOST_R2 with PR
KVM mode when both HV and PR are enabled in the kernel. Without this we
get the below crash
(qemu)
Unable to handle kernel paging request for data at address 0xffffffffffff8310
Faulting instruction address: 0xc00000000001d5a4
cpu 0x2: Vector: 300 (Data Access) at [c0000001dc53aef0]
pc: c00000000001d5a4: .vtime_delta.isra.1+0x34/0x1d0
lr: c00000000001d760: .vtime_account_system+0x20/0x60
sp: c0000001dc53b170
msr: 8000000000009032
dar: ffffffffffff8310
dsisr: 40000000
current = 0xc0000001d76c62d0
paca = 0xc00000000fef1100 softe: 0 irq_happened: 0x01
pid = 4472, comm = qemu-system-ppc
enter ? for help
[c0000001dc53b200] c00000000001d760 .vtime_account_system+0x20/0x60
[c0000001dc53b290] c00000000008d050 .kvmppc_handle_exit_pr+0x60/0xa50
[c0000001dc53b340] c00000000008f51c kvm_start_lightweight+0xb4/0xc4
[c0000001dc53b510] c00000000008cdf0 .kvmppc_vcpu_run_pr+0x150/0x2e0
[c0000001dc53b9e0] c00000000008341c .kvmppc_vcpu_run+0x2c/0x40
[c0000001dc53ba50] c000000000080af4 .kvm_arch_vcpu_ioctl_run+0x54/0x1b0
[c0000001dc53bae0] c00000000007b4c8 .kvm_vcpu_ioctl+0x478/0x730
[c0000001dc53bca0] c0000000002140cc .do_vfs_ioctl+0x4ac/0x770
[c0000001dc53bd80] c0000000002143e8 .SyS_ioctl+0x58/0xb0
[c0000001dc53be30] c000000000009e58 syscall_exit+0x0/0x98
Signed-off-by: Alexander Graf <agraf@suse.de>
Move machine check entry point into Linux. So far we were dependent on
firmware to decode MCE error details and handover the high level info to OS.
This patch introduces early machine check routine that saves the MCE
information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate
stack frame on emergency stack and set the r1 accordingly. This allows us to be
prepared to take another exception without loosing context. One thing to note
here that, if we get another machine check while ME bit is off then we risk a
checkstop. Hence we restrict ourselves to save only MCE information and
register saved on PACA_EXMC save are before we turn the ME bit on. We use
paca->in_mce flag to differentiate between first entry and nested machine check
entry which helps proper use of emergency stack. We increment paca->in_mce
every time we enter in early machine check handler and decrement it while
leaving. When we enter machine check early handler first time (paca->in_mce ==
0), we are sure nobody is using MC emergency stack and allocate a stack frame
at the start of the emergency stack. During subsequent entry (paca->in_mce >
0), we know that r1 points inside emergency stack and we allocate separate
stack frame accordingly. This prevents us from clobbering MCE information
during nested machine checks.
The early machine check handler changes are placed under CPU_FTR_HVMODE
section. This makes sure that the early machine check handler will get executed
only in hypervisor kernel.
This is the code flow:
Machine Check Interrupt
|
V
0x200 vector ME=0, IR=0, DR=0
|
V
+-----------------------------------------------+
|machine_check_pSeries_early: | ME=0, IR=0, DR=0
| Alloc frame on emergency stack |
| Save srr1, srr0, dar and dsisr on stack |
+-----------------------------------------------+
|
(ME=1, IR=0, DR=0, RFID)
|
V
machine_check_handle_early ME=1, IR=0, DR=0
|
V
+-----------------------------------------------+
| machine_check_early (r3=pt_regs) | ME=1, IR=0, DR=0
| Things to do: (in next patches) |
| Flush SLB for SLB errors |
| Flush TLB for TLB errors |
| Decode and save MCE info |
+-----------------------------------------------+
|
(Fall through existing exception handler routine.)
|
V
machine_check_pSerie ME=1, IR=0, DR=0
|
(ME=1, IR=1, DR=1, RFID)
|
V
machine_check_common ME=1, IR=1, DR=1
.
.
.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a
few bugfixes. ARM got transparent huge page support, improved
overcommit, and support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes
some nVidia cards on Windows, that fail to start without these
patches and the corresponding userspace changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJShPAhAAoJEBvWZb6bTYbyl48P/297GgmELHAGBgjvb6q7yyGu
L8+eHjKbh4XBAkPwyzbvUjuww5z2hM0N3JQ0BDV9oeXlO+zwwCEns/sg2Q5/NJXq
XxnTeShaKnp9lqVBnE6G9rAOUWKoyLJ2wItlvUL8JlaO9xJ0Vmk0ta4n2Nv5GqDp
db6UD7vju6rHtIAhNpvvAO51kAOwc01xxRixCVb7KUYOnmO9nvpixzoI/S0Rp1gu
w/OWMfCosDzBoT+cOe79Yx1OKcpaVW94X6CH1s+ShCw3wcbCL2f13Ka8/E3FIcuq
vkZaLBxio7vjUAHRjPObw0XBW4InXEbhI1DjzIvm8dmc4VsgmtLQkTCG8fj+jINc
dlHQUq6Do+1F4zy6WMBUj8tNeP1Z9DsABp98rQwR8+BwHoQpGQBpAxW0TE0ZMngC
t1caqyvjZ5pPpFUxSrAV+8Kg4AvobXPYOim0vqV7Qea07KhFcBXLCfF7BWdwq/Jc
0CAOlsLL4mHGIQWZJuVGw0YGP7oATDCyewlBuDObx+szYCoV4fQGZVBEL0KwJx/1
7lrLN7JWzRyw6xTgJ5VVwgYE1tUY4IFQcHu7/5N+dw8/xg9KWA3f4PeMavIKSf+R
qteewbtmQsxUnvuQIBHLs8NRWPnBPy+F3Sc2ckeOLIe4pmfTte6shtTXcLDL+LqH
NTmT/cfmYp2BRkiCfCiS
=rWNf
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Paolo Bonzini:
"Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a few
bugfixes.
ARM got transparent huge page support, improved overcommit, and
support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes some
nVidia cards on Windows, that fail to start without these patches and
the corresponding userspace changes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
kvm, vmx: Fix lazy FPU on nested guest
arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
arm/arm64: KVM: MMIO support for BE guest
kvm, cpuid: Fix sparse warning
kvm: Delete prototype for non-existent function kvm_check_iopl
kvm: Delete prototype for non-existent function complete_pio
hung_task: add method to reset detector
pvclock: detect watchdog reset at pvclock read
kvm: optimize out smp_mb after srcu_read_unlock
srcu: API for barrier after srcu read unlock
KVM: remove vm mmap method
KVM: IOMMU: hva align mapping page size
KVM: x86: trace cpuid emulation when called from emulator
KVM: emulator: cleanup decode_register_operand() a bit
KVM: emulator: check rex prefix inside decode_register()
KVM: x86: fix emulation of "movzbl %bpl, %eax"
kvm_host: typo fix
KVM: x86: emulate SAHF instruction
MAINTAINERS: add tree for kvm.git
Documentation/kvm: add a 00-INDEX file
...
This way we can use same data type struct with KVM and
also help in using other debug related function.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Acked-by: Michael Neuling <mikey@neuling.org>
[scottwood@freescale.com: removed obvious debug_reg comment]
Signed-off-by: Scott Wood <scottwood@freescale.com>
This help ups to select the relevant code in the kernel code
when we later move HV and PR bits as seperate modules. The patch
also makes the config options for PR KVM selectable
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
With later patches supporting PR kvm as a kernel module, the changes
that has to be built into the main kernel binary to enable PR KVM module
is now selected via KVM_BOOK3S_PR_POSSIBLE
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This way we can use same data type struct with KVM and
also help in using other debug related function.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently PR-style KVM keeps the volatile guest register values
(R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
the main kvm_vcpu struct. For 64-bit, the shadow_vcpu exists in two
places, a kmalloc'd struct and in the PACA, and it gets copied back
and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
can't rely on being able to access the kmalloc'd struct.
This changes the code to copy the volatile values into the shadow_vcpu
as one of the last things done before entering the guest. Similarly
the values are copied back out of the shadow_vcpu to the kvm_vcpu
immediately after exiting the guest. We arrange for interrupts to be
still disabled at this point so that we can't get preempted on 64-bit
and end up copying values from the wrong PACA.
This means that the accessor functions in kvm_book3s.h for these
registers are greatly simplified, and are same between PR and HV KVM.
In places where accesses to shadow_vcpu fields are now replaced by
accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.
With this, the time to read the PVR one million times in a loop went
from 567.7ms to 575.5ms (averages of 6 values), an increase of about
1.4% for this worse-case test for guest entries and exits. The
standard deviation of the measurements is about 11ms, so the
difference is only marginally significant statistically.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This enables us to use the Processor Compatibility Register (PCR) on
POWER7 to put the processor into architecture 2.05 compatibility mode
when running a guest. In this mode the new instructions and registers
that were introduced on POWER7 are disabled in user mode. This
includes all the VSX facilities plus several other instructions such
as ldbrx, stdbrx, popcntw, popcntd, etc.
To select this mode, we have a new register accessible through the
set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT. Setting
this to zero gives the full set of capabilities of the processor.
Setting it to one of the "logical" PVR values defined in PAPR puts
the vcpu into the compatibility mode for the corresponding
architecture level. The supported values are:
0x0f000002 Architecture 2.05 (POWER6)
0x0f000003 Architecture 2.06 (POWER7)
0x0f100003 Architecture 2.06+ (POWER7+)
Since the PCR is per-core, the architecture compatibility level and
the corresponding PCR value are stored in the struct kvmppc_vcore, and
are therefore shared between all vcpus in a virtual core.
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: squash in fix to add missing break statements and documentation]
Signed-off-by: Alexander Graf <agraf@suse.de>
POWER7 and later IBM server processors have a register called the
Program Priority Register (PPR), which controls the priority of
each hardware CPU SMT thread, and affects how fast it runs compared
to other SMT threads. This priority can be controlled by writing to
the PPR or by use of a set of instructions of the form or rN,rN,rN
which are otherwise no-ops but have been defined to set the priority
to particular levels.
This adds code to context switch the PPR when entering and exiting
guests and to make the PPR value accessible through the SET/GET_ONE_REG
interface. When entering the guest, we set the PPR as late as
possible, because if we are setting a low thread priority it will
make the code run slowly from that point on. Similarly, the
first-level interrupt handlers save the PPR value in the PACA very
early on, and set the thread priority to the medium level, so that
the interrupt handling code runs at a reasonable speed.
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds the ability to have a separate LPCR (Logical Partitioning
Control Register) value relating to a guest for each virtual core,
rather than only having a single value for the whole VM. This
corresponds to what real POWER hardware does, where there is a LPCR
per CPU thread but most of the fields are required to have the same
value on all active threads in a core.
The per-virtual-core LPCR can be read and written using the
GET/SET_ONE_REG interface. Userspace can can only modify the
following fields of the LPCR value:
DPFD Default prefetch depth
ILE Interrupt little-endian
TC Translation control (secondary HPT hash group search disable)
We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which
contains bits relating to memory management, i.e. the Virtualized
Partition Memory (VPM) bits and the bits relating to guest real mode.
When this default value is updated, the update needs to be propagated
to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do
that.
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix whitespace]
Signed-off-by: Alexander Graf <agraf@suse.de>
This allows guests to have a different timebase origin from the host.
This is needed for migration, where a guest can migrate from one host
to another and the two hosts might have a different timebase origin.
However, the timebase seen by the guest must not go backwards, and
should go forwards only by a small amount corresponding to the time
taken for the migration.
Therefore this provides a new per-vcpu value accessed via the one_reg
interface using the new KVM_REG_PPC_TB_OFFSET identifier. This value
defaults to 0 and is not modified by KVM. On entering the guest, this
value is added onto the timebase, and on exiting the guest, it is
subtracted from the timebase.
This is only supported for recent POWER hardware which has the TBU40
(timebase upper 40 bits) register. Writing to the TBU40 register only
alters the upper 40 bits of the timebase, leaving the lower 24 bits
unchanged. This provides a way to modify the timebase for guest
migration without disturbing the synchronization of the timebase
registers across CPU cores. The kernel rounds up the value given
to a multiple of 2^24.
Timebase values stored in KVM structures (struct kvm_vcpu, struct
kvmppc_vcore, etc.) are stored as host timebase values. The timebase
values in the dispatch trace log need to be guest timebase values,
however, since that is read directly by the guest. This moves the
setting of vcpu->arch.dec_expires on guest exit to a point after we
have restored the host timebase so that vcpu->arch.dec_expires is a
host timebase value.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently we are not saving and restoring the SIAR and SDAR registers in
the PMU (performance monitor unit) on guest entry and exit. The result
is that performance monitoring tools in the guest could get false
information about where a program was executing and what data it was
accessing at the time of a performance monitor interrupt. This fixes
it by saving and restoring these registers along with the other PMU
registers on guest entry/exit.
This also provides a way for userspace to access these values for a
vcpu via the one_reg interface.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This provides a facility which is intended for use by KVM, where the
contents of the FP/VSX and VMX (Altivec) registers can be saved away
to somewhere other than the thread_struct when kernel code wants to
use floating point or VMX instructions. This is done by providing a
pointer in the thread_struct to indicate where the state should be
saved to. The giveup_fpu() and giveup_altivec() functions test these
pointers and save state to the indicated location if they are non-NULL.
Note that the MSR_FP/VEC bits in task->thread.regs->msr are still used
to indicate whether the CPU register state is live, even when an
alternate save location is being used.
This also provides load_fp_state() and load_vr_state() functions, which
load up FP/VSX and VMX state from memory into the CPU registers, and
corresponding store_fp_state() and store_vr_state() functions, which
store FP/VSX and VMX state into memory from the CPU registers.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This creates new 'thread_fp_state' and 'thread_vr_state' structures
to store FP/VSX state (including FPSCR) and Altivec/VSX state
(including VSCR), and uses them in the thread_struct. In the
thread_fp_state, the FPRs and VSRs are represented as u64 rather
than double, since we rarely perform floating-point computations
on the values, and this will enable the structures to be used
in KVM code as well. Similarly FPSCR is now a u64 rather than
a structure of two 32-bit values.
This takes the offsets out of the macros such as SAVE_32FPRS,
REST_32FPRS, etc. This enables the same macros to be used for normal
and transactional state, enabling us to delete the transactional
versions of the macros. This also removes the unused do_load_up_fpu
and do_load_up_altivec, which were in fact buggy since they didn't
create large enough stack frames to account for the fact that
load_up_fpu and load_up_altivec are not designed to be called from C
and assume that their caller's stack frame is an interrupt frame.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We've been keeping that field in thread_struct for a while, it contains
the "limit" of the current stack pointer and is meant to be used for
detecting stack overflows.
It has a few problems however:
- First, it was never actually *used* on 64-bit. Set and updated but
not actually exploited
- When switching stack to/from irq and softirq stacks, it's update
is racy unless we hard disable interrupts, which is costly. This
is fine on 32-bit as we don't soft-disable there but not on 64-bit.
Thus rather than fixing 2 in order to implement 1 in some hypothetical
future, let's remove the code completely from 64-bit. In order to avoid
a clutter of ifdef's, we remove the updates from C code completely
during interrupt stack switching, and instead maintain it from the
asm helper that is used to do the stack switching in the first place.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull KVM updates from Gleb Natapov:
"The highlights of the release are nested EPT and pv-ticketlocks
support (hypervisor part, guest part, which is most of the code, goes
through tip tree). Apart of that there are many fixes for all arches"
Fix up semantic conflicts as discussed in the pull request thread..
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits)
ARM: KVM: Add newlines to panic strings
ARM: KVM: Work around older compiler bug
ARM: KVM: Simplify tracepoint text
ARM: KVM: Fix kvm_set_pte assignment
ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256
ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs
ARM: KVM: vgic: fix GICD_ICFGRn access
ARM: KVM: vgic: simplify vgic_get_target_reg
KVM: MMU: remove unused parameter
KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
KVM: x86: update masterclock when kvmclock_offset is calculated (v2)
KVM: PPC: Book3S: Fix compile error in XICS emulation
KVM: PPC: Book3S PR: return appropriate error when allocation fails
arch: powerpc: kvm: add signed type cast for comparation
KVM: x86: add comments where MMIO does not return to the emulator
KVM: vmx: count exits to userspace during invalid guest emulation
KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp
kvm: optimize away THP checks in kvm_is_mmio_pfn()
...
If a transaction is rolled back, the Target Address Register (TAR), Processor
Priority Register (PPR) and Data Stream Control Register (DSCR) should be
restored to the checkpointed values before the transaction began. Any changes
to these SPRs inside the transaction should not be visible in the abort
handler.
Currently Linux doesn't save or restore the checkpointed TAR, PPR or DSCR. If
we preempt a processes inside a transaction which has modified any of these, on
process restore, that same transaction may be aborted we but we won't see the
checkpointed versions of these SPRs.
This adds checkpointed versions of these SPRs to the thread_struct and adds the
save/restore of these three SPRs to the treclaim/trechkpt code.
Without this if any of these SPRs are modified during a transaction, users may
incorrectly see a speculated SPR value even if the transaction is aborted.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Unlike the other general-purpose SPRs, SPRG3 can be read by usermode
code, and is used in recent kernels to store the CPU and NUMA node
numbers so that they can be read by VDSO functions. Thus we need to
load the guest's SPRG3 value into the real SPRG3 register when entering
the guest, and restore the host's value when exiting the guest. We don't
need to save the guest SPRG3 value when exiting the guest as usermode
code can't modify SPRG3.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
In commit 59affcd "Context switch more PMU related SPRs" I added more
PMU SPRs to thread_struct, later modified in commit b11ae95. To add
insult to injury it turns out we don't need to switch MMCRA as it's
only user readable, and the value is recomputed by the PMU code.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On BookE (Branch taken + Single Step) is as same as Branch Taken
on BookS and in Linux we simulate BookS behavior for BookE as well.
When doing so, in Branch taken handling we want to set DBCR0_IC but
we update the current->thread->dbcr0 and not DBCR0.
Now on 64bit the current->thread.dbcr0 (and other debug registers)
is synchronized ONLY on context switch flow. But after handling
Branch taken in debug exception if we return back to user space
without context switch then single stepping change (DBCR0_ICMP)
does not get written in h/w DBCR0 and Instruction Complete exception
does not happen.
This fixes using ptrace reliably on BookE-PowerPC
lmbench latency test (lat_syscall) Results are (they varies a little
on each run)
1) ./lat_syscall <action> /dev/shm/uImage
action: Open read write stat fstat null
Before: 3.8618 0.2017 0.2851 1.6789 0.2256 0.0856
After: 3.8580 0.2017 0.2851 1.6955 0.2255 0.0856
1) ./lat_syscall -P 2 -N 10 <action> /dev/shm/uImage
action: Open read write stat fstat null
Before: 4.1388 0.2238 0.3066 1.7106 0.2256 0.0856
After: 4.1413 0.2236 0.3062 1.7107 0.2256 0.0856
[ Slightly modified to avoid extra branch in the fast path
on Book3S and fix build on all non-BookE 64-bit -- BenH
]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In commit 9353374 "Context switch the new EBB SPRs" we added support for
context switching some new EBB SPRs. However despite four of us signing
off on that patch we missed some. To be fair these are not actually new
SPRs, but they are now potentially user accessible so need to be context
switched.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull kvm updates from Gleb Natapov:
"Highlights of the updates are:
general:
- new emulated device API
- legacy device assignment is now optional
- irqfd interface is more generic and can be shared between arches
x86:
- VMCS shadow support and other nested VMX improvements
- APIC virtualization and Posted Interrupt hardware support
- Optimize mmio spte zapping
ppc:
- BookE: in-kernel MPIC emulation with irqfd support
- Book3S: in-kernel XICS emulation (incomplete)
- Book3S: HV: migration fixes
- BookE: more debug support preparation
- BookE: e6500 support
ARM:
- reworking of Hyp idmaps
s390:
- ioeventfd for virtio-ccw
And many other bug fixes, cleanups and improvements"
* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
kvm: Add compat_ioctl for device control API
KVM: x86: Account for failing enable_irq_window for NMI window request
KVM: PPC: Book3S: Add API for in-kernel XICS emulation
kvm/ppc/mpic: fix missing unlock in set_base_addr()
kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
kvm/ppc/mpic: remove users
kvm/ppc/mpic: fix mmio region lists when multiple guests used
kvm/ppc/mpic: remove default routes from documentation
kvm: KVM_CAP_IOMMU only available with device assignment
ARM: KVM: iterate over all CPUs for CPU compatibility check
KVM: ARM: Fix spelling in error message
ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
KVM: ARM: Fix API documentation for ONE_REG encoding
ARM: KVM: promote vfp_host pointer to generic host cpu context
ARM: KVM: add architecture specific hook for capabilities
ARM: KVM: perform HYP initilization for hotplugged CPUs
ARM: KVM: switch to a dual-step HYP init code
ARM: KVM: rework HYP page table freeing
ARM: KVM: enforce maximum size for identity mapped code
ARM: KVM: move to a KVM provided HYP idmap
...
This context switches the new Event Based Branching (EBB) SPRs. The three new
SPRs are:
- Event Based Branch Handler Register (EBBHR)
- Event Based Branch Return Register (EBBRR)
- Branch Event Status and Control Register (BESCR)
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, we wake up a CPU by sending a host IPI with
smp_send_reschedule() to thread 0 of that core, which will take all
threads out of the guest, and cause them to re-evaluate their
interrupt status on the way back in.
This adds a mechanism to differentiate real host IPIs from IPIs sent
by KVM for guest threads to poke each other, in order to target the
guest threads precisely when possible and avoid that global switch of
the core to host state.
We then use this new facility in the in-kernel XICS code.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
done by the host to the virtual processor areas (VPAs) and dispatch
trace logs (DTLs) registered by the guest. This is because those
modifications are done either in real mode or in the host kernel
context, and in neither case does the access go through the guest's
HPT, and thus no change (C) bit gets set in the guest's HPT.
However, the changes done by the host do need to be tracked so that
the modified pages get transferred when doing live migration. In
order to track these modifications, this adds a dirty flag to the
struct representing the VPA/DTL areas, and arranges to set the flag
when the VPA/DTL gets modified by the host. Then, when we are
collecting the dirty log, we also check the dirty flags for the
VPA and DTL for each vcpu and set the relevant bit in the dirty log
if necessary. Doing this also means we now need to keep track of
the guest physical address of the VPA/DTL areas.
So as not to lose track of modifications to a VPA/DTL area when it gets
unregistered, or when a new area gets registered in its place, we need
to transfer the dirty state to the rmap chain. This adds code to
kvmppc_unpin_guest_page() to do that if the area was dirty. To simplify
that code, we now require that all VPA, DTL and SLB shadow buffer areas
fit within a single host page. Guests already comply with this
requirement because pHyp requires that these areas not cross a 4k
boundary.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Installed debug handler will be used for guest debug support
and debug facility emulation features (patches for these
features will follow this patch).
Signed-off-by: Liu Yu <yu.liu@freescale.com>
[bharat.bhushan@freescale.com: Substantial changes]
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Add transactional memory paca scratch register to show_regs. This is useful
for debugging.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds new macros for saving and restoring checkpointed architected state
from and to the thread_struct.
It also adds some debugging macros for when your brain explodes trying to debug
your transactional memory enabled kernel.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The CFAR (Come-From Address Register) is a useful debugging aid that
exists on POWER7 processors. Currently HV KVM doesn't save or restore
the CFAR register for guest vcpus, making the CFAR of limited use in
guests.
This adds the necessary code to capture the CFAR value saved in the
early exception entry code (it has to be saved before any branch is
executed), save it in the vcpu.arch struct, and restore it on entry
to the guest.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Like other places, use thread_struct to get vcpu reference.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This patch adds support for enabling and context switching the Target
Address Register in Power8. The TAR is a new special purpose register
that can be used for computed branches with the bctar[l] (branch
conditional to TAR) instruction in the same manner as the count and link
registers.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[PATCH 4/6] powerpc: Define ppr in thread_struct
ppr in thread_struct is used to save PPR and restore it before process exits
from kernel.
This patch sets the default priority to 3 when tasks are created such
that users can use 4 for higher priority tasks.
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we change or remove a HPT (hashed page table) entry, we can do
either a global TLB invalidation (tlbie) that works across the whole
machine, or a local invalidation (tlbiel) that only affects this core.
Currently we do local invalidations if the VM has only one vcpu or if
the guest requests it with the H_LOCAL flag, though the guest Linux
kernel currently doesn't ever use H_LOCAL. Then, to cope with the
possibility that vcpus moving around to different physical cores might
expose stale TLB entries, there is some code in kvmppc_hv_entry to
flush the whole TLB of entries for this VM if either this vcpu is now
running on a different physical core from where it last ran, or if this
physical core last ran a different vcpu.
There are a number of problems on POWER7 with this as it stands:
- The TLB invalidation is done per thread, whereas it only needs to be
done per core, since the TLB is shared between the threads.
- With the possibility of the host paging out guest pages, the use of
H_LOCAL by an SMP guest is dangerous since the guest could possibly
retain and use a stale TLB entry pointing to a page that had been
removed from the guest.
- The TLB invalidations that we do when a vcpu moves from one physical
core to another are unnecessary in the case of an SMP guest that isn't
using H_LOCAL.
- The optimization of using local invalidations rather than global should
apply to guests with one virtual core, not just one vcpu.
(None of this applies on PPC970, since there we always have to
invalidate the whole TLB when entering and leaving the guest, and we
can't support paging out guest memory.)
To fix these problems and simplify the code, we now maintain a simple
cpumask of which cpus need to flush the TLB on entry to the guest.
(This is indexed by cpu, though we only ever use the bits for thread
0 of each core.) Whenever we do a local TLB invalidation, we set the
bits for every cpu except the bit for thread 0 of the core that we're
currently running on. Whenever we enter a guest, we test and clear the
bit for our core, and flush the TLB if it was set.
On initial startup of the VM, and when resetting the HPT, we set all the
bits in the need_tlb_flush cpumask, since any core could potentially have
stale TLB entries from the previous VM to use the same LPID, or the
previous contents of the HPT.
Then, we maintain a count of the number of online virtual cores, and use
that when deciding whether to use a local invalidation rather than the
number of online vcpus. The code to make that decision is extracted out
into a new function, global_invalidates(). For multi-core guests on
POWER7 (i.e. when we are using mmu notifiers), we now never do local
invalidations regardless of the H_LOCAL flag.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Critical exception on 64-bit booke uses user-visible SPRG3 as scratch.
Restore VDSO information in SPRG3 on exception prolog.
Use a common sprg3 field in PACA for all powerpc64 architectures.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
During a context switch we always restore the per thread DSCR value.
If we aren't doing explicit DSCR management
(ie thread.dscr_inherit == 0) and the default DSCR changed while
the process has been sleeping we end up with the wrong value.
Check thread.dscr_inherit and select the default DSCR or per thread
DSCR as required.
This was found with the following test case, when running with
more threads than CPUs (ie forcing context switching):
http://ozlabs.org/~anton/junkcode/dscr_default_test.c
With the four patches applied I can run a combination of all
test cases successfully at the same time:
http://ozlabs.org/~anton/junkcode/dscr_default_test.chttp://ozlabs.org/~anton/junkcode/dscr_explicit_test.chttp://ozlabs.org/~anton/junkcode/dscr_inherit_test.c
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org> # 3.0+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have a request for a fast method of getting CPU and NUMA node IDs
from userspace. This patch implements a getcpu VDSO function,
similar to x86.
Ben suggested we use SPRG3 which is userspace readable. SPRG3 can be
modified by a KVM guest, so we save the SPRG3 value in the paca and
restore it when transitioning from the guest to the host.
I have a glibc patch that implements sched_getcpu on top of this.
Testing on a POWER7:
baseline: 538 cycles
vdso: 30 cycles
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull KVM changes from Avi Kivity:
"Changes include additional instruction emulation, page-crossing MMIO,
faster dirty logging, preventing the watchdog from killing a stopped
guest, module autoload, a new MSI ABI, and some minor optimizations
and fixes. Outside x86 we have a small s390 and a very large ppc
update.
Regarding the new (for kvm) rebaseless workflow, some of the patches
that were merged before we switch trees had to be rebased, while
others are true pulls. In either case the signoffs should be correct
now."
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h.
I suspect the kvm_para.h resolution ends up doing the "do I have cpuid"
check effectively twice (it was done differently in two different
commits), but better safe than sorry ;)
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits)
KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block
KVM: s390: onereg for timer related registers
KVM: s390: epoch difference and TOD programmable field
KVM: s390: KVM_GET/SET_ONEREG for s390
KVM: s390: add capability indicating COW support
KVM: Fix mmu_reload() clash with nested vmx event injection
KVM: MMU: Don't use RCU for lockless shadow walking
KVM: VMX: Optimize %ds, %es reload
KVM: VMX: Fix %ds/%es clobber
KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
KVM: VMX: unlike vmcs on fail path
KVM: PPC: Emulator: clean up SPR reads and writes
KVM: PPC: Emulator: clean up instruction parsing
kvm/powerpc: Add new ioctl to retreive server MMU infos
kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM
KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler
KVM: PPC: Book3S: Enable IRQs during exit handling
KVM: PPC: Fix PR KVM on POWER7 bare metal
KVM: PPC: Fix stbux emulation
KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields
...
Remove all the iseries specific fields in the lppaca.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commits 2f5cdd5487 ("KVM: PPC: Book3S HV: Make secondary threads more
robust against stray IPIs") and 1c2066b0f7 ("KVM: PPC: Book3S HV: Make
virtual processor area registration more robust") added fields to
struct kvm_vcpu_arch inside #ifdef CONFIG_KVM_BOOK3S_64_HV regions,
and added lines to arch/powerpc/kernel/asm-offsets.c to generate
assembler constants for their offsets. Unfortunately this led to
compile errors on Book 3S machines for configs that had KVM enabled
but not CONFIG_KVM_BOOK3S_64_HV. This fixes the problem by moving
the offending lines inside #ifdef CONFIG_KVM_BOOK3S_64_HV regions.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU. Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration. If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.
This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used. Each area now has a struct kvmppc_vpa which is
used to manage these updates. There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields. (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)
This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.
Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
the need to initialize vcpu->arch.vpa_update_lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently on POWER7, if we are running the guest on a core and we don't
need all the hardware threads, we do nothing to ensure that the unused
threads aren't executing in the kernel (other than checking that they
are offline). We just assume they're napping and we don't do anything
to stop them trying to enter the kernel while the guest is running.
This means that a stray IPI can wake up the hardware thread and it will
then try to enter the kernel, but since the core is in guest context,
it will execute code from the guest in hypervisor mode once it turns the
MMU on, which tends to lead to crashes or hangs in the host.
This fixes the problem by adding two new one-byte flags in the
kvmppc_host_state structure in the PACA which are used to interlock
between the primary thread and the unused secondary threads when entering
the guest. With these flags, the primary thread can ensure that the
unused secondaries are not already in kernel mode (i.e. handling a stray
IPI) and then indicate that they should not try to enter the kernel
if they do get woken for any reason. Instead they will go into KVM code,
find that there is no vcpu to run, acknowledge and clear the IPI and go
back to nap mode.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Chips such as e500mc that implement category E.HV in Power ISA 2.06
provide hardware virtualization features, including a new MSR mode for
guest state. The guest OS can perform many operations without trapping
into the hypervisor, including transitions to and from guest userspace.
Since we can use SRR1[GS] to reliably tell whether an exception came from
guest state, instead of messing around with IVPR, we use DO_KVM similarly
to book3s.
Current issues include:
- Machine checks from guest state are not routed to the host handler.
- The guest can cause a host oops by executing an emulated instruction
in a page that lacks read permission. Existing e500/4xx support has
the same problem.
Includes work by Ashish Kalra <Ashish.Kalra@freescale.com>,
Varun Sethi <Varun.Sethi@freescale.com>, and
Liu Yu <yu.liu@freescale.com>.
Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: remove pt_regs usage]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Pull kvm updates from Avi Kivity:
"Changes include timekeeping improvements, support for assigning host
PCI devices that share interrupt lines, s390 user-controlled guests, a
large ppc update, and random fixes."
This is with the sign-off's fixed, hopefully next merge window we won't
have rebased commits.
* 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: Convert intx_mask_lock to spin lock
KVM: x86: fix kvm_write_tsc() TSC matching thinko
x86: kvmclock: abstract save/restore sched_clock_state
KVM: nVMX: Fix erroneous exception bitmap check
KVM: Ignore the writes to MSR_K7_HWCR(3)
KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask
KVM: PMU: add proper support for fixed counter 2
KVM: PMU: Fix raw event check
KVM: PMU: warn when pin control is set in eventsel msr
KVM: VMX: Fix delayed load of shared MSRs
KVM: use correct tlbs dirty type in cmpxchg
KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
KVM: x86 emulator: Allow PM/VM86 switch during task switch
KVM: SVM: Fix CPL updates
KVM: x86 emulator: VM86 segments must have DPL 3
KVM: x86 emulator: Fix task switch privilege checks
arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice
KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation
KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
...
The current implementation of lazy interrupts handling has some
issues that this tries to address.
We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.
The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.
Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.
This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.
The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.
When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.
We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).
This removes the need to play with the decrementer to try to create
fake interrupts, among others.
In addition, this adds a few refinements:
- We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.
- Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.
- On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appear to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)
- We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.
Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2:
- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
to retrigger an interrupt without preventing hard-enable
v3:
- Fix or vs. ori bug on Book3E
- Fix enabling of interrupts for some exceptions on Book3E
v4:
- Fix resend of doorbells on return from interrupt on Book3E
v5:
- Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.
v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq
v7:
- Fix a bug with hard irq state tracking on native power7
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>