When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the
guest CR4. Before this CR4 loading, the guest CR4 refers to L2
CR4. Because these two CR4's are in different levels of guest, we
should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which
is used to handle guest writes to its CR4, checks the guest change to
CR4 and may fail if the change is invalid.
The failure may cause trouble. Consider we start
a L1 guest with non-zero L1 PCID in use,
(i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0)
and
a L2 guest with L2 PCID disabled,
(i.e. L2 CR4.PCIDE == 0)
and following events may happen:
1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4
into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because
of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e.
vcpu->arch.cr4) is left to the value of L2 CR4.
2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit,
kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID,
because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1
CR3.PCID != 0, L0 KVM will inject GP to L1 guest.
Fixes: 4704d0befb ("KVM: nVMX: Exiting from L2 to L1")
Cc: qemu-stable@nongnu.org
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
is_last_gpte() is not equivalent to the pseudo-code given in commit
6bb69c9b69 ("KVM: MMU: simplify last_pte_bitmap") because an incorrect
value of last_nonleaf_level may override the result even if level == 1.
It is critical for is_last_gpte() to return true on level == 1 to
terminate page walks. Otherwise memory corruption may occur as level
is used as an index to various data structures throughout the page
walking code. Even though the actual bug would be wherever the MMU is
initialized (as in the previous patch), be defensive and ensure here
that is_last_gpte() returns the correct value.
This patch is also enough to fix CVE-2017-12188.
Fixes: 6bb69c9b69
Cc: stable@vger.kernel.org
Cc: Andy Honig <ahonig@google.com>
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
[Panic if walk_addr_generic gets an incorrect level; this is a serious
bug and it's not worth a WARN_ON where the recovery path might hide
further exploitable issues; suggested by Andrew Honig. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function updates context->root_level but didn't call
update_last_nonleaf_level so the previous and potentially wrong value
was used for page walks. For example, a zero value of last_nonleaf_level
would allow a potential out-of-bounds access in arch/x86/mmu/paging_tmpl.h's
walk_addr_generic function (CVE-2017-12188).
Fixes: 155a97a3d7
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The rework of the posted interrupt handling broke building without
support for the local APIC:
ERROR: "boot_cpu_physical_apicid" [arch/x86/kvm/kvm-intel.ko] undefined!
That configuration is probably not particularly useful anyway, so
we can avoid the randconfig failures by adding a Kconfig dependency.
Fixes: 8b306e2f3c ("KVM: VMX: avoid double list add with VT-d posted interrupts")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When compiling the kernel with the '-frecord-gcc-switches' flag, objtool
complains:
arch/x86/kvm/emulate.o: warning: objtool: .GCC.command.line+0x0: special: can't find new instruction
And also the kernel fails to link.
The problem is that the 'kvm_fastop_exception' code gets placed into the
throwaway '.GCC.command.line' section instead of '.text'.
Exception fixup code is conventionally placed in the '.fixup' section,
so put it there where it belongs.
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Currently, in PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could call
schedule() to reschedule in some cases. This could result in
accidentally ending the current RCU read-side critical section early,
causing random memory corruption in the guest, or otherwise preempting
the currently running task inside between preempt_disable and
preempt_enable.
The difficulty to handle this well is because we don't know whether an
async PF delivered in a preemptible section or RCU read-side critical section
for PREEMPT_COUNT=n, since preempt_disable()/enable() and rcu_read_lock/unlock()
are both no-ops in that case.
To cure this, we treat any async PF interrupting a kernel context as one
that cannot be preempted, preventing kvm_async_pf_task_wait() from choosing
the schedule() path in that case.
To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
so that we know whether it's called from a context interrupting the
kernel, and the parameter is set properly in all the callsites.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
that was finally triggered by PCID support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJZzmJqAAoJEL/70l94x66D99oH/R4hOMfzDxFOgW3LnaCQJvwo
n1+tH3as0dfdkpggZ+UmJuKnbVJ0625+qozenrdYkKtk1YiyIIQWG3vdsz4HBfzp
CYK2NVVymf0dg8DQaluz6iB1R28ke12PggzyFv01s1QyENBDA8J38pslZarPM2OA
tnpRKC6B59/VmRD0PWS6yRmTXY+HfzWlWg4JMraq2VdybbEXJhh8BNfjjNn30DkZ
SW8kHq60AUd5Arhb3cmiPiXZCQ7odqF2u2mEcBmnA9hAacaGEheSzKCUOaEIjmZV
5/jTyG1tZkN7CbrG81ryuoa8A6qTOSyHxo1QkzAmE/p+s2IzGfzzLqmtfIsAWkE=
=1lM1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Mixed bugfixes. Perhaps the most interesting one is a latent bug that
was finally triggered by PCID support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm/x86: Handle async PF in RCU read-side critical sections
KVM: nVMX: Fix nested #PF intends to break L1's vmlauch/vmresume
KVM: VMX: use cmpxchg64
KVM: VMX: simplify and fix vmx_vcpu_pi_load
KVM: VMX: avoid double list add with VT-d posted interrupts
KVM: VMX: extract __pi_post_block
KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception
KVM: nVMX: fix HOST_CR3/HOST_CR4 cache
------------[ cut here ]------------
WARNING: CPU: 4 PID: 5280 at /home/kernel/linux/arch/x86/kvm//vmx.c:11394 nested_vmx_vmexit+0xc2b/0xd70 [kvm_intel]
CPU: 4 PID: 5280 Comm: qemu-system-x86 Tainted: G W OE 4.13.0+ #17
RIP: 0010:nested_vmx_vmexit+0xc2b/0xd70 [kvm_intel]
Call Trace:
? emulator_read_emulated+0x15/0x20 [kvm]
? segmented_read+0xae/0xf0 [kvm]
vmx_inject_page_fault_nested+0x60/0x70 [kvm_intel]
? vmx_inject_page_fault_nested+0x60/0x70 [kvm_intel]
x86_emulate_instruction+0x733/0x810 [kvm]
vmx_handle_exit+0x2f4/0xda0 [kvm_intel]
? kvm_arch_vcpu_ioctl_run+0xd2f/0x1c60 [kvm]
kvm_arch_vcpu_ioctl_run+0xdab/0x1c60 [kvm]
? kvm_arch_vcpu_load+0x62/0x230 [kvm]
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
A nested #PF is triggered during L0 emulating instruction for L2. However, it
doesn't consider we should not break L1's vmlauch/vmresme. This patch fixes
it by queuing the #PF exception instead ,requesting an immediate VM exit from
L2 and keeping the exception for L1 pending for a subsequent nested VM exit.
This should actually work all the time, making vmx_inject_page_fault_nested
totally unnecessary. However, that's not working yet, so this patch can work
around the issue in the meanwhile.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The simplify part: do not touch pi_desc.nv, we can set it when the
VCPU is first created. Likewise, pi_desc.sn is only handled by
vmx_vcpu_pi_load, do not touch it in __pi_post_block.
The fix part: do not check kvm_arch_has_assigned_device, instead
check the SN bit to figure out whether vmx_vcpu_pi_put ran before.
This matches what the previous patch did in pi_post_block.
Cc: Huangweidong <weidong.huang@huawei.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Cc: wangxin <wangxinxin.wang@huawei.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Longpeng (Mike) <longpeng2@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In some cases, for example involving hot-unplug of assigned
devices, pi_post_block can forget to remove the vCPU from the
blocked_vcpu_list. When this happens, the next call to
pi_pre_block corrupts the list.
Fix this in two ways. First, check vcpu->pre_pcpu in pi_pre_block
and WARN instead of adding the element twice in the list. Second,
always do the list removal in pi_post_block if vcpu->pre_pcpu is
set (not -1).
The new code keeps interrupts disabled for the whole duration of
pi_pre_block/pi_post_block. This is not strictly necessary, but
easier to follow. For the same reason, PI.ON is checked only
after the cmpxchg, and to handle it we just call the post-block
code. This removes duplication of the list removal code.
Cc: Huangweidong <weidong.huang@huawei.com>
Cc: Gonglei <arei.gonglei@huawei.com>
Cc: wangxin <wangxinxin.wang@huawei.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Longpeng (Mike) <longpeng2@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename this function to better express that it's all about
initializing the FPU state of a task which goes hand in hand
with the fpu::initialized field.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20170923130016.21448-33-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Another round of CR3/PCID related fixes (I think this addresses all
but one of the known problems with PCID support), an objtool fix plus
a Clang fix that (finally) solves all Clang quirks to build a bootable
x86 kernel as-is"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Fix inline asm call constraints for Clang
objtool: Handle another GCC stack pointer adjustment bug
x86/mm/32: Load a sane CR3 before cpu_init() on secondary CPUs
x86/mm/32: Move setup_clear_cpu_cap(X86_FEATURE_PCID) earlier
x86/mm/64: Stop using CR3.PCID == 0 in ASID-aware code
x86/mm: Factor out CR3-building code
For inline asm statements which have a CALL instruction, we list the
stack pointer as a constraint to convince GCC to ensure the frame
pointer is set up first:
static inline void foo()
{
register void *__sp asm(_ASM_SP);
asm("call bar" : "+r" (__sp))
}
Unfortunately, that pattern causes Clang to corrupt the stack pointer.
The fix is easy: convert the stack pointer register variable to a global
variable.
It should be noted that the end result is different based on the GCC
version. With GCC 6.4, this patch has exactly the same result as
before:
defconfig defconfig-nofp distro distro-nofp
before 9820389 9491555 8816046 8516940
after 9820389 9491555 8816046 8516940
With GCC 7.2, however, GCC's behavior has changed. It now changes its
behavior based on the conversion of the register variable to a global.
That somehow convinces it to *always* set up the frame pointer before
inserting *any* inline asm. (Therefore, listing the variable as an
output constraint is a no-op and is no longer necessary.) It's a bit
overkill, but the performance impact should be negligible. And in fact,
there's a nice improvement with frame pointers disabled:
defconfig defconfig-nofp distro distro-nofp
before 9796316 9468236 9076191 8790305
after 9796957 9464267 9076381 8785949
So in summary, while listing the stack pointer as an output constraint
is no longer necessary for newer versions of GCC, it's still needed for
older versions.
Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/3db862e970c432ae823cf515c52b54fec8270e0e.1505942196.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For nested virt we maintain multiple VMCS that can run on a vCPU. So it is
incorrect to keep vmcs_host_cr3 and vmcs_host_cr4, whose purpose is caching
the value of the rarely changing HOST_CR3 and HOST_CR4 VMCS fields, in
vCPU-wide data structures.
Hyper-V nested on KVM runs into this consistently for me with PCID enabled.
CR3 is updated with a new value, unlikely(cr3 != vmx->host_state.vmcs_host_cr3)
fires, and the currently loaded VMCS is updated. Then we switch from L2 to
L1 and the next exit reverts CR3 to its old value.
Fixes: d6e41f1151 ("x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN_ON_ONCE(pi_test_sn(&vmx->pi_desc)) in kvm_vcpu_trigger_posted_interrupt()
intends to detect the violation of invariant that VT-d PI notification
event is not suppressed when vcpu is in the guest mode. Because the
two checks for the target vcpu mode and the target suppress field
cannot be performed atomically, the target vcpu mode may change in
between. If that does happen, WARN_ON_ONCE() here may raise false
alarms.
As the previous patch fixed the real invariant breaker, remove this
WARN_ON_ONCE() to avoid false alarms, and document the allowed cases
instead.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reported-by: "Ramamurthy, Venkatesh" <venkatesh.ramamurthy@intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 28b835d60f ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
In kvm_vcpu_trigger_posted_interrupt() and pi_pre_block(), KVM
assumes that PI notification events should not be suppressed when the
target vCPU is not blocked.
vmx_update_pi_irte() sets the SN field before changing an interrupt
from posting to remapping, but it does not check the vCPU mode.
Therefore, the change of SN field may break above the assumption.
Besides, I don't see reasons to suppress notification events here, so
remove the changes of SN field to avoid race condition.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reported-by: "Ramamurthy, Venkatesh" <venkatesh.ramamurthy@intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 28b835d60f ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Routine check_cr_write() will trigger emulator_get_cpuid()->
kvm_cpuid() to get maxphyaddr, and NULL is passed as values
for ebx/ecx/edx. This is problematic because kvm_cpuid() will
dereference these pointers.
Fixes: d1cd3ce900 ("KVM: MMU: check guest CR3 reserved bits based on its physical address width.")
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When emulating a nested VM-entry from L1 to L2, several control field
validation checks are deferred to the hardware. Should one of these
validation checks fail, vcpu_vmx_run will set the vmx->fail flag. When
this happens, the L2 guest state is not loaded (even in part), and
execution should continue in L1 with the next instruction after the
VMLAUNCH/VMRESUME.
The VMCS12 is not modified (except for the VM-instruction error
field), the VMCS12 MSR save/load lists are not processed, and the CPU
state is not loaded from the VMCS12 host area. Moreover, the vmcs02
exit reason is stale, so it should not be consulted for any reason.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On an early VMLAUNCH/VMRESUME failure (i.e. one which sets the
VM-instruction error field of the current VMCS), the launch state of
the current VMCS is not set to "launched," and the VM-exit information
fields of the current VMCS (including IDT-vectoring information and
exit reason) are stale.
On a late VMLAUNCH/VMRESUME failure (i.e. one which sets the high bit
of the exit reason field), the launch state of the current VMCS is not
set to "launched," and only two of the VM-exit information fields of
the current VMCS are modified (exit reason and exit
qualification). The remaining VM-exit information fields of the
current VMCS (including IDT-vectoring information, in particular) are
stale.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After a successful VM-entry, RFLAGS is cleared, with the exception of
bit 1, which is always set. This is handled by load_vmcs12_host_state.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The value of the guest_irq argument to vmx_update_pi_irte() is
ultimately coming from a KVM_IRQFD API call. Do not BUG() in
vmx_update_pi_irte() if the value is out-of bounds. (Especially,
since KVM as a whole seems to hang after that.)
Instead, print a message only once if we find that we don't have a
route for a certain IRQ (which can be out-of-bounds or within the
array).
This fixes CVE-2017-1000252.
Fixes: efc644048e ("KVM: x86: Update IRTE for posted-interrupts")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If L1 does not specify the "use TPR shadow" VM-execution control in
vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store
exiting" VM-execution controls in vmcs02. Failure to do so will give
the L2 VM unrestricted read/write access to the hardware CR8.
This fixes CVE-2017-12154.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
qemu-system-x86-8600 [004] d..1 7205.687530: kvm_entry: vcpu 2
qemu-system-x86-8600 [004] .... 7205.687532: kvm_exit: reason EXCEPTION_NMI rip 0xffffffffa921297d info ffffeb2c0e44e018 80000b0e
qemu-system-x86-8600 [004] .... 7205.687532: kvm_page_fault: address ffffeb2c0e44e018 error_code 0
qemu-system-x86-8600 [004] .... 7205.687620: kvm_try_async_get_page: gva = 0xffffeb2c0e44e018, gfn = 0x427e4e
qemu-system-x86-8600 [004] .N.. 7205.687628: kvm_async_pf_not_present: token 0x8b002 gva 0xffffeb2c0e44e018
kworker/4:2-7814 [004] .... 7205.687655: kvm_async_pf_completed: gva 0xffffeb2c0e44e018 address 0x7fcc30c4e000
qemu-system-x86-8600 [004] .... 7205.687703: kvm_async_pf_ready: token 0x8b002 gva 0xffffeb2c0e44e018
qemu-system-x86-8600 [004] d..1 7205.687711: kvm_entry: vcpu 2
After running some memory intensive workload in guest, I catch the kworker
which completes the GUP too quickly, and queues an "Page Ready" #PF exception
after the "Page not Present" exception before the next vmentry as the above
trace which will result in #DF injected to guest.
This patch fixes it by clearing the queue for "Page not Present" if "Page Ready"
occurs before the next vmentry since the GUP has already got the required page
and shadow page table has already been fixed by "Page Ready" handler.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: 7c90705bf2 ("KVM: Inject asynchronous page fault into a PV guest if page is swapped out.")
[Changed indentation and added clearing of injected. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Don't block vCPU if there is pending exception.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
SVM AVIC hardware accelerates guest write to APIC_EOI register
(for edge-trigger interrupt), which means it does not trap to KVM.
So, only enable SVM AVIC only in split irqchip mode.
(e.g. launching qemu w/ option '-machine kernel_irqchip=split').
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Fixes: 44a95dae1d ("KVM: x86: Detect and Initialize AVIC support")
[Removed pr_debug - Radim.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Modify struct kvm_x86_ops.arch.apicv_active() to take struct kvm_vcpu
pointer as parameter in preparation to subsequent changes.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Preparing the base code for subsequent changes. This does not change
existing logic.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Clang resolves __builtin_constant_p() to false even if the expression is
constant in the end. The only purpose of that expression was to
differentiate a case where the following expression couldn't be checked
at compile-time, so we can just remove the check.
Clang handles the following two correctly. Turn it into BUG_ON if there
are any more problems with this.
Fixes: d6321d4933 ("KVM: x86: generalize guest_cpuid_has_ helpers")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When user space sets kvm_run->immediate_exit, KVM is supposed to
return quickly. However, when a vCPU is in KVM_MP_STATE_UNINITIALIZED,
the value is not considered and the vCPU blocks.
Fix that oversight.
Fixes: 460df4c1fc ("KVM: race-free exit from KVM_RUN without POSIX signals")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
KVM API says that KVM_RUN will return with -EINTR when a signal is
pending. However, if a vCPU is in KVM_MP_STATE_UNINITIALIZED, then
the return value is unconditionally -EAGAIN.
Copy over some code from vcpu_run(), so that the case of a pending
signal results in the expected return value.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Common:
- improve heuristic for boosting preempted spinlocks by ignoring VCPUs
in user mode
ARM:
- fix for decoding external abort types from guests
- added support for migrating the active priority of interrupts when
running a GICv2 guest on a GICv3 host
- minor cleanup
PPC:
- expose storage keys to userspace
- merge powerpc/topic/ppc-kvm branch that contains
find_linux_pte_or_hugepte and POWER9 thread management cleanup
- merge kvm-ppc-fixes with a fix that missed 4.13 because of vacations
- fixes
s390:
- merge of topic branch tlb-flushing from the s390 tree to get the
no-dat base features
- merge of kvm/master to avoid conflicts with additional sthyi fixes
- wire up the no-dat enhancements in KVM
- multiple epoch facility (z14 feature)
- Configuration z/Architecture Mode
- more sthyi fixes
- gdb server range checking fix
- small code cleanups
x86:
- emulate Hyper-V TSC frequency MSRs
- add nested INVPCID
- emulate EPTP switching VMFUNC
- support Virtual GIF
- support 5 level page tables
- speedup nested VM exits by packing byte operations
- speedup MMIO by using hardware provided physical address
- a lot of fixes and cleanups, especially nested
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJZspE1AAoJEED/6hsPKofoDcMIALT11n+LKV50QGwQdg2W1GOt
aChbgnj/Kegit3hQlDhVNb8kmdZEOZzSL81Lh0VPEr7zXU8QiWn2snbizDPv8sde
MpHhcZYZZ0YrpoiZKjl8yiwcu88OWGn2qtJ7OpuTS5hvEGAfxMncp0AMZho6fnz/
ySTwJ9GK2MTgBw39OAzCeDOeoYn4NKYMwjJGqBXRhNX8PG/1wmfqv0vPrd6wfg31
KJ58BumavwJjr8YbQ1xELm9rpQrAmaayIsG0R1dEUqCbt5a1+t2gt4h2uY7tWcIv
ACt2bIze7eF3xA+OpRs+eT+yemiH3t9btIVmhCfzUpnQ+V5Z55VMSwASLtTuJRQ=
=R8Ry
-----END PGP SIGNATURE-----
Merge tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"First batch of KVM changes for 4.14
Common:
- improve heuristic for boosting preempted spinlocks by ignoring
VCPUs in user mode
ARM:
- fix for decoding external abort types from guests
- added support for migrating the active priority of interrupts when
running a GICv2 guest on a GICv3 host
- minor cleanup
PPC:
- expose storage keys to userspace
- merge kvm-ppc-fixes with a fix that missed 4.13 because of
vacations
- fixes
s390:
- merge of kvm/master to avoid conflicts with additional sthyi fixes
- wire up the no-dat enhancements in KVM
- multiple epoch facility (z14 feature)
- Configuration z/Architecture Mode
- more sthyi fixes
- gdb server range checking fix
- small code cleanups
x86:
- emulate Hyper-V TSC frequency MSRs
- add nested INVPCID
- emulate EPTP switching VMFUNC
- support Virtual GIF
- support 5 level page tables
- speedup nested VM exits by packing byte operations
- speedup MMIO by using hardware provided physical address
- a lot of fixes and cleanups, especially nested"
* tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
KVM: arm/arm64: Support uaccess of GICC_APRn
KVM: arm/arm64: Extract GICv3 max APRn index calculation
KVM: arm/arm64: vITS: Drop its_ite->lpi field
KVM: arm/arm64: vgic: constify seq_operations and file_operations
KVM: arm/arm64: Fix guest external abort matching
KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd
KVM: s390: vsie: cleanup mcck reinjection
KVM: s390: use WARN_ON_ONCE only for checking
KVM: s390: guestdbg: fix range check
KVM: PPC: Book3S HV: Report storage key support to userspace
KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9
KVM: PPC: Book3S HV: Fix invalid use of register expression
KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation
KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
KVM: PPC: e500mc: Fix a NULL dereference
KVM: PPC: e500: Fix some NULL dereferences on error
KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
KVM: s390: we are always in czam mode
KVM: s390: expose no-DAT to guest and migration support
KVM: s390: sthyi: remove invalid guest write access
...
This fix was intended for 4.13, but didn't get in because both
maintainers were on vacation.
Paul Mackerras:
"It adds mutual exclusion between list_add_rcu and list_del_rcu calls
on the kvm->arch.spapr_tce_tables list. Without this, userspace could
potentially trigger corruption of the list and cause a host crash or
worse."
Pull x86 apic updates from Thomas Gleixner:
"This update provides:
- Cleanup of the IDT management including the removal of the extra
tracing IDT. A first step to cleanup the vector management code.
- The removal of the paravirt op adjust_exception_frame. This is a
XEN specific issue, but merged through this branch to avoid nasty
merge collisions
- Prevent dmesg spam about the TSC DEADLINE bug, when the CPU has
disabled the TSC DEADLINE timer in CPUID.
- Adjust a debug message in the ioapic code to print out the
information correctly"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
x86/idt: Fix the X86_TRAP_BP gate
x86/xen: Get rid of paravirt op adjust_exception_frame
x86/eisa: Add missing include
x86/idt: Remove superfluous ALIGNment
x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on CPUs without the feature
x86/idt: Remove the tracing IDT leftovers
x86/idt: Hide set_intr_gate()
x86/idt: Simplify alloc_intr_gate()
x86/idt: Deinline setup functions
x86/idt: Remove unused functions/inlines
x86/idt: Move interrupt gate initialization to IDT code
x86/idt: Move APIC gate initialization to tables
x86/idt: Move regular trap init to tables
x86/idt: Move IST stack based traps to table init
x86/idt: Move debug stack init to table based
x86/idt: Switch early trap init to IDT tables
x86/idt: Prepare for table based init
x86/idt: Move early IDT setup out of 32-bit asm
x86/idt: Move early IDT handler setup to IDT code
x86/idt: Consolidate IDT invalidation
...
Pull x86 mm changes from Ingo Molnar:
"PCID support, 5-level paging support, Secure Memory Encryption support
The main changes in this cycle are support for three new, complex
hardware features of x86 CPUs:
- Add 5-level paging support, which is a new hardware feature on
upcoming Intel CPUs allowing up to 128 PB of virtual address space
and 4 PB of physical RAM space - a 512-fold increase over the old
limits. (Supercomputers of the future forecasting hurricanes on an
ever warming planet can certainly make good use of more RAM.)
Many of the necessary changes went upstream in previous cycles,
v4.14 is the first kernel that can enable 5-level paging.
This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
default.
(By Kirill A. Shutemov)
- Add 'encrypted memory' support, which is a new hardware feature on
upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
RAM to be encrypted and decrypted (mostly) transparently by the
CPU, with a little help from the kernel to transition to/from
encrypted RAM. Such RAM should be more secure against various
attacks like RAM access via the memory bus and should make the
radio signature of memory bus traffic harder to intercept (and
decrypt) as well.
This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
by default.
(By Tom Lendacky)
- Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
hardware feature that attaches an address space tag to TLB entries
and thus allows to skip TLB flushing in many cases, even if we
switch mm's.
(By Andy Lutomirski)
All three of these features were in the works for a long time, and
it's coincidence of the three independent development paths that they
are all enabled in v4.14 at once"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
x86/mm: Use pr_cont() in dump_pagetable()
x86/mm: Fix SME encryption stack ptr handling
kvm/x86: Avoid clearing the C-bit in rsvd_bits()
x86/CPU: Align CR3 defines
x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
acpi, x86/mm: Remove encryption mask from ACPI page protection type
x86/mm, kexec: Fix memory corruption with SME on successive kexecs
x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
x86/mm: Allow userspace have mappings above 47-bit
x86/mm: Prepare to expose larger address space to userspace
x86/mpx: Do not allow MPX if we have mappings above 47-bit
x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
x86/mm/dump_pagetables: Fix printout of p4d level
x86/mm/dump_pagetables: Generalize address normalization
x86/boot: Fix memremap() related build failure
...
Pull x86 asm updates from Ingo Molnar:
- Introduce the ORC unwinder, which can be enabled via
CONFIG_ORC_UNWINDER=y.
The ORC unwinder is a lightweight, Linux kernel specific debuginfo
implementation, which aims to be DWARF done right for unwinding.
Objtool is used to generate the ORC unwinder tables during build, so
the data format is flexible and kernel internal: there's no
dependency on debuginfo created by an external toolchain.
The ORC unwinder is almost two orders of magnitude faster than the
(out of tree) DWARF unwinder - which is important for perf call graph
profiling. It is also significantly simpler and is coded defensively:
there has not been a single ORC related kernel crash so far, even
with early versions. (knock on wood!)
But the main advantage is that enabling the ORC unwinder allows
CONFIG_FRAME_POINTERS to be turned off - which speeds up the kernel
measurably:
With frame pointers disabled, GCC does not have to add frame pointer
instrumentation code to every function in the kernel. The kernel's
.text size decreases by about 3.2%, resulting in better cache
utilization and fewer instructions executed, resulting in a broad
kernel-wide speedup. Average speedup of system calls should be
roughly in the 1-3% range - measurements by Mel Gorman [1] have shown
a speedup of 5-10% for some function execution intense workloads.
The main cost of the unwinder is that the unwinder data has to be
stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel
config - which is a modest cost on modern x86 systems.
Given how young the ORC unwinder code is it's not enabled by default
- but given the performance advantages the plan is to eventually make
it the default unwinder on x86.
See Documentation/x86/orc-unwinder.txt for more details.
- Remove lguest support: its intended role was that of a temporary
proof of concept for virtualization, plus its removal will enable the
reduction (removal) of the paravirt API as well, so Rusty agreed to
its removal. (Juergen Gross)
- Clean up and fix FSGS related functionality (Andy Lutomirski)
- Clean up IO access APIs (Andy Shevchenko)
- Enhance the symbol namespace (Jiri Slaby)
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
objtool: Handle GCC stack pointer adjustment bug
x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone()
x86/fpu/math-emu: Add ENDPROC to functions
x86/boot/64: Extract efi_pe_entry() from startup_64()
x86/boot/32: Extract efi_pe_entry() from startup_32()
x86/lguest: Remove lguest support
x86/paravirt/xen: Remove xen_patch()
objtool: Fix objtool fallthrough detection with function padding
x86/xen/64: Fix the reported SS and CS in SYSCALL
objtool: Track DRAP separately from callee-saved registers
objtool: Fix validate_branch() return codes
x86: Clarify/fix no-op barriers for text_poke_bp()
x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs
selftests/x86/fsgsbase: Test selectors 1, 2, and 3
x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common
x86/asm: Fix UNWIND_HINT_REGS macro for older binutils
x86/asm/32: Fix regs_get_register() on segment registers
x86/xen/64: Rearrange the SYSCALL entries
x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads
...
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Changed since v1 (Linus Torvalds)
- remove now useless kvm_arch_mmu_notifier_invalidate_page()
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Tested-by: Adam Borowski <kilobyte@angband.pl>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first 32 bits of gate struct are the same for 32 and 64 bit kernels.
The 32-bit version uses desc_struct and no designated data structure,
so we need different accessors for 32 and 64 bit kernels.
Aside of that the macros which are necessary to build the 32-bit
gate descriptor are horrible to read.
Unify the gate structs and switch all code fiddling with it over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20170828064957.861974317@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
d0ec49d4de ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
uses __sme_clr() to remove the C-bit in rsvd_bits(). rsvd_bits() is
just a simple function to return some 1 bits. Applying a mask based
on properties of the host MMU is incorrect. Additionally, the masks
computed by __reset_rsvds_bits_mask also apply to guest page tables,
where the C bit is reserved since we don't emulate SME.
The fix is to clear the C-bit from rsvd_bits_mask array after it has been
populated from __reset_rsvds_bits_mask()
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: paolo.bonzini@gmail.com
Fixes: d0ec49d ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Link: http://lkml.kernel.org/r/20170825205540.123531-1-brijesh.singh@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
According to the SDM, if the "use TPR shadow" VM-execution control is
1, bits 11:0 of the virtual-APIC address must be 0 and the address
should set any bits beyond the processor's physical-address width.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The host pkru is restored right after vcpu exit (commit 1be0e61), so
KVM_GET_XSAVE will return the host PKRU value instead. Fix this by
using the guest PKRU explicitly in fill_xsave and load_xsave. This
part is based on a patch by Junkang Fu.
The host PKRU data may also not match the value in vcpu->arch.guest_fpu.state,
because it could have been changed by userspace since the last time
it was saved, so skip loading it in kvm_load_guest_fpu.
Reported-by: Junkang Fu <junkang.fjk@alibaba-inc.com>
Cc: Yang Zhang <zy107165@alibaba-inc.com>
Fixes: 1be0e61c1f
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a
simple comparison against the host value of the register. The write of
PKRU in addition can be skipped if the guest has not enabled the feature.
Once we do this, we need not test OSPKE in the host anymore, because
guest_CR4.PKE=1 implies host_CR4.PKE=1.
The static PKU test is kept to elide the code on older CPUs.
Suggested-by: Yang Zhang <zy107165@alibaba-inc.com>
Fixes: 1be0e61c1f
Cc: stable@vger.kernel.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>