Pull spectre/meltdown updates from Thomas Gleixner:
"The next round of updates related to melted spectrum:
- The initial set of spectre V1 mitigations:
- Array index speculation blocker and its usage for syscall,
fdtable and the n180211 driver.
- Speculation barrier and its usage in user access functions
- Make indirect calls in KVM speculation safe
- Blacklisting of known to be broken microcodes so IPBP/IBSR are not
touched.
- The initial IBPB support and its usage in context switch
- The exposure of the new speculation MSRs to KVM guests.
- A fix for a regression in x86/32 related to the cpu entry area
- Proper whitelisting for known to be safe CPUs from the mitigations.
- objtool fixes to deal proper with retpolines and alternatives
- Exclude __init functions from retpolines which speeds up the boot
process.
- Removal of the syscall64 fast path and related cleanups and
simplifications
- Removal of the unpatched paravirt mode which is yet another source
of indirect unproteced calls.
- A new and undisputed version of the module mismatch warning
- A couple of cleanup and correctness fixes all over the place
Yet another step towards full mitigation. There are a few things still
missing like the RBS underflow mitigation for Skylake and other small
details, but that's being worked on.
That said, I'm taking a belated christmas vacation for a week and hope
that everything is magically solved when I'm back on Feb 12th"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
KVM/x86: Add IBPB support
KVM/x86: Update the reverse_cpuid list to include CPUID_7_EDX
x86/speculation: Fix typo IBRS_ATT, which should be IBRS_ALL
x86/pti: Mark constant arrays as __initconst
x86/spectre: Simplify spectre_v2 command line parsing
x86/retpoline: Avoid retpolines for built-in __init functions
x86/kvm: Update spectre-v1 mitigation
KVM: VMX: make MSR bitmaps per-VCPU
x86/paravirt: Remove 'noreplace-paravirt' cmdline option
x86/speculation: Use Indirect Branch Prediction Barrier in context switch
x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
x86/spectre: Fix spelling mistake: "vunerable"-> "vulnerable"
x86/spectre: Report get_user mitigation for spectre_v1
nl80211: Sanitize array index in parse_txq_params
vfs, fdtable: Prevent bounds-check bypass via speculative execution
x86/syscall: Sanitize syscall table de-references under speculation
x86/get_user: Use pointer masking to limit speculation
...
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass all
hardened usercopy checks since these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over the
next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4
43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq
cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ
NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q
9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm
uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1
gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr
C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug
d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H
jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi
Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc
JgOmUnQNJWCTwUUw5AS1
=tzmJ
-----END PGP SIGNATURE-----
Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy whitelisting from Kees Cook:
"Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs.
To further restrict what memory is available for copying, this creates
a way to whitelist specific areas of a given slab cache object for
copying to/from userspace, allowing much finer granularity of access
control.
Slab caches that are never exposed to userspace can declare no
whitelist for their objects, thereby keeping them unavailable to
userspace via dynamic copy operations. (Note, an implicit form of
whitelisting is the use of constant sizes in usercopy operations and
get_user()/put_user(); these bypass all hardened usercopy checks since
these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over
the next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage"
* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
lkdtm: Update usercopy tests for whitelisting
usercopy: Restrict non-usercopy caches to size 0
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
kvm: whitelist struct kvm_vcpu_arch
arm: Implement thread_struct whitelist for hardened usercopy
arm64: Implement thread_struct whitelist for hardened usercopy
x86: Implement thread_struct whitelist for hardened usercopy
fork: Provide usercopy whitelisting for task_struct
fork: Define usercopy region in thread_stack slab caches
fork: Define usercopy region in mm_struct slab caches
net: Restrict unwhitelisted proto caches to size 0
sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
sctp: Define usercopy region in SCTP proto slab cache
caif: Define usercopy region in caif proto slab cache
ip: Define usercopy region in IP proto slab cache
net: Define usercopy region in struct proto slab cache
scsi: Define usercopy region in scsi_sense_cache slab cache
cifs: Define usercopy region in cifs_request slab cache
vxfs: Define usercopy region in vxfs_inode slab cache
ufs: Define usercopy region in ufs_inode_cache slab cache
...
[ Based on a patch from Paolo Bonzini <pbonzini@redhat.com> ]
... basically doing exactly what we do for VMX:
- Passthrough SPEC_CTRL to guests (if enabled in guest CPUID)
- Save and restore SPEC_CTRL around VMExit and VMEntry only if the guest
actually used it.
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: kvm@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/1517669783-20732-1-git-send-email-karahmed@amazon.de
[ Based on a patch from Ashok Raj <ashok.raj@intel.com> ]
Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.
To avoid the overhead of saving and restoring the MSR_IA32_SPEC_CTRL for
guests that do not actually use the MSR, only start saving and restoring
when a non-zero is written to it.
No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.
[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: kvm@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/1517522386-18410-5-git-send-email-karahmed@amazon.de
Intel processors use MSR_IA32_ARCH_CAPABILITIES MSR to indicate RDCL_NO
(bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default the
contents will come directly from the hardware, but user-space can still
override it.
[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: kvm@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/1517522386-18410-4-git-send-email-karahmed@amazon.de
The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.
Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.
IBPB helps mitigate against three potential attacks:
* Mitigate guests from being attacked by other guests.
- This is addressed by issing IBPB when we do a guest switch.
* Mitigate attacks from guest/ring3->host/ring3.
These would require a IBPB during context switch in host, or after
VMEXIT. The host process has two ways to mitigate
- Either it can be compiled with retpoline
- If its going through context switch, and has set !dumpable then
there is a IBPB in that path.
(Tim's patch: https://patchwork.kernel.org/patch/10192871)
- The case where after a VMEXIT you return back to Qemu might make
Qemu attackable from guest when Qemu isn't compiled with retpoline.
There are issues reported when doing IBPB on every VMEXIT that resulted
in some tsc calibration woes in guest.
* Mitigate guest/ring0->host/ring0 attacks.
When host kernel is using retpoline it is safe against these attacks.
If host kernel isn't using retpoline we might need to do a IBPB flush on
every VMEXIT.
Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.
* IBPB is issued only for SVM during svm_free_vcpu().
VMX has a vmclear and SVM doesn't. Follow discussion here:
https://lkml.org/lkml/2018/1/15/146
Please refer to the following spec for more details on the enumeration
and control.
Refer here to get documentation about mitigations.
https://software.intel.com/en-us/side-channel-security-support
[peterz: rebase and changelog rewrite]
[karahmed: - rebase
- vmx: expose PRED_CMD if guest has it in CPUID
- svm: only pass through IBPB if guest has it in CPUID
- vmx: support !cpu_has_vmx_msr_bitmap()]
- vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
PRED_CMD is a write-only MSR]
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: kvm@vger.kernel.org
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com
Link: https://lkml.kernel.org/r/1517522386-18410-3-git-send-email-karahmed@amazon.de
[dwmw2: Stop using KF() for bits in it, too]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Link: https://lkml.kernel.org/r/1517522386-18410-2-git-send-email-karahmed@amazon.de
Commit 75f139aaf8 "KVM: x86: Add memory barrier on vmcs field lookup"
added a raw 'asm("lfence");' to prevent a bounds check bypass of
'vmcs_field_to_offset_table'.
The lfence can be avoided in this path by using the array_index_nospec()
helper designed for these types of fixes.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Honig <ahonig@google.com>
Cc: kvm@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Link: https://lkml.kernel.org/r/151744959670.6342.3001723920950249067.stgit@dwillia2-desk3.amr.corp.intel.com
Place the MSR bitmap in struct loaded_vmcs, and update it in place
every time the x2apic or APICv state can change. This is rare and
the loop can handle 64 MSRs per iteration, in a similar fashion as
nested_vmx_prepare_msr_bitmap.
This prepares for choosing, on a per-VM basis, whether to intercept
the SPEC_CTRL and PRED_CMD MSRs.
Cc: stable@vger.kernel.org # prereq for Spectre mitigation
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJabj6pAAoJEHm+PkMAQRiGs8cIAJQFkCWnbz86e3vG4DuWhyA8
CMGHCQdUOxxFGa/ixhIiuetbC0x+JVHAjV2FwVYbAQfaZB3pfw2iR1ncQxpAP1AI
oLU9vBEqTmwKMPc9CM5rRfnLFWpGcGwUNzgPdxD5yYqGDtcM8K840mF6NdkYe5AN
xU8rv1wlcFPF4A5pvHCH0pvVmK4VxlVFk/2H67TFdxBs4PyJOnSBnf+bcGWgsKO6
hC8XIVtcKCH2GfFxt5d0Vgc5QXJEpX1zn2mtCa1MwYRjN2plgYfD84ha0xE7J0B0
oqV/wnjKXDsmrgVpncr3txd4+zKJFNkdNRE4eLAIupHo2XHTG4HvDJ5dBY2NhGU=
=sOml
-----END PGP SIGNATURE-----
Merge tag 'v4.15' into x86/pti, to be able to merge dependent changes
Time has come to switch PTI development over to a v4.15 base - we'll still
try to make sure that all PTI fixes backport cleanly to v4.14 and earlier.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86/pti updates from Thomas Gleixner:
"Another set of melted spectrum related changes:
- Code simplifications and cleanups for RSB and retpolines.
- Make the indirect calls in KVM speculation safe.
- Whitelist CPUs which are known not to speculate from Meltdown and
prepare for the new CPUID flag which tells the kernel that a CPU is
not affected.
- A less rigorous variant of the module retpoline check which merily
warns when a non-retpoline protected module is loaded and reflects
that fact in the sysfs file.
- Prepare for Indirect Branch Prediction Barrier support.
- Prepare for exposure of the Speculation Control MSRs to guests, so
guest OSes which depend on those "features" can use them. Includes
a blacklist of the broken microcodes. The actual exposure of the
MSRs through KVM is still being worked on"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Simplify indirect_branch_prediction_barrier()
x86/retpoline: Simplify vmexit_fill_RSB()
x86/cpufeatures: Clean up Spectre v2 related CPUID flags
x86/cpu/bugs: Make retpoline module warning conditional
x86/bugs: Drop one "mitigation" from dmesg
x86/nospec: Fix header guards names
x86/alternative: Print unadorned pointers
x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes
x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
x86/msr: Add definitions for new speculation control MSRs
x86/cpufeatures: Add AMD feature bits for Speculation Control
x86/cpufeatures: Add Intel feature bits for Speculation Control
x86/cpufeatures: Add CPUID_7_EDX CPUID leaf
module/retpoline: Warn about missing retpoline in module
KVM: VMX: Make indirect call speculation safe
KVM: x86: Make indirect calls in emulator speculation safe
Group together the calls to alloc_vmcs and loaded_vmcs_init. Soon we'll also
allocate an MSR bitmap there.
Cc: stable@vger.kernel.org # prereq for Spectre mitigation
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The potential performance advantages of a vmcs02 pool have never been
realized. To simplify the code, eliminate the pool. Instead, a single
vmcs02 is allocated per VCPU when the VCPU enters VMX operation.
Cc: stable@vger.kernel.org # prereq for Spectre mitigation
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Ameya More <ameya.more@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Replace indirect call with CALL_NOSPEC.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rga@amazon.de
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180125095843.645776917@infradead.org
Replace the indirect calls with CALL_NOSPEC.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rga@amazon.de
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180125095843.595615683@infradead.org
kvm_valid_sregs() should use X86_CR0_PG and X86_CR4_PAE to check bit
status rather than X86_CR0_PG_BIT and X86_CR4_PAE_BIT. This patch is
to fix it.
Fixes: f29810335965a(KVM/x86: Check input paging mode when cs.l is set)
Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This ioctl is obsolete (it was used by Xenner as far as I know) but
still let's not break it gratuitously... Its handler is copying
directly into struct kvm. Go through a bounce buffer instead, with
the added benefit that we can actually do something useful with the
flags argument---the previous code was exiting with -EINVAL but still
doing the copy.
This technically is a userspace ABI breakage, but since no one should be
using the ioctl, it's a good occasion to see if someone actually
complains.
Cc: kernel-hardening@lists.openwall.com
Cc: Kees Cook <keescook@chromium.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull x86 pti updates from Thomas Gleixner:
"This contains:
- a PTI bugfix to avoid setting reserved CR3 bits when PCID is
disabled. This seems to cause issues on a virtual machine at least
and is incorrect according to the AMD manual.
- a PTI bugfix which disables the perf BTS facility if PTI is
enabled. The BTS AUX buffer is not globally visible and causes the
CPU to fault when the mapping disappears on switching CR3 to user
space. A full fix which restores BTS on PTI is non trivial and will
be worked on.
- PTI bugfixes for EFI and trusted boot which make sure that the user
space visible page table entries have the NX bit cleared
- removal of dead code in the PTI pagetable setup functions
- add PTI documentation
- add a selftest for vsyscall to verify that the kernel actually
implements what it advertises.
- a sysfs interface to expose vulnerability and mitigation
information so there is a coherent way for users to retrieve the
status.
- the initial spectre_v2 mitigations, aka retpoline:
+ The necessary ASM thunk and compiler support
+ The ASM variants of retpoline and the conversion of affected ASM
code
+ Make LFENCE serializing on AMD so it can be used as speculation
trap
+ The RSB fill after vmexit
- initial objtool support for retpoline
As I said in the status mail this is the most of the set of patches
which should go into 4.15 except two straight forward patches still on
hold:
- the retpoline add on of LFENCE which waits for ACKs
- the RSB fill after context switch
Both should be ready to go early next week and with that we'll have
covered the major holes of spectre_v2 and go back to normality"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
x86,perf: Disable intel_bts when PTI
security/Kconfig: Correct the Documentation reference for PTI
x86/pti: Fix !PCID and sanitize defines
selftests/x86: Add test_vsyscall
x86/retpoline: Fill return stack buffer on vmexit
x86/retpoline/irq32: Convert assembler indirect jumps
x86/retpoline/checksum32: Convert assembler indirect jumps
x86/retpoline/xen: Convert Xen hypercall indirect jumps
x86/retpoline/hyperv: Convert assembler indirect jumps
x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
x86/retpoline/entry: Convert entry assembler indirect jumps
x86/retpoline/crypto: Convert crypto assembler indirect jumps
x86/spectre: Add boot time option to select Spectre v2 mitigation
x86/retpoline: Add initial retpoline support
objtool: Allow alternatives to be ignored
objtool: Detect jumps to retpoline thunks
x86/pti: Make unpoison of pgd for trusted boot work for real
x86/alternatives: Fix optimize_nops() checking
sysfs/cpu: Fix typos in vulnerability documentation
x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
...
In accordance with the Intel and AMD documentation, we need to overwrite
all entries in the RSB on exiting a guest, to prevent malicious branch
target predictions from affecting the host kernel. This is needed both
for retpoline and for IBRS.
[ak: numbers again for the RSB stuffing labels]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/1515755487-8524-1-git-send-email-dwmw@amazon.co.uk
This adds a memory barrier when performing a lookup into
the vmcs_field_to_offset_table. This is related to
CVE-2017-5753.
Signed-off-by: Andrew Honig <ahonig@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commits ae1f576707
and ac9b305caa.
If the hardware doesn't support MOVBE, but L0 sets CPUID.01H:ECX.MOVBE
in L1's emulated CPUID information, then L1 is likely to pass that
CPUID bit through to L2. L2 will expect MOVBE to work, but if L1
doesn't intercept #UD, then any MOVBE instruction executed in L2 will
raise #UD, and the exception will be delivered in L2.
Commit ac9b305caa is a better and more
complete version of ae1f576707 ("KVM: nVMX: Do not emulate #UD while
in guest mode"); however, neither considers the above case.
Suggested-by: Jim Mattson <jmattson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Patch "mm,vmscan: mark register_shrinker() as __must_check" is
queued for 4.16 in linux-mm and adds a warning about the unchecked
call to register_shrinker:
arch/x86/kvm/mmu.c:5485:2: warning: ignoring return value of 'register_shrinker', declared with attribute warn_unused_result [-Wunused-result]
This changes the kvm_mmu_module_init() function to fail itself
when the call to register_shrinker fails.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For example, when two APF's for page ready happen after one exit and
the first one becomes pending, the second one will result in #DF.
Instead, just handle the second page fault synchronously.
Reported-by: Ross Zwisler <zwisler@gmail.com>
Message-ID: <CAOxpaSUBf8QoOZQ1p4KfUp0jq76OKfGY4Uxs-Gg8ngReD99xww@mail.gmail.com>
Reported-by: Alec Blayne <ab@tevsa.net>
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
s390:
* Two fixes for potential bitmap overruns in the cmma migration code
x86:
* Clear guest provided GPRs to defeat the Project Zero PoC for CVE
2017-5715
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJaUTJ4AAoJEED/6hsPKofohk0IAJAFlMG66u5MxC0kSM61U4Zf
1vkzRwAkBbcN82LpGQKbqabVyTq0F3aLipyOn6WO5SN0K5m+OI2OV/aAroPyX8bI
F7nWIqTXLhJ9X6KXINFvyavHMprvWl8PA72tR/B/7GhhfShrZ2wGgqhl0vv/kCUK
/8q+5e693yJqw8ceemin9a6kPJrLpmjeH+Oy24KIlGbvJWV4UrIE86pRHnAnBtg8
L7Vbxn5+ezKmakvBh+zF8NKcD1zHDcmQZHoYFPsQT0vX5GPoYqT2bcO6gsh1Grmp
8ti6KkrnP+j2A/OEna4LBWfwKI/1xHXneB22BYrAxvNjHt+R4JrjaPpx82SEB4Y=
=URMR
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"s390:
- Two fixes for potential bitmap overruns in the cmma migration code
x86:
- Clear guest provided GPRs to defeat the Project Zero PoC for CVE
2017-5715"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: vmx: Scrub hardware GPRs at VM-exit
KVM: s390: prevent buffer overrun on memory hotplug during migration
KVM: s390: fix cmma migration for multiple memory slots
Guest GPR values are live in the hardware GPRs at VM-exit. Do not
leave any guest values in hardware GPRs after the guest GPR values are
saved to the vcpu_vmx structure.
This is a partial mitigation for CVE 2017-5715 and CVE 2017-5753.
Specifically, it defeats the Project Zero PoC for CVE 2017-5715.
Suggested-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Eric Northup <digitaleric@google.com>
Reviewed-by: Benjamin Serebrin <serebrin@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
[Paolo: Add AMD bits, Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- A bug in handling of SPE state for non-vhe systems
- A fix for a crash on system shutdown
- Three timer fixes, introduced by the timer optimizations for v4.15
x86 fixes:
- fix for a WARN that was introduced in 4.15
- fix for SMM when guest uses PCID
- fixes for several bugs found by syzkaller
... and a dozen papercut fixes for the kvm_stat tool.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJaO6N9AAoJEL/70l94x66DC1wH/Rf+u0Cj6ZQil6LK6Nf8bfPd
3TqrwrxUDeXwi8GzsvK14izBr1mDzidSHIO0Q4XINFRSRdaf43h3R2im/SJqvNhP
xktCmJI2CxN96oaC7kIExgwf3YKhFdLIADfbT8oR9p3xZG/+c97dkr3b4XtmVCDb
ZXdUEOcKnoW4zwpfJN30FLlq4OwYvuYVz02AEfPivZRDfhhus/TYSnuSdxH8CLNf
75ymuKyXoo/RELbimwbMk8Cm9+ey7PjlUGOgbnbXIFtmgznXhLzAOeES2B+46J5b
sMBPlmiJrn6N//lM18CC5yOBzBLGsYOoXggtw4aU/5nM4GVcFebWedpcoD4D8Jw=
=Bt8w
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM fixes:
- A bug in handling of SPE state for non-vhe systems
- A fix for a crash on system shutdown
- Three timer fixes, introduced by the timer optimizations for v4.15
x86 fixes:
- fix for a WARN that was introduced in 4.15
- fix for SMM when guest uses PCID
- fixes for several bugs found by syzkaller
... and a dozen papercut fixes for the kvm_stat tool"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
tools/kvm_stat: sort '-f help' output
kvm: x86: fix RSM when PCID is non-zero
KVM: Fix stack-out-of-bounds read in write_mmio
KVM: arm/arm64: Fix timer enable flow
KVM: arm/arm64: Properly handle arch-timer IRQs after vtimer_save_state
KVM: arm/arm64: timer: Don't set irq as forwarded if no usable GIC
KVM: arm/arm64: Fix HYP unmapping going off limits
arm64: kvm: Prevent restoring stale PMSCR_EL1 for vcpu
KVM/x86: Check input paging mode when cs.l is set
tools/kvm_stat: add line for totals
tools/kvm_stat: stop ignoring unhandled arguments
tools/kvm_stat: suppress usage information on command line errors
tools/kvm_stat: handle invalid regular expressions
tools/kvm_stat: add hint on '-f help' to man page
tools/kvm_stat: fix child trace events accounting
tools/kvm_stat: fix extra handling of 'help' with fields filter
tools/kvm_stat: fix missing field update after filter change
tools/kvm_stat: fix drilldown in events-by-guests mode
tools/kvm_stat: fix command line option '-g'
kvm: x86: fix WARN due to uninitialized guest FPU state
...
rsm_load_state_64() and rsm_enter_protected_mode() load CR3, then
CR4 & ~PCIDE, then CR0, then CR4.
However, setting CR4.PCIDE fails if CR3[11:0] != 0. It's probably easier
in the long run to replace rsm_enter_protected_mode() with an emulator
callback that sets all the special registers (like KVM_SET_SREGS would
do). For now, set the PCID field of CR3 only after CR4.PCIDE is 1.
Reported-by: Laszlo Ersek <lersek@redhat.com>
Tested-by: Laszlo Ersek <lersek@redhat.com>
Fixes: 660a5d517a
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86 syscall entry code changes for PTI from Ingo Molnar:
"The main changes here are Andy Lutomirski's changes to switch the
x86-64 entry code to use the 'per CPU entry trampoline stack'. This,
besides helping fix KASLR leaks (the pending Page Table Isolation
(PTI) work), also robustifies the x86 entry code"
* 'WIP.x86-pti.entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
x86/cpufeatures: Make CPU bugs sticky
x86/paravirt: Provide a way to check for hypervisors
x86/paravirt: Dont patch flush_tlb_single
x86/entry/64: Make cpu_entry_area.tss read-only
x86/entry: Clean up the SYSENTER_stack code
x86/entry/64: Remove the SYSENTER stack canary
x86/entry/64: Move the IST stacks into struct cpu_entry_area
x86/entry/64: Create a per-CPU SYSCALL entry trampoline
x86/entry/64: Return to userspace from the trampoline stack
x86/entry/64: Use a per-CPU trampoline stack for IDT entries
x86/espfix/64: Stop assuming that pt_regs is on the entry stack
x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
x86/entry: Remap the TSS into the CPU entry area
x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
x86/dumpstack: Handle stack overflow on all stacks
x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
x86/kasan/64: Teach KASAN about the cpu_entry_area
x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce struct cpu_entry_area
x86/entry/gdt: Put per-CPU GDT remaps in ascending order
x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
...
Reported by syzkaller:
BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298
CPU: 6 PID: 32298 Comm: syz-executor Tainted: G OE 4.15.0-rc2+ #18
Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
Call Trace:
dump_stack+0xab/0xe1
print_address_description+0x6b/0x290
kasan_report+0x28a/0x370
write_mmio+0x11e/0x270 [kvm]
emulator_read_write_onepage+0x311/0x600 [kvm]
emulator_read_write+0xef/0x240 [kvm]
emulator_fix_hypercall+0x105/0x150 [kvm]
em_hypercall+0x2b/0x80 [kvm]
x86_emulate_insn+0x2b1/0x1640 [kvm]
x86_emulate_instruction+0x39a/0xb90 [kvm]
handle_exception+0x1b4/0x4d0 [kvm_intel]
vcpu_enter_guest+0x15a0/0x2640 [kvm]
kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
kvm_vcpu_ioctl+0x479/0x880 [kvm]
do_vfs_ioctl+0x142/0x9a0
SyS_ioctl+0x74/0x80
entry_SYSCALL_64_fastpath+0x23/0x9a
The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741). This patch fixes
it by just accessing the bytes which we operate on.
Before patch:
syz-executor-5567 [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f
After patch:
syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout. A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.962042855@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A future patch will move SYSENTER_stack to the beginning of cpu_tss
to help detect overflow. Before this can happen, fix several code
paths that hardcode assumptions about the old layout.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.722425540@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
------------[ cut here ]------------
Bad FPU state detected at kvm_put_guest_fpu+0xd8/0x2d0 [kvm], reinitializing FPU registers.
WARNING: CPU: 1 PID: 4594 at arch/x86/mm/extable.c:103 ex_handler_fprestore+0x88/0x90
CPU: 1 PID: 4594 Comm: qemu-system-x86 Tainted: G B OE 4.15.0-rc2+ #10
RIP: 0010:ex_handler_fprestore+0x88/0x90
Call Trace:
fixup_exception+0x4e/0x60
do_general_protection+0xff/0x270
general_protection+0x22/0x30
RIP: 0010:kvm_put_guest_fpu+0xd8/0x2d0 [kvm]
RSP: 0018:ffff8803d5627810 EFLAGS: 00010246
kvm_vcpu_reset+0x3b4/0x3c0 [kvm]
kvm_apic_accept_events+0x1c0/0x240 [kvm]
kvm_arch_vcpu_ioctl_run+0x1658/0x2fb0 [kvm]
kvm_vcpu_ioctl+0x479/0x880 [kvm]
do_vfs_ioctl+0x142/0x9a0
SyS_ioctl+0x74/0x80
do_syscall_64+0x15f/0x600
where kvm_put_guest_fpu is called without a prior kvm_load_guest_fpu.
To fix it, move kvm_load_guest_fpu to the very beginning of
kvm_arch_vcpu_ioctl_run.
Cc: stable@vger.kernel.org
Fixes: f775b13eed
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The below test case can cause infinite loop in kvm when ept=0.
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdint.h>
#include <linux/kvm.h>
#include <fcntl.h>
#include <sys/ioctl.h>
long r[5];
int main()
{
r[2] = open("/dev/kvm", O_RDONLY);
r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
ioctl(r[4], KVM_RUN, 0);
}
It doesn't setup the memory regions, mmu_alloc_shadow/direct_roots() in
kvm return 1 when kvm fails to allocate root page table which can result
in beblow infinite loop:
vcpu_run() {
for (;;) {
r = vcpu_enter_guest()::kvm_mmu_reload() returns 1
if (r <= 0)
break;
if (need_resched())
cond_resched();
}
}
This patch fixes it by returning -ENOSPC when there is no available kvm mmu
page for root page table.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 26eeb53cf0 (KVM: MMU: Bail out immediately if there is no available mmu page)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implementation of the unpinned APIC page didn't update the VMCS address
cache when invalidation was done through range mmu notifiers.
This became a problem when the page notifier was removed.
Re-introduce the arch-specific helper and call it from ...range_start.
Reported-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Fixes: 38b9917350 ("kvm: vmx: Implement set_apic_access_page_addr")
Fixes: 369ea8242c ("mm/rmap: update to new mmu_notifier semantic v2")
Cc: <stable@vger.kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Tested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
vmx_io_bitmap_b should not be allocated twice.
Fixes: 2361133293 ("KVM: VMX: refactor setup of global page-sized bitmaps")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This fixes CVE-2017-1000407.
KVM allows guests to directly access I/O port 0x80 on Intel hosts. If
the guest floods this port with writes it generates exceptions and
instability in the host kernel, leading to a crash. With this change
guest writes to port 0x80 on Intel will behave the same as they
currently behave on AMD systems.
Prevent the flooding by removing the code that sets port 0x80 as a
passthrough port. This is essentially the same as upstream patch
99f85a28a7, except that patch was
for AMD chipsets and this patch is for Intel.
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Fixes: fdef3ad1b3 ("KVM: VMX: Enable io bitmaps to avoid IO port 0x80 VMEXITs")
Cc: <stable@vger.kernel.org>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Now that get_fpu and put_fpu do nothing, because the scheduler will
automatically load and restore the guest FPU context for us while we
are in this code (deep inside the vcpu_run main loop), we can get rid
of the get_fpu and put_fpu hooks.
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.
This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.
This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.
This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:
[258270.527947] __warn+0xcb/0xf0
[258270.527948] warn_slowpath_null+0x1d/0x20
[258270.527951] kernel_fpu_disable+0x3f/0x50
[258270.527953] __kernel_fpu_begin+0x49/0x100
[258270.527955] kernel_fpu_begin+0xe/0x10
[258270.527958] crc32c_pcl_intel_update+0x84/0xb0
[258270.527961] crypto_shash_update+0x3f/0x110
[258270.527968] crc32c+0x63/0x8a [libcrc32c]
[258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
[258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
[258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
[258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
[258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
[258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
[258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
[258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
[258270.528002] shrink_slab.part.40+0x1f5/0x420
[258270.528004] shrink_node+0x22c/0x320
[258270.528006] do_try_to_free_pages+0xf5/0x330
[258270.528008] try_to_free_pages+0xe9/0x190
[258270.528009] __alloc_pages_slowpath+0x40f/0xba0
[258270.528011] __alloc_pages_nodemask+0x209/0x260
[258270.528014] alloc_pages_vma+0x1f1/0x250
[258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
[258270.528021] handle_mm_fault+0xfd3/0x1330
[258270.528025] __get_user_pages+0x113/0x640
[258270.528027] get_user_pages+0x4f/0x60
[258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
[258270.528108] try_async_pf+0x66/0x230 [kvm]
[258270.528135] tdp_page_fault+0x130/0x280 [kvm]
[258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
[258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
[258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]
No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.
Cc: stable@vger.kernel.org
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
"any unblocked signal received [...] will cause KVM_RUN to return with
-EINTR" and that "the signal will only be delivered if not blocked by
the original signal mask".
This, however, is only true, when the calling task has a signal handler
registered for a signal. If not, signal evaluation is short-circuited for
SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
returning or the whole process is terminated.
Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
to that in do_sigtimedwait() to avoid short-circuiting of signals.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported by syzkaller:
------------[ cut here ]------------
WARNING: CPU: 5 PID: 2939 at arch/x86/kvm/vmx.c:3844 free_loaded_vmcs+0x77/0x80 [kvm_intel]
CPU: 5 PID: 2939 Comm: repro Not tainted 4.14.0+ #26
RIP: 0010:free_loaded_vmcs+0x77/0x80 [kvm_intel]
Call Trace:
vmx_free_vcpu+0xda/0x130 [kvm_intel]
kvm_arch_destroy_vm+0x192/0x290 [kvm]
kvm_put_kvm+0x262/0x560 [kvm]
kvm_vm_release+0x2c/0x30 [kvm]
__fput+0x190/0x370
task_work_run+0xa1/0xd0
do_exit+0x4d2/0x13e0
do_group_exit+0x89/0x140
get_signal+0x318/0xb80
do_signal+0x8c/0xb40
exit_to_usermode_loop+0xe4/0x140
syscall_return_slowpath+0x206/0x230
entry_SYSCALL_64_fastpath+0x98/0x9a
The syzkaller testcase will execute VMXON/VMLAUCH instructions, so the
vmx->nested stuff is populated, it will also issue KVM_SMI ioctl. However,
the testcase is just a simple c program and not be lauched by something
like seabios which implements smi_handler. Commit 05cade71cf (KVM: nSVM:
fix SMI injection in guest mode) gets out of guest mode and set nested.vmxon
to false for the duration of SMM according to SDM 34.14.1 "leave VMX
operation" upon entering SMM. We can't alloc/free the vmx->nested stuff
each time when entering/exiting SMM since it will induce more overhead. So
the function vmx_pre_enter_smm() marks nested.vmxon false even if vmx->nested
stuff is still populated. What it expected is em_rsm() can mark nested.vmxon
to be true again. However, the smi_handler/rsm will not execute since there
is no something like seabios in this scenario. The function free_nested()
fails to free the vmx->nested stuff since the vmx->nested.vmxon is false
which results in the above warning.
This patch fixes it by also considering the no SMI handler case, luckily
vmx->nested.smm.vmxon is marked according to the value of vmx->nested.vmxon
in vmx_pre_enter_smm(), we can take advantage of it and free vmx->nested
stuff when L1 goes down.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Fixes: 05cade71cf (KVM: nSVM: fix SMI injection in guest mode)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported by syzkaller:
*** Guest State ***
CR0: actual=0x0000000080010031, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
CR4: actual=0x0000000000002061, shadow=0x0000000000000000, gh_mask=ffffffffffffe8f1
CR3 = 0x000000002081e000
RSP = 0x000000000000fffa RIP = 0x0000000000000000
RFLAGS=0x00023000 DR7 = 0x00000000000000
^^^^^^^^^^
------------[ cut here ]------------
WARNING: CPU: 6 PID: 24431 at /home/kernel/linux/arch/x86/kvm//x86.c:7302 kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
CPU: 6 PID: 24431 Comm: reprotest Tainted: G W OE 4.14.0+ #26
RIP: 0010:kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
RSP: 0018:ffff880291d179e0 EFLAGS: 00010202
Call Trace:
kvm_vcpu_ioctl+0x479/0x880 [kvm]
do_vfs_ioctl+0x142/0x9a0
SyS_ioctl+0x74/0x80
entry_SYSCALL_64_fastpath+0x23/0x9a
The failed vmentry is triggered by the following beautified testcase:
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdint.h>
#include <linux/kvm.h>
#include <fcntl.h>
#include <sys/ioctl.h>
long r[5];
int main()
{
struct kvm_debugregs dr = { 0 };
r[2] = open("/dev/kvm", O_RDONLY);
r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
struct kvm_guest_debug debug = {
.control = 0xf0403,
.arch = {
.debugreg[6] = 0x2,
.debugreg[7] = 0x2
}
};
ioctl(r[4], KVM_SET_GUEST_DEBUG, &debug);
ioctl(r[4], KVM_RUN, 0);
}
which testcase tries to setup the processor specific debug
registers and configure vCPU for handling guest debug events through
KVM_SET_GUEST_DEBUG. The KVM_SET_GUEST_DEBUG ioctl will get and set
rflags in order to set TF bit if single step is needed. All regs' caches
are reset to avail and GUEST_RFLAGS vmcs field is reset to 0x2 during vCPU
reset. However, the cache of rflags is not reset during vCPU reset. The
function vmx_get_rflags() returns an unreset rflags cache value since
the cache is marked avail, it is 0 after boot. Vmentry fails if the
rflags reserved bit 1 is 0.
This patch fixes it by resetting both the GUEST_RFLAGS vmcs field and
its cache to 0x2 during vCPU reset.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-x86:10185]
CPU: 6 PID: 10185 Comm: qemu-system-x86 Tainted: G OE 4.14.0-rc4+ #4
RIP: 0010:kvm_get_time_scale+0x4e/0xa0 [kvm]
Call Trace:
get_time_ref_counter+0x5a/0x80 [kvm]
kvm_hv_process_stimers+0x120/0x5f0 [kvm]
kvm_arch_vcpu_ioctl_run+0x4b4/0x1690 [kvm]
kvm_vcpu_ioctl+0x33a/0x620 [kvm]
do_vfs_ioctl+0xa1/0x5d0
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x1e/0xa9
This can be reproduced when running kvm-unit-tests/hyperv_stimer.flat and
cpu-hotplug stress simultaneously. __this_cpu_read(cpu_tsc_khz) returns 0
(set in kvmclock_cpu_down_prep()) when the pCPU is unhotplug which results
in kvm_get_time_scale() gets into an infinite loop.
This patch fixes it by treating the unhotplug pCPU as not using master clock.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In x2apic mode the LDR is fixed based on the ID rather
than separately loadable like it was before x2.
When kvm_apic_set_state is called, the base is set, and if
it has the X2APIC_ENABLE flag set then the LDR is calculated;
however that value gets overwritten by the memcpy a few lines
below overwriting it with the value that came from userland.
The symptom is a lack of EOI after loading the state
(e.g. after a QEMU migration) and is due to the EOI bitmap
being wrong due to the incorrect LDR. This was seen with
a Win2016 guest under Qemu with irqchip=split whose USB mouse
didn't work after a VM migration.
This corresponds to RH bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1502591
Reported-by: Yiqian Wei <yiwei@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: stable@vger.kernel.org
[Applied fixup from Liran Alon. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split out the ldr calculation from kvm_apic_set_x2apic_id
since we're about to reuse it in the following patch.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These bits were not defined until now in common code, but they are
now that the kernel supports UMIP too.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function for CPUID 80000001 ECX is set to 0xc0000001. Set it to
0x80000001.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Fixes: d6321d4933 ("KVM: x86: generalize guest_cpuid_has_ helpers")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>