Commit Graph

44287 Commits

Author SHA1 Message Date
Sean Christopherson
1c18efdaa3 KVM: nVMX: Use KVM-governed feature framework to track "nested VMX enabled"
Track "VMX exposed to L1" via a governed feature flag instead of using a
dedicated helper to provide the same functionality.  The main goal is to
drive convergence between VMX and SVM with respect to querying features
that are controllable via module param (SVM likes to cache nested
features), avoiding the guest CPUID lookups at runtime is just a bonus
and unlikely to provide any meaningful performance benefits.

Note, X86_FEATURE_VMX is set in kvm_cpu_caps if and only if "nested" is
true, and the CPU obviously supports VMX if KVM+VMX is running.  I.e. the
check on "nested" is now implicitly down by the kvm_cpu_cap_has() check
in kvm_governed_feature_check_and_set().

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviwed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:40:55 -07:00
Sean Christopherson
fe60e8f65f KVM: x86: Use KVM-governed feature framework to track "XSAVES enabled"
Use the governed feature framework to track if XSAVES is "enabled", i.e.
if XSAVES can be used by the guest.  Add a comment in the SVM code to
explain the very unintuitive logic of deliberately NOT checking if XSAVES
is enumerated in the guest CPUID model.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:28 -07:00
Sean Christopherson
662f681578 KVM: VMX: Rename XSAVES control to follow KVM's preferred "ENABLE_XYZ"
Rename the XSAVES secondary execution control to follow KVM's preferred
style so that XSAVES related logic can use common macros that depend on
KVM's preferred style.

No functional change intended.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230815203653.519297-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:28 -07:00
Sean Christopherson
0497d2ac9b KVM: VMX: Check KVM CPU caps, not just VMX MSR support, for XSAVE enabling
Check KVM CPU capabilities instead of raw VMX support for XSAVES when
determining whether or not XSAVER can/should be exposed to the guest.
Practically speaking, it's nonsensical/impossible for a CPU to support
"enable XSAVES" without XSAVES being supported natively.  The real
motivation for checking kvm_cpu_cap_has() is to allow using the governed
feature's standard check-and-set logic.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:28 -07:00
Sean Christopherson
1143c0b85c KVM: VMX: Recompute "XSAVES enabled" only after CPUID update
Recompute whether or not XSAVES is enabled for the guest only if the
guest's CPUID model changes instead of redoing the computation every time
KVM generates vmcs01's secondary execution controls.  The boot_cpu_has()
and cpu_has_vmx_xsaves() checks should never change after KVM is loaded,
and if they do the kernel/KVM is hosed.

Opportunistically add a comment explaining _why_ XSAVES is effectively
exposed to the guest if and only if XSAVE is also exposed to the guest.

Practically speaking, no functional change intended (KVM will do fewer
computations, but should still see the same xsaves_enabled value whenever
KVM looks at it).

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:28 -07:00
Sean Christopherson
ccf31d6e6c KVM: x86/mmu: Use KVM-governed feature framework to track "GBPAGES enabled"
Use the governed feature framework to track whether or not the guest can
use 1GiB pages, and drop the one-off helper that wraps the surprisingly
non-trivial logic surrounding 1GiB page usage in the guest.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:27 -07:00
Sean Christopherson
42764413d1 KVM: x86: Add a framework for enabling KVM-governed x86 features
Introduce yet another X86_FEATURE flag framework to manage and cache KVM
governed features (for lack of a better name).  "Governed" in this case
means that KVM has some level of involvement and/or vested interest in
whether or not an X86_FEATURE can be used by the guest.  The intent of the
framework is twofold: to simplify caching of guest CPUID flags that KVM
needs to frequently query, and to add clarity to such caching, e.g. it
isn't immediately obvious that SVM's bundle of flags for "optional nested
SVM features" track whether or not a flag is exposed to L1.

Begrudgingly define KVM_MAX_NR_GOVERNED_FEATURES for the size of the
bitmap to avoid exposing governed_features.h in arch/x86/include/asm/, but
add a FIXME to call out that it can and should be cleaned up once
"struct kvm_vcpu_arch" is no longer expose to the kernel at large.

Cc: Zeng Guang <guang.zeng@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:38:27 -07:00
Li zeming
392a532462 x86: kvm: x86: Remove unnecessary initial values of variables
bitmap and khz is assigned first, so it does not need to initialize the
assignment.

Signed-off-by: Li zeming <zeming@nfschina.com>
Link: https://lore.kernel.org/r/20230817002631.2885-1-zeming@nfschina.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:35:28 -07:00
Sean Christopherson
7b0151caf7 KVM: x86: Remove WARN sanity check on hypervisor timer vs. UNINITIALIZED vCPU
Drop the WARN in KVM_RUN that asserts that KVM isn't using the hypervisor
timer, a.k.a. the VMX preemption timer, for a vCPU that is in the
UNINITIALIZIED activity state.  The intent of the WARN is to sanity check
that KVM won't drop a timer interrupt due to an unexpected transition to
UNINITIALIZED, but unfortunately userspace can use various ioctl()s to
force the unexpected state.

Drop the sanity check instead of switching from the hypervisor timer to a
software based timer, as the only reason to switch to a software timer
when a vCPU is blocking is to ensure the timer interrupt wakes the vCPU,
but said interrupt isn't a valid wake event for vCPUs in UNINITIALIZED
state *and* the interrupt will be dropped in the end.

Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com>
Closes: https://lore.kernel.org/all/CALcu4rbFrU4go8sBHk3FreP+qjgtZCGcYNpSiEXOLm==qFv7iQ@mail.gmail.com
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230808232057.2498287-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:30:43 -07:00
Like Xu
765da7fe0e KVM: x86: Remove break statements that will never be executed
Fix compiler warnings when compiling KVM with [-Wunreachable-code-break].
No functional change intended.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230807094243.32516-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:28:00 -07:00
Sean Christopherson
223f93d4d8 KVM: nSVM: Skip writes to MSR_AMD64_TSC_RATIO if guest state isn't loaded
Skip writes to MSR_AMD64_TSC_RATIO that are done in the context of a vCPU
if guest state isn't loaded, i.e. if KVM will update MSR_AMD64_TSC_RATIO
during svm_prepare_switch_to_guest() before entering the guest.  Checking
guest_state_loaded may or may not be a net positive for performance as
the current_tsc_ratio cache will optimize away duplicate WRMSRs in the
vast majority of scenarios.  However, the cost of the check is negligible,
and the real motivation is to document that KVM needs to load the vCPU's
value only when running the vCPU.

Link: https://lore.kernel.org/r/20230729011608.1065019-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:30 -07:00
Sean Christopherson
2d63699099 KVM: x86: Always write vCPU's current TSC offset/ratio in vendor hooks
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for
propagating TSC offsets/multipliers into hardware, and instead have the
vendor implementations pull the information directly from the vCPU
structure.  The respective vCPU fields _must_ be written at the same
time in order to maintain consistent state, i.e. it's not random luck
that the value passed in by all callers is grabbed from the vCPU.

Explicitly grabbing the value from the vCPU field in SVM's implementation
in particular will allow for additional cleanup without introducing even
more subtle dependencies.  Specifically, SVM can skip the WRMSR if guest
state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the
correct value for the vCPU prior to entering the guest.

This also reconciles KVM's handling of related values that are stored in
the vCPU, as svm_write_tsc_offset() already assumes/requires the caller
to have updated l1_tsc_offset.

Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:29 -07:00
Sean Christopherson
229725acfa KVM: SVM: Clean up preemption toggling related to MSR_AMD64_TSC_RATIO
Explicitly disable preemption when writing MSR_AMD64_TSC_RATIO only in the
"outer" helper, as all direct callers of the "inner" helper now run with
preemption already disabled.  And that isn't a coincidence, as the outer
helper requires a vCPU and is intended to be used when modifying guest
state and/or emulating guest instructions, which are typically done with
preemption enabled.

Direct use of the inner helper should be extremely limited, as the only
time KVM should modify MSR_AMD64_TSC_RATIO without a vCPU is when
sanitizing the MSR for a specific pCPU (currently done when {en,dis}abling
disabling SVM). The other direct caller is svm_prepare_switch_to_guest(),
which does have a vCPU, but is a one-off special case: KVM is about to
enter the guest on a specific pCPU and thus must have preemption disabled.

Link: https://lore.kernel.org/r/20230729011608.1065019-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:29 -07:00
Sean Christopherson
c0dc39bd2c KVM: nSVM: Use the "outer" helper for writing multiplier to MSR_AMD64_TSC_RATIO
When emulating nested SVM transitions, use the outer helper for writing
the TSC multiplier for L2.  Using the inner helper only for one-off cases,
i.e. for paths where KVM is NOT emulating or modifying vCPU state, will
allow for multiple cleanups:

 - Explicitly disabling preemption only in the outer helper
 - Getting the multiplier from the vCPU field in the outer helper
 - Skipping the WRMSR in the outer helper if guest state isn't loaded

Opportunistically delete an extra newline.

No functional change intended.

Link: https://lore.kernel.org/r/20230729011608.1065019-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:28 -07:00
Sean Christopherson
0c94e24684 KVM: nSVM: Load L1's TSC multiplier based on L1 state, not L2 state
When emulating nested VM-Exit, load L1's TSC multiplier if L1's desired
ratio doesn't match the current ratio, not if the ratio L1 is using for
L2 diverges from the default.  Functionally, the end result is the same
as KVM will run L2 with L1's multiplier if L2's multiplier is the default,
i.e. checking that L1's multiplier is loaded is equivalent to checking if
L2 has a non-default multiplier.

However, the assertion that TSC scaling is exposed to L1 is flawed, as
userspace can trigger the WARN at will by writing the MSR and then
updating guest CPUID to hide the feature (modifying guest CPUID is
allowed anytime before KVM_RUN).  E.g. hacking KVM's state_test
selftest to do

                vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
                vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);

after restoring state in a new VM+vCPU yields an endless supply of:

  ------------[ cut here ]------------
  WARNING: CPU: 10 PID: 206939 at arch/x86/kvm/svm/nested.c:1105
           nested_svm_vmexit+0x6af/0x720 [kvm_amd]
  Call Trace:
   nested_svm_exit_handled+0x102/0x1f0 [kvm_amd]
   svm_handle_exit+0xb9/0x180 [kvm_amd]
   kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
   kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
   ? trace_hardirqs_off+0x4d/0xa0
   __se_sys_ioctl+0x7a/0xc0
   __x64_sys_ioctl+0x21/0x30
   do_syscall_64+0x41/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

Unlike the nested VMRUN path, hoisting the svm->tsc_scaling_enabled check
into the if-statement is wrong as KVM needs to ensure L1's multiplier is
loaded in the above scenario.   Alternatively, the WARN_ON() could simply
be deleted, but that would make KVM's behavior even more subtle, e.g. it's
not immediately obvious why it's safe to write MSR_AMD64_TSC_RATIO when
checking only tsc_ratio_msr.

Fixes: 5228eb96a4 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230729011608.1065019-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:27 -07:00
Sean Christopherson
7cafe9b8e2 KVM: nSVM: Check instead of asserting on nested TSC scaling support
Check for nested TSC scaling support on nested SVM VMRUN instead of
asserting that TSC scaling is exposed to L1 if L1's MSR_AMD64_TSC_RATIO
has diverged from KVM's default.  Userspace can trigger the WARN at will
by writing the MSR and then updating guest CPUID to hide the feature
(modifying guest CPUID is allowed anytime before KVM_RUN).  E.g. hacking
KVM's state_test selftest to do

		vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
		vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);

after restoring state in a new VM+vCPU yields an endless supply of:

  ------------[ cut here ]------------
  WARNING: CPU: 164 PID: 62565 at arch/x86/kvm/svm/nested.c:699
           nested_vmcb02_prepare_control+0x3d6/0x3f0 [kvm_amd]
  Call Trace:
   <TASK>
   enter_svm_guest_mode+0x114/0x560 [kvm_amd]
   nested_svm_vmrun+0x260/0x330 [kvm_amd]
   vmrun_interception+0x29/0x30 [kvm_amd]
   svm_invoke_exit_handler+0x35/0x100 [kvm_amd]
   svm_handle_exit+0xe7/0x180 [kvm_amd]
   kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
   kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
   __se_sys_ioctl+0x7a/0xc0
   __x64_sys_ioctl+0x21/0x30
   do_syscall_64+0x41/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
  RIP: 0033:0x45ca1b

Note, the nested #VMEXIT path has the same flaw, but needs a different
fix and will be handled separately.

Fixes: 5228eb96a4 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230729011608.1065019-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 17:16:27 -07:00
Tao Su
99b6685453 KVM: x86: Advertise AMX-COMPLEX CPUID to userspace
Latest Intel platform GraniteRapids-D introduces AMX-COMPLEX, which adds
two instructions to perform matrix multiplication of two tiles containing
complex elements and accumulate the results into a packed single precision
tile.

AMX-COMPLEX is enumerated via CPUID.(EAX=7,ECX=1):EDX[bit 8]

Advertise AMX_COMPLEX if it's supported in hardware.  There are no VMX
controls for the feature, i.e. the instructions can't be interecepted, and
KVM advertises base AMX in CPUID if AMX is supported in hardware, even if
KVM doesn't advertise AMX as being supported in XCR0, e.g. because the
process didn't opt-in to allocating tile data.

Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230802022954.193843-1-tao1.su@linux.intel.com
[sean: tweak last paragraph of changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:40:17 -07:00
Sean Christopherson
a788fbb763 KVM: VMX: Skip VMCLEAR logic during emergency reboots if CR4.VMXE=0
Bail from vmx_emergency_disable() without processing the list of loaded
VMCSes if CR4.VMXE=0, i.e. if the CPU can't be post-VMXON.  It should be
impossible for the list to have entries if VMX is already disabled, and
even if that invariant doesn't hold, VMCLEAR will #UD anyways, i.e.
processing the list is pointless even if it somehow isn't empty.

Assuming no existing KVM bugs, this should be a glorified nop.  The
primary motivation for the change is to avoid having code that looks like
it does VMCLEAR, but then skips VMXON, which is nonsensical.

Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
2e6b9bd49b KVM: SVM: Use "standard" stgi() helper when disabling SVM
Now that kvm_rebooting is guaranteed to be true prior to disabling SVM
in an emergency, use the existing stgi() helper instead of open coding
STGI.  In effect, eat faults on STGI if and only if kvm_rebooting==true.

Link: https://lore.kernel.org/r/20230721201859.2307736-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
6ae44e012f KVM: x86: Force kvm_rebooting=true during emergency reboot/crash
Set kvm_rebooting when virtualization is disabled in an emergency so that
KVM eats faults on virtualization instructions even if kvm_reboot() isn't
reached.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
76ab816108 x86/virt: KVM: Move "disable SVM" helper into KVM SVM
Move cpu_svm_disable() into KVM proper now that all hardware
virtualization management is routed through KVM.  Remove the now-empty
virtext.h.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
f9a8866040 KVM: VMX: Ensure CPU is stable when probing basic VMX support
Disable migration when probing VMX support during module load to ensure
the CPU is stable, mostly to match similar SVM logic, where allowing
migration effective requires deliberately writing buggy code.  As a bonus,
KVM won't report the wrong CPU to userspace if VMX is unsupported, but in
practice that is a very, very minor bonus as the only way that reporting
the wrong CPU would actually matter is if hardware is broken or if the
system is misconfigured, i.e. if KVM gets migrated from a CPU that _does_
support VMX to a CPU that does _not_ support VMX.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
c4db4f20f3 KVM: SVM: Check that the current CPU supports SVM in kvm_is_svm_supported()
Check "this" CPU instead of the boot CPU when querying SVM support so that
the per-CPU checks done during hardware enabling actually function as
intended, i.e. will detect issues where SVM isn't support on all CPUs.

Disable migration for the use from svm_init() mostly so that the standard
accessors for the per-CPU data can be used without getting yelled at by
CONFIG_DEBUG_PREEMPT=y sanity checks.  Preventing the "disabled by BIOS"
error message from reporting the wrong CPU is largely a bonus, as ensuring
a stable CPU during module load is a non-goal for KVM.

Link: https://lore.kernel.org/all/ZAdxNgv0M6P63odE@google.com
Cc: Kai Huang <kai.huang@intel.com>
Cc: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
85fd29dd5f x86/virt: KVM: Open code cpu_has_svm() into kvm_is_svm_supported()
Fold the guts of cpu_has_svm() into kvm_is_svm_supported(), its sole
remaining user.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
5df8ecfe36 x86/virt: Drop unnecessary check on extended CPUID level in cpu_has_svm()
Drop the explicit check on the extended CPUID level in cpu_has_svm(), the
kernel's cached CPUID info will leave the entire SVM leaf unset if said
leaf is not supported by hardware.  Prior to using cached information,
the check was needed to avoid false positives due to Intel's rather crazy
CPUID behavior of returning the values of the maximum supported leaf if
the specified leaf is unsupported.

Fixes: 682a810887 ("x86/kvm/svm: Simplify cpu_has_svm()")
Link: https://lore.kernel.org/r/20230721201859.2307736-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
554856b69e KVM: SVM: Make KVM_AMD depend on CPU_SUP_AMD or CPU_SUP_HYGON
Make building KVM SVM support depend on support for AMD or Hygon.  KVM
already effectively restricts SVM support to AMD and Hygon by virtue of
the vendor string checks in cpu_has_svm(), and KVM VMX supports depends
on one of its three known vendors (Intel, Centaur, or Zhaoxin).

Add the CPU_SUP_HYGON clause even though CPU_SUP_HYGON selects CPU_SUP_AMD
to document that KVM SVM support isn't just for AMD CPUs, and to prevent
breakage should Hygon support ever become a standalone thing.

Link: https://lore.kernel.org/r/20230721201859.2307736-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
22e420e127 x86/virt: KVM: Move VMXOFF helpers into KVM VMX
Now that VMX is disabled in emergencies via the virt callbacks, move the
VMXOFF helpers into KVM, the only remaining user.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:15 -07:00
Sean Christopherson
b6a6af0d19 x86/virt: KVM: Open code cpu_has_vmx() in KVM VMX
Fold the raw CPUID check for VMX into kvm_is_vmx_supported(), its sole
user.  Keep the check even though KVM also checks X86_FEATURE_VMX, as the
intent is to provide a unique error message if VMX is unsupported by
hardware, whereas X86_FEATURE_VMX may be clear due to firmware and/or
kernel actions.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
261cd5ed93 x86/reboot: Expose VMCS crash hooks if and only if KVM_{INTEL,AMD} is enabled
Expose the crash/reboot hooks used by KVM to disable virtualization in
hardware and unblock INIT only if there's a potential in-tree user,
i.e. either KVM_INTEL or KVM_AMD is enabled.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
59765db5fc x86/reboot: Disable virtualization during reboot iff callback is registered
Attempt to disable virtualization during an emergency reboot if and only
if there is a registered virt callback, i.e. iff a hypervisor (KVM) is
active.  If there's no active hypervisor, then the CPU can't be operating
with VMX or SVM enabled (barring an egregious bug).

Checking for a valid callback instead of simply for SVM or VMX support
can also eliminates spurious NMIs by avoiding the unecessary call to
nmi_shootdown_cpus_on_restart().

Note, IRQs are disabled, which prevents KVM from coming along and
enabling virtualization after the fact.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
edc8deb087 x86/reboot: Hoist "disable virt" helpers above "emergency reboot" path
Move the various "disable virtualization" helpers above the emergency
reboot path so that emergency_reboot_disable_virtualization() can be
stubbed out in a future patch if neither KVM_INTEL nor KVM_AMD is enabled,
i.e. if there is no in-tree user of CPU virtualization.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
ad93c1a7c0 x86/reboot: Assert that IRQs are disabled when turning off virtualization
Assert that IRQs are disabled when turning off virtualization in an
emergency.  KVM enables hardware via on_each_cpu(), i.e. could re-enable
hardware if a pending IPI were delivered after disabling virtualization.

Remove a misleading comment from emergency_reboot_disable_virtualization()
about "just" needing to guarantee the CPU is stable (see above).

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
baeb4de7ad x86/reboot: KVM: Disable SVM during reboot via virt/KVM reboot callback
Use the virt callback to disable SVM (and set GIF=1) during an emergency
instead of blindly attempting to disable SVM.  Like the VMX case, if a
hypervisor, i.e. KVM, isn't loaded/active, SVM can't be in use.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
119b5cb4ff x86/reboot: KVM: Handle VMXOFF in KVM's reboot callback
Use KVM VMX's reboot/crash callback to do VMXOFF in an emergency instead
of manually and blindly doing VMXOFF.  There's no need to attempt VMXOFF
if a hypervisor, i.e. KVM, isn't loaded/active, i.e. if the CPU can't
possibly be post-VMXON.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
5e408396c6 x86/reboot: Harden virtualization hooks for emergency reboot
Provide dedicated helpers to (un)register virt hooks used during an
emergency crash/reboot, and WARN if there is an attempt to overwrite
the registered callback, or an attempt to do an unpaired unregister.

Opportunsitically use rcu_assign_pointer() instead of RCU_INIT_POINTER(),
mainly so that the set/unset paths are more symmetrical, but also because
any performance gains from using RCU_INIT_POINTER() are meaningless for
this code.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
b23c83ad2c x86/reboot: VMCLEAR active VMCSes before emergency reboot
VMCLEAR active VMCSes before any emergency reboot, not just if the kernel
may kexec into a new kernel after a crash.  Per Intel's SDM, the VMX
architecture doesn't require the CPU to flush the VMCS cache on INIT.  If
an emergency reboot doesn't RESET CPUs, cached VMCSes could theoretically
be kept and only be written back to memory after the new kernel is booted,
i.e. could effectively corrupt memory after reboot.

Opportunistically remove the setting of the global pointer to NULL to make
checkpatch happy.

Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>
Link: https://lore.kernel.org/r/20230721201859.2307736-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-03 15:37:14 -07:00
Sean Christopherson
41e90a69a4 KVM: x86: Retry APIC optimized map recalc if vCPU is added/enabled
Retry the optimized APIC map recalculation if an APIC-enabled vCPU shows
up between allocating the map and filling in the map data.  Conditionally
reschedule before retrying even though the number of vCPUs that can be
created is bounded by KVM.  Retrying a few thousand times isn't so slow
as to be hugely problematic, but it's not blazing fast either.

Reset xapic_id_mistach on each retry as a vCPU could change its xAPIC ID
between loops, but do NOT reset max_id.  The map size also factors in
whether or not a vCPU's local APIC is hardware-enabled, i.e. userspace
and/or the guest can theoretically keep KVM retrying indefinitely.  The
only downside is that KVM will allocate more memory than is strictly
necessary if the vCPU with the highest x2APIC ID disabled its APIC while
the recalculation was in-progress.

Refresh kvm->arch.apic_map_dirty to opportunistically change it from
DIRTY => UPDATE_IN_PROGRESS to avoid an unnecessary recalc from a
different task, i.e. if another task is waiting to attempt an update
(which is likely since a retry happens if and only if an update is
required).

Link: https://lore.kernel.org/r/20230602233250.1014316-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:48:44 -07:00
Sean Christopherson
550ba57faa KVM: VMX: Drop unnecessary vmx_fb_clear_ctrl_available "cache"
Now that KVM snapshots the host's MSR_IA32_ARCH_CAPABILITIES, drop the
similar snapshot/cache of whether or not KVM is allowed to manipulate
MSR_IA32_MCU_OPT_CTRL.FB_CLEAR_DIS.  The motivation for the cache was
presumably to avoid the RDMSR, e.g. boot_cpu_has_bug() is quite cheap, and
modifying the vCPU's MSR_IA32_ARCH_CAPABILITIES is an infrequent option
and a relatively slow path.

Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230607004311.1420507-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:37:30 -07:00
Sean Christopherson
a2fd5d02ba KVM: x86: Snapshot host's MSR_IA32_ARCH_CAPABILITIES
Snapshot the host's MSR_IA32_ARCH_CAPABILITIES, if it's supported, instead
of reading the MSR every time KVM wants to query the host state, e.g. when
initializing the default value during vCPU creation.  The paths that query
ARCH_CAPABILITIES aren't particularly performance sensitive, but creating
vCPUs is a frequent enough operation that burning 8 bytes is a good
trade-off.

Alternatively, KVM could add a field in kvm_caps and thus skip the
on-demand calculations entirely, but a pure snapshot isn't possible due to
the way KVM handles the l1tf_vmx_mitigation module param.  And unlike the
other "supported" fields in kvm_caps, KVM doesn't enforce the "supported"
value, i.e. KVM treats ARCH_CAPABILITIES like a CPUID leaf and lets
userspace advertise whatever it wants.  Those problems are solvable, but
it's not clear there is real benefit versus snapshotting the host value,
and grabbing the host value will allow additional cleanup of KVM's
FB_CLEAR_CTRL code.

Link: https://lore.kernel.org/all/20230524061634.54141-2-chao.gao@intel.com
Cc: Chao Gao <chao.gao@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230607004311.1420507-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 16:37:26 -07:00
Takahiro Itazuri
af8e2ccfa6 KVM: x86: Advertise host CPUID 0x80000005 in KVM_GET_SUPPORTED_CPUID
Advertise CPUID 0x80000005 (L1 cache and TLB info) to userspace so that
VMMs that reflect KVM_GET_SUPPORTED_CPUID into KVM_SET_CPUID2 will
enumerate sane cache/TLB information to the guest.

CPUID 0x80000006 (L2 cache and TLB and L3 cache info) has been returned
since commit 43d05de2be ("KVM: pass through CPUID(0x80000006)").
Enumerating both 0x80000005 and 0x80000006 with KVM_GET_SUPPORTED_CPUID
is better than reporting one or the other, and 0x80000005 could be helpful
for VMM to pass it to KVM_SET_CPUID{,2} for the same reason with
0x80000006.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Link: https://lore.kernel.org/all/ZK7NmfKI9xur%2FMop@google.com
Link: https://lore.kernel.org/r/20230712183136.85561-1-itazur@amazon.com
[sean: add link, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 15:49:23 -07:00
Michal Luczaj
7f717f5484 KVM: x86: Remove x86_emulate_ops::guest_has_long_mode
Remove x86_emulate_ops::guest_has_long_mode along with its implementation,
emulator_guest_has_long_mode(). It has been unused since commit
1d0da94cda ("KVM: x86: do not go through ctxt->ops when emulating rsm").

No functional change intended.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230718101809.1249769-1-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-02 15:47:27 -07:00
Like Xu
1d6664fadd KVM: x86: Use sysfs_emit() instead of sprintf()
Use sysfs_emit() instead of the sprintf() for sysfs entries. sysfs_emit()
knows the maximum of the temporary buffer used for outputting sysfs
content and avoids overrunning the buffer length.

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230625073438.57427-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-07-31 13:55:26 -07:00
Linus Torvalds
1667e630c2 - Fix a lockdep warning when the event given is the first one, no event
group exists yet but the code still goes and iterates over event
   siblings
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmS0O2cACgkQEsHwGGHe
 VUrCsRAAq+sdCTlD9FEyhm8LkAYa7A1IhXqo0sO1DrVt/gwqj+I9xtxBRu3tEI3d
 IzwzNQoWoPW59frdGtXi7R9hJUrHKFh+FQ6l/rPwWCwC3CP56SVg0UTLkIPylrVZ
 WpZ5DU5Sc3n8cHusINGgdG51h0/H8aJx3WEFPfND0ydt4gzD14rnq+nQLU8DCfxB
 /1UHZu7wWdNey9cqO/KDgajiCuO26OyGBCO2y5rmL6/UkT7mbO3UR+NusZrFyUCI
 IoUaWPs2NtZmWGxyh3XkkcJLUBWVITYhMZdHGzJqDp7J2A7t213+q1R4X9f+Kiq7
 6nJEAUH0fwodjkJN9GUJGaite+umn7R2W7+OQ3Qigz3hrIMIai9f1wfnnoYo9auH
 vSGvYl3b4v8A+eyZLCQC4qJg5ekfkgxR2LXck6qv9PKtDamjNRMZEUhPFknsvTWg
 Yn29rFq2zZlUCLdTbR+z/dlHEQRxe8FOo5V4+YtWsDMZcYsnvcULb4XQPq6EYHAi
 BDs1iCMWR7uVer8Duq7o/RKbeE3hQwLFfm+SqjYxn6sHH2NcE9OKi+rr6UPkOh27
 gZzBPLlP7SLXTBuqLeSHiczDXochUvFGF7gC+2mZ8/jNP023OMkrHJZyoNyuj8sZ
 qSGk9g3zFCtyQCfsgw01pDuRfSs4Y3MZmzsxI3/mUzbK/KzTXOE=
 =KqCi
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Fix a lockdep warning when the event given is the first one, no event
   group exists yet but the code still goes and iterates over event
   siblings

* tag 'perf_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix lockdep warning in for_each_sibling_event() on SPR
2023-07-16 13:46:08 -07:00
Linus Torvalds
b6e6cc1f78 Fix kCFI/FineIBT weaknesses
The primary bug Alyssa noticed was that with FineIBT enabled function
 prologues have a spurious ENDBR instruction:
 
   __cfi_foo:
 	endbr64
 	subl	$hash, %r10d
 	jz	1f
 	ud2
 	nop
   1:
   foo:
 	endbr64 <--- *sadface*
 
 This means that any indirect call that fails to target the __cfi symbol
 and instead targets (the regular old) foo+0, will succeed due to that
 second ENDBR.
 
 Fixing this lead to the discovery of a single indirect call that was
 still doing this: ret_from_fork(), since that's an assembly stub the
 compmiler would not generate the proper kCFI indirect call magic and it
 would not get patched.
 
 Brian came up with the most comprehensive fix -- convert the thing to C
 with only a very thin asm wrapper. This ensures the kernel thread
 boostrap is a proper kCFI call.
 
 While discussing all this, Kees noted that kCFI hashes could/should be
 poisoned to seal all functions whose address is never taken, further
 limiting the valid kCFI targets -- much like we already do for IBT.
 
 So what was a 'simple' observation and fix cascaded into a bunch of
 inter-related CFI infrastructure fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmSxr64VHHBldGVyekBp
 bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6L7kQAIjDWbxqVtmiBiz+IBcWcsxt7BXX
 pRBaSe/eBp3KLhqgzYUY0mXIi0ua7y3CBtW4SdQUSPsAKtCgBUuq2JjQWToRghjN
 4ndCky4oxb9z8ADr/R/qfU8ZpSOwoX3kgBHqyjcQ0fQsg/DFKs3sWKqluwT0PtvU
 vLYAw2QKSv56NG/u3CujWPdcIWgzJ+M3214xuqIWCTwEcqdP+xkXmQstkXkyPQ6d
 XE0iG/wo9uiX4icfsRVp8JL0TkzNqGJfgr9Mv1rBKT4wbT64zKI6RyMJVlUS0yrk
 1jeDgNbVfx4ZpvtHmTsQn1jogWI3pqGkqoPwHqJSFg42Eer5OSodH/uVd3HK/0tD
 1nlhCfue6zc4smu480064s3fWAE7kC6ySdmijQXOJo3YWVGdagxVp/CSE4Ek0TFq
 y+CltNEA6bthKImWg8GFWxS8bMnuZv2joJ8yhgfpnG5sppVOYs2HJ3ipIks9sZjO
 o65auDeOkGg1+NhgDx+2uay6/fbxTNjbAyjV4HttkN70SO5kTTT4zWyh2PLwXaTy
 wv0B4i0laxTRU7boIA4nFJAKz5xKfyh9e2idxbmPlrV5FY4mEPA2oLeWsn8cS4VG
 0SWJ30ky7C4r7VWd9DWhGcCRcrlCvCM8LdjwzImZHXRQ2KweEuGMmrXYtHCrTRZn
 IMijS/9q653h9ws7
 =RhPI
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CFI fixes from Peter Zijlstra:
 "Fix kCFI/FineIBT weaknesses

  The primary bug Alyssa noticed was that with FineIBT enabled function
  prologues have a spurious ENDBR instruction:

    __cfi_foo:
	endbr64
	subl	$hash, %r10d
	jz	1f
	ud2
	nop
    1:
    foo:
	endbr64 <--- *sadface*

  This means that any indirect call that fails to target the __cfi
  symbol and instead targets (the regular old) foo+0, will succeed due
  to that second ENDBR.

  Fixing this led to the discovery of a single indirect call that was
  still doing this: ret_from_fork(). Since that's an assembly stub the
  compiler would not generate the proper kCFI indirect call magic and it
  would not get patched.

  Brian came up with the most comprehensive fix -- convert the thing to
  C with only a very thin asm wrapper. This ensures the kernel thread
  boostrap is a proper kCFI call.

  While discussing all this, Kees noted that kCFI hashes could/should be
  poisoned to seal all functions whose address is never taken, further
  limiting the valid kCFI targets -- much like we already do for IBT.

  So what was a 'simple' observation and fix cascaded into a bunch of
  inter-related CFI infrastructure fixes"

* tag 'x86_urgent_for_6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y
  x86/fineibt: Poison ENDBR at +0
  x86: Rewrite ret_from_fork() in C
  x86/32: Remove schedule_tail_wrapper()
  x86/cfi: Extend ENDBR sealing to kCFI
  x86/alternative: Rename apply_ibt_endbr()
  x86/cfi: Extend {JMP,CAKK}_NOSPEC comment
2023-07-14 20:19:25 -07:00
Linus Torvalds
ebc27aacee Tracing fixes and clean ups:
- Fix some missing-prototype warnings
 
 - Fix user events struct args (did not include size of struct)
   When creating a user event, the "struct" keyword is to denote
   that the size of the field will be passed in. But the parsing
   failed to handle this case.
 
 - Add selftest to struct sizes for user events
 
 - Fix sample code for direct trampolines.
   The sample code for direct trampolines attached to handle_mm_fault().
   But the prototype changed and the direct trampoline sample code
   was not updated. Direct trampolines needs to have the arguments correct
   otherwise it can fail or crash the system.
 
 - Remove unused ftrace_regs_caller_ret() prototype.
 
 - Quiet false positive of FORTIFY_SOURCE
   Due to backward compatibility, the structure used to save stack traces
   in the kernel had a fixed size of 8. This structure is exported to
   user space via the tracing format file. A change was made to allow
   more than 8 functions to be recorded, and user space now uses the
   size field to know how many functions are actually in the stack.
   But the structure still has size of 8 (even though it points into
   the ring buffer that has the required amount allocated to hold a
   full stack. This was fine until the fortifier noticed that the
   memcpy(&entry->caller, stack, size) was greater than the 8 functions
   and would complain at runtime about it. Hide this by using a pointer
   to the stack location on the ring buffer instead of using the address
   of the entry structure caller field.
 
 - Fix a deadloop in reading trace_pipe that was caused by a mismatch
   between ring_buffer_empty() returning false which then asked to
   read the data, but the read code uses rb_num_of_entries() that
   returned zero, and causing a infinite "retry".
 
 - Fix a warning caused by not using all pages allocated to store
   ftrace functions, where this can happen if the linker inserts a bunch of
   "NULL" entries, causing the accounting of how many pages needed
   to be off.
 
 - Fix histogram synthetic event crashing when the start event is
   removed and the end event is still using a variable from it.
 
 - Fix memory leak in freeing iter->temp in tracing_release_pipe()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZLBF6hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qkswAP4mhdoFFfNosM7+Sh/R4t31IxKZApm9
 M2Hf9jgvJ7b65AD/VV1XfO6skw2+5Yn9S4UyNE2MQaYxPwWpONcNFUzZ3Q8=
 =Nb+7
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix some missing-prototype warnings

 - Fix user events struct args (did not include size of struct)

   When creating a user event, the "struct" keyword is to denote that
   the size of the field will be passed in. But the parsing failed to
   handle this case.

 - Add selftest to struct sizes for user events

 - Fix sample code for direct trampolines.

   The sample code for direct trampolines attached to handle_mm_fault().
   But the prototype changed and the direct trampoline sample code was
   not updated. Direct trampolines needs to have the arguments correct
   otherwise it can fail or crash the system.

 - Remove unused ftrace_regs_caller_ret() prototype.

 - Quiet false positive of FORTIFY_SOURCE

   Due to backward compatibility, the structure used to save stack
   traces in the kernel had a fixed size of 8. This structure is
   exported to user space via the tracing format file. A change was made
   to allow more than 8 functions to be recorded, and user space now
   uses the size field to know how many functions are actually in the
   stack.

   But the structure still has size of 8 (even though it points into the
   ring buffer that has the required amount allocated to hold a full
   stack.

   This was fine until the fortifier noticed that the
   memcpy(&entry->caller, stack, size) was greater than the 8 functions
   and would complain at runtime about it.

   Hide this by using a pointer to the stack location on the ring buffer
   instead of using the address of the entry structure caller field.

 - Fix a deadloop in reading trace_pipe that was caused by a mismatch
   between ring_buffer_empty() returning false which then asked to read
   the data, but the read code uses rb_num_of_entries() that returned
   zero, and causing a infinite "retry".

 - Fix a warning caused by not using all pages allocated to store ftrace
   functions, where this can happen if the linker inserts a bunch of
   "NULL" entries, causing the accounting of how many pages needed to be
   off.

 - Fix histogram synthetic event crashing when the start event is
   removed and the end event is still using a variable from it

 - Fix memory leak in freeing iter->temp in tracing_release_pipe()

* tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix memory leak of iter->temp when reading trace_pipe
  tracing/histograms: Add histograms to hist_vars if they have referenced variables
  tracing: Stop FORTIFY_SOURCE complaining about stack trace caller
  ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
  ring-buffer: Fix deadloop issue on reading trace_pipe
  tracing: arm64: Avoid missing-prototype warnings
  selftests/user_events: Test struct size match cases
  tracing/user_events: Fix struct arg size match check
  x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
  arm64: ftrace: Add direct call trampoline samples support
  samples: ftrace: Save required argument registers in sample trampolines
2023-07-13 13:44:28 -07:00
Linus Torvalds
1599932894 xen: branch for v6.5-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZK/pZgAKCRCAXGG7T9hj
 vmQlAQD/xi8BUlCe0a7l6kf7+nMkOWmvpVIrmdxrqQ1Wj4c9FAEA0FuI+XXz2sow
 ov+il7z3UnViGsieeSHTW+Gxdn6Blgc=
 =LzAo
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - a cleanup of the Xen related ELF-notes

 - a fix for virtio handling in Xen dom0 when running Xen in a VM

* tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parent
  x86/Xen: tidy xen-head.S
2023-07-13 13:39:36 -07:00
Ingo Molnar
535d0ae391 x86/cfi: Only define poison_cfi() if CONFIG_X86_KERNEL_IBT=y
poison_cfi() was introduced in:

  9831c6253a ("x86/cfi: Extend ENDBR sealing to kCFI")

... but it's only ever used under CONFIG_X86_KERNEL_IBT=y,
and if that option is disabled, we get:

  arch/x86/kernel/alternative.c:1243:13: error: ‘poison_cfi’ defined but not used [-Werror=unused-function]

Guard the definition with CONFIG_X86_KERNEL_IBT.

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-11 10:17:55 +02:00
YueHaibing
b599b06544 x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret()
This is now unused, so can remove it.

Link: https://lore.kernel.org/linux-trace-kernel/20230623091640.21952-1-yuehaibing@huawei.com

Cc: <mark.rutland@arm.com>
Cc: <tglx@linutronix.de>
Cc: <mingo@redhat.com>
Cc: <bp@alien8.de>
Cc: <dave.hansen@linux.intel.com>
Cc: <x86@kernel.org>
Cc: <hpa@zytor.com>
Cc: <peterz@infradead.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-10 21:38:13 -04:00
Peter Zijlstra
04505bbbbb x86/fineibt: Poison ENDBR at +0
Alyssa noticed that when building the kernel with CFI_CLANG+IBT and
booting on IBT enabled hardware to obtain FineIBT, the indirect
functions look like:

  __cfi_foo:
	endbr64
	subl	$hash, %r10d
	jz	1f
	ud2
	nop
  1:
  foo:
	endbr64

This is because the compiler generates code for kCFI+IBT. In that case
the caller does the hash check and will jump to +0, so there must be
an ENDBR there. The compiler doesn't know about FineIBT at all; also
it is possible to actually use kCFI+IBT when booting with 'cfi=kcfi'
on IBT enabled hardware.

Having this second ENDBR however makes it possible to elide the CFI
check. Therefore, we should poison this second ENDBR when switching to
FineIBT mode.

Fixes: 931ab63664 ("x86/ibt: Implement FineIBT")
Reported-by: "Milburn, Alyssa" <alyssa.milburn@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20230615193722.194131053@infradead.org
2023-07-10 09:52:25 +02:00
Brian Gerst
3aec4ecb3d x86: Rewrite ret_from_fork() in C
When kCFI is enabled, special handling is needed for the indirect call
to the kernel thread function.  Rewrite the ret_from_fork() function in
C so that the compiler can properly handle the indirect call.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20230623225529.34590-3-brgerst@gmail.com
2023-07-10 09:52:25 +02:00