Commit Graph

39311 Commits

Author SHA1 Message Date
Masahiro Yamada
6d61b8e66d x86/build: Remove the left-over bzlilo target
Commit

  f279b49f13 ("x86/boot: Modernize genimage script; hdimage+EFI support")

removed the bzlilo target from arch/x86/boot/Makefile.

Remove the left-over from arch/x86/Makefile.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210729140023.442101-1-masahiroy@kernel.org
2021-08-25 11:49:21 +02:00
Jing Yangyang
5b3fd8aa5d x86/kaslr: Have process_mem_region() return a boolean
Fix the following coccicheck warning:

  ./arch/x86/boot/compressed/kaslr.c:671:10-11:WARNING:return of 0/1
  in function 'process_mem_region' with return type bool

Generated by: scripts/coccinelle/misc/boolreturn.cocci

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Jing Yangyang <jing.yangyang@zte.com.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210824070515.61065-1-deng.changcheng@zte.com.cn
2021-08-24 10:54:15 +02:00
Borislav Petkov
3bff147b18 x86/mce: Defer processing of early errors
When a fatal machine check results in a system reset, Linux does not
clear the error(s) from machine check bank(s) - hardware preserves the
machine check banks across a warm reset.

During initialization of the kernel after the reboot, Linux reads, logs,
and clears all machine check banks.

But there is a problem. In:

  5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")

the call to mce_register_decode_chain() moved later in the boot
sequence. This means that /dev/mcelog doesn't see those early error
logs.

This was partially fixed by:

  cd9c57cad3 ("x86/MCE: Dump MCE to dmesg if no consumers")

which made sure that the logs were not lost completely by printing
to the console. But parsing console logs is error prone. Users of
/dev/mcelog should expect to find any early errors logged to standard
places.

Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early
machine check initialization to indicate that any errors found should
just be queued to genpool. When mcheck_late_init() is called it will
call mce_schedule_work() to actually log and flush any errors queued in
the genpool.

 [ Based on an original patch, commit message by and completely
   productized by Tony Luck. ]

Fixes: 5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
Reported-by: Sumanth Kamatala <skamatala@juniper.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
2021-08-24 10:40:58 +02:00
Borislav Petkov
03dca99e20 x86/tools/relocs: Mark die() with the printf function attr format
Mark die() as a function which accepts printf-style arguments so that
the compiler can typecheck them against the supplied format string.

Use the C99 inttypes.h format specifiers as relocs.c gets built for both
32- and 64-bit.

Original version of the patch by Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/YNnb6Q4QHtNYC049@zn.tnic
2021-08-23 05:58:02 +02:00
Nick Desaulniers
989ceac799 x86/build: Remove stale cc-option checks
cc-option, __cc-option, cc-option-yn, and cc-disable-warning all invoke
the compiler during build time, and can slow down the build when these
checks become stale for our supported compilers, whose minimally
supported versions increases over time.

See Documentation/process/changes.rst for the current supported minimal
versions (GCC 4.9+, clang 10.0.1+). Compiler version support for these
flags may be verified on godbolt.org.

The following flags are supported by all supported versions of GCC and
Clang. Remove their cc-option, __cc-option, and cc-option-yn tests.

  -Wno-address-of-packed-member
  -mno-avx
  -m32
  -mno-80387
  -march=k8
  -march=nocona
  -march=core2
  -march=atom
  -mtune=generic
  -mfentry

[ mingo: Fixed regression on GCC, via partial revert of the stack-boundary changes.  ]

Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1436
Link: https://lkml.kernel.org/r/20210812183848.1519994-1-ndesaulniers@google.com
2021-08-22 10:24:27 +02:00
Babu Moger
527f721478 x86/resctrl: Fix a maybe-uninitialized build warning treated as error
The recent commit

  064855a690 ("x86/resctrl: Fix default monitoring groups reporting")

caused a RHEL build failure with an uninitialized variable warning
treated as an error because it removed the default case snippet.

The RHEL Makefile uses '-Werror=maybe-uninitialized' to force possibly
uninitialized variable warnings to be treated as errors. This is also
reported by smatch via the 0day robot.

The error from the RHEL build is:

  arch/x86/kernel/cpu/resctrl/monitor.c: In function ‘__mon_event_count’:
  arch/x86/kernel/cpu/resctrl/monitor.c:261:12: error: ‘m’ may be used
  uninitialized in this function [-Werror=maybe-uninitialized]
    m->chunks += chunks;
              ^~

The upstream Makefile does not build using '-Werror=maybe-uninitialized'.
So, the problem is not seen there. Fix the problem by putting back the
default case snippet.

 [ bp: note that there's nothing wrong with the code and other compilers
   do not trigger this warning - this is being done just so the RHEL compiler
   is happy. ]

Fixes: 064855a690 ("x86/resctrl: Fix default monitoring groups reporting")
Reported-by: Terry Bowman <Terry.Bowman@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/162949631908.23903.17090272726012848523.stgit@bmoger-ubuntu
2021-08-22 09:11:29 +02:00
Joerg Roedel
22aa45cb46 x86/efi: Restore Firmware IDT before calling ExitBootServices()
Commit

  79419e13e8 ("x86/boot/compressed/64: Setup IDT in startup_32 boot path")

introduced an IDT into the 32-bit boot path of the decompressor stub.
But the IDT is set up before ExitBootServices() is called, and some UEFI
firmwares rely on their own IDT.

Save the firmware IDT on boot and restore it before calling into EFI
functions to fix boot failures introduced by above commit.

Fixes: 79419e13e8 ("x86/boot/compressed/64: Setup IDT in startup_32 boot path")
Reported-by: Fabio Aiuto <fabioaiuto83@gmail.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: stable@vger.kernel.org # 5.13+
Link: https://lkml.kernel.org/r/20210820125703.32410-1-joro@8bytes.org
2021-08-21 17:57:04 +02:00
Wei Huang
43e540cc9f KVM: SVM: Add 5-level page table support for SVM
When the 5-level page table is enabled on host OS, the nested page table
for guest VMs must use 5-level as well. Update get_npt_level() function
to reflect this requirement. In the meanwhile, remove the code that
prevents kvm-amd driver from being loaded when 5-level page table is
detected.

Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-4-wei.huang2@amd.com>
[Tweak condition as suggested by Sean. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:07:56 -04:00
Wei Huang
cb0f722aff KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host
When the 5-level page table CPU flag is set in the host, but the guest
has CR4.LA57=0 (including the case of a 32-bit guest), the top level of
the shadow NPT page tables will be fixed, consisting of one pointer to
a lower-level table and 511 non-present entries.  Extend the existing
code that creates the fixed PML4 or PDP table, to provide a fixed PML5
table if needed.

This is not needed on EPT because the number of layers in the tables
is specified in the EPTP instead of depending on the host CR4.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-3-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:07:48 -04:00
Wei Huang
746700d21f KVM: x86: Allow CPU to force vendor-specific TDP level
AMD future CPUs will require a 5-level NPT if host CR4.LA57 is set.
To prevent kvm_mmu_get_tdp_level() from incorrectly changing NPT level
on behalf of CPUs, add a new parameter in kvm_configure_mmu() to force
a fixed TDP level.

Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:44 -04:00
Paolo Bonzini
ec607a564f KVM: x86: clamp host mapping level to max_level in kvm_mmu_max_mapping_level
This change started as a way to make kvm_mmu_hugepage_adjust a bit simpler,
but it does fix two bugs as well.

One bug is in zapping collapsible PTEs.  If a large page size is
disallowed but not all of them, kvm_mmu_max_mapping_level will return the
host mapping level and the small PTEs will be zapped up to that level.
However, if e.g. 1GB are prohibited, we can still zap 4KB mapping and
preserve the 2MB ones. This can happen for example when NX huge pages
are in use.

The second would happen when userspace backs guest memory
with a 1gb hugepage but only assign a subset of the page to
the guest.  1gb pages would be disallowed by the memslot, but
not 2mb.  kvm_mmu_max_mapping_level() would fall through to the
host_pfn_mapping_level() logic, see the 1gb hugepage, and map the whole
thing into the guest.

Fixes: 2f57b7051f ("KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:41 -04:00
Maxim Levitsky
61e5f69ef0 KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts
while running.

This change is mostly intended for more robust single stepping
of the guest and it has the following benefits when enabled:

* Resuming from a breakpoint is much more reliable.
  When resuming execution from a breakpoint, with interrupts enabled,
  more often than not, KVM would inject an interrupt and make the CPU
  jump immediately to the interrupt handler and eventually return to
  the breakpoint, to trigger it again.

  From the user point of view it looks like the CPU never executed a
  single instruction and in some cases that can even prevent forward
  progress, for example, when the breakpoint is placed by an automated
  script (e.g lx-symbols), which does something in response to the
  breakpoint and then continues the guest automatically.
  If the script execution takes enough time for another interrupt to
  arrive, the guest will be stuck on the same breakpoint RIP forever.

* Normal single stepping is much more predictable, since it won't
  land the debugger into an interrupt handler.

* RFLAGS.TF has less chance to be leaked to the guest:

  We set that flag behind the guest's back to do single stepping
  but if single step lands us into an interrupt/exception handler
  it will be leaked to the guest in the form of being pushed
  to the stack.
  This doesn't completely eliminate this problem as exceptions
  can still happen, but at least this reduces the chances
  of this happening.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:37 -04:00
Maxim Levitsky
7a4bca85b2 KVM: SVM: split svm_handle_invalid_exit
Split the check for having a vmexit handler to svm_check_exit_valid,
and make svm_handle_invalid_exit only handle a vmexit that is
already not valid.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210811122927.900604-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:37 -04:00
Sean Christopherson
9653f2da75 KVM: x86/mmu: Drop 'shared' param from tdp_mmu_link_page()
Drop @shared from tdp_mmu_link_page() and hardcode it to work for
mmu_lock being held for read.  The helper has exactly one caller and
in all likelihood will only ever have exactly one caller.  Even if KVM
adds a path to install translations without an initiating page fault,
odds are very, very good that the path will just be a wrapper to the
"page fault" handler (both SNP and TDX RFCs propose patches to do
exactly that).

No functional change intended.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810224554.2978735-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:35 -04:00
Mingwei Zhang
71f51d2c32 KVM: x86/mmu: Add detailed page size stats
Existing KVM code tracks the number of large pages regardless of their
sizes. Therefore, when large page of 1GB (or larger) is adopted, the
information becomes less useful because lpages counts a mix of 1G and 2M
pages.

So remove the lpages since it is easy for user space to aggregate the info.
Instead, provide a comprehensive page stats of all sizes from 4K to 512G.

Suggested-by: Ben Gardon <bgardon@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20210803044607.599629-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:34 -04:00
Sean Christopherson
088acd2352 KVM: x86/mmu: Avoid collision with !PRESENT SPTEs in TDP MMU lpage stats
Factor in whether or not the old/new SPTEs are shadow-present when
adjusting the large page stats in the TDP MMU.  A modified MMIO SPTE can
toggle the page size bit, as bit 7 is used to store the MMIO generation,
i.e. is_large_pte() can get a false positive when called on a MMIO SPTE.
Ditto for nuking SPTEs with REMOVED_SPTE, which sets bit 7 in its magic
value.

Opportunistically move the logic below the check to verify at least one
of the old/new SPTEs is shadow present.

Use is/was_leaf even though is/was_present would suffice.  The code
generation is roughly equivalent since all flags need to be computed
prior to the code in question, and using the *_leaf flags will minimize
the diff in a future enhancement to account all pages, i.e. will change
the check to "is_leaf != was_leaf".

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>

Fixes: 1699f65c8b ("kvm/x86: Fix 'lpages' kvm stat for TDM MMU")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20210803044607.599629-3-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:34 -04:00
Mingwei Zhang
4293ddb788 KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte
Drop an unnecessary is_shadow_present_pte() check when updating the rmaps
after installing a non-MMIO SPTE.  set_spte() is used only to create
shadow-present SPTEs, e.g. MMIO SPTEs are handled early on, mmu_set_spte()
runs with mmu_lock held for write, i.e. the SPTE can't be zapped between
writing the SPTE and updating the rmaps.

Opportunistically combine the "new SPTE" logic for large pages and rmaps.

No functional change intended.

Suggested-by: Ben Gardon <bgardon@google.com>

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20210803044607.599629-2-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:34 -04:00
Jing Zhang
f95937ccf5 KVM: stats: Support linear and logarithmic histogram statistics
Add new types of KVM stats, linear and logarithmic histogram.
Histogram are very useful for observing the value distribution
of time or size related stats.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210802165633.1866976-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:32 -04:00
Maxim Levitsky
73143035c2 KVM: SVM: AVIC: drop unsupported AVIC base relocation code
APIC base relocation is not supported anyway and won't work
correctly so just drop the code that handles it and keep AVIC
MMIO bar at the default APIC base.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-17-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:31 -04:00
Maxim Levitsky
df7e4827c5 KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC
Currently it is possible to have the following scenario:

1. AVIC is disabled by svm_refresh_apicv_exec_ctrl
2. svm_vcpu_blocking calls avic_vcpu_put which does nothing
3. svm_vcpu_unblocking enables the AVIC (due to KVM_REQ_APICV_UPDATE)
   and then calls avic_vcpu_load
4. warning is triggered in avic_vcpu_load since
   AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK was never cleared

While it is possible to just remove the warning, it seems to be more robust
to fully disable/enable AVIC in svm_refresh_apicv_exec_ctrl by calling the
avic_vcpu_load/avic_vcpu_put

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-16-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:30 -04:00
Maxim Levitsky
bf5f6b9d7a KVM: SVM: move check for kvm_vcpu_apicv_active outside of avic_vcpu_{put|load}
No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-15-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:29 -04:00
Maxim Levitsky
06ef813466 KVM: SVM: avoid refreshing avic if its state didn't change
Since AVIC can be inhibited and uninhibited rapidly it is possible that
we have nothing to do by the time the svm_refresh_apicv_exec_ctrl
is called.

Detect and avoid this, which will be useful when we will start calling
avic_vcpu_load/avic_vcpu_put when the avic inhibition state changes.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-14-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:27 -04:00
Maxim Levitsky
30eed56a7e KVM: SVM: remove svm_toggle_avic_for_irq_window
Now that kvm_request_apicv_update doesn't need to drop the kvm->srcu lock,
we can call kvm_request_apicv_update directly.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-13-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:26 -04:00
Vitaly Kuznetsov
0f250a6463 KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
however, possible to track whether the feature was actually used by the
guest and only inhibit APICv/AVIC when needed.

TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
Windows know that AutoEOI feature should be avoided. While it's up to
KVM userspace to set the flag, KVM can help a bit by exposing global
APICv/AVIC enablement.

Maxim:
   - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
     since this feature can be used regardless of AVIC

Paolo:
   - use arch.apicv_update_lock to protect the hv->synic_auto_eoi_used
     instead of atomic ops

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-12-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:25 -04:00
Maxim Levitsky
4628efcd4e KVM: SVM: add warning for mistmatch between AVIC vcpu state and AVIC inhibition
It is never a good idea to enter a guest on a vCPU when the
AVIC inhibition state doesn't match the enablement of
the AVIC on the vCPU.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-11-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:24 -04:00
Maxim Levitsky
b0a1637f64 KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
Currently on SVM, the kvm_request_apicv_update toggles the APICv
memslot without doing any synchronization.

If there is a mismatch between that memslot state and the AVIC state,
on one of the vCPUs, an APIC mmio access can be lost:

For example:

VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
VCPU1: access an APIC mmio register.

Since AVIC is still disabled on VCPU1, the access will not be intercepted
by it, and neither will it cause MMIO fault, but rather it will just be
read/written from/to the dummy page mapped into the
APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.

Fix that by adding a lock guarding the AVIC state changes, and carefully
order the operations of kvm_request_apicv_update to avoid this race:

1. Take the lock
2. Send KVM_REQ_APICV_UPDATE
3. Update the apic inhibit reason
4. Release the lock

This ensures that at (2) all vCPUs are kicked out of the guest mode,
but don't yet see the new avic state.
Then only after (4) all other vCPUs can update their AVIC state and resume.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:23 -04:00
Maxim Levitsky
36222b117e KVM: x86: don't disable APICv memslot when inhibited
Thanks to the former patches, it is now possible to keep the APICv
memslot always enabled, and it will be invisible to the guest
when it is inhibited

This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:22 -04:00
Maxim Levitsky
9cc13d60ba KVM: x86/mmu: allow APICv memslot to be enabled but invisible
on AMD, APIC virtualization needs to dynamicaly inhibit the AVIC in a
response to some events, and this is problematic and not efficient to do by
enabling/disabling the memslot that covers APIC's mmio range.

Plus due to SRCU locking, it makes it more complex to
request AVIC inhibition.

Instead, the APIC memslot will be always enabled, but be invisible
to the guest, such as the MMU code will not install a SPTE for it,
when it is inhibited and instead jump straight to emulating the access.

When inhibiting the AVIC, this SPTE will be zapped.

This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:22 -04:00
Maxim Levitsky
8f32d5e563 KVM: x86/mmu: allow kvm_faultin_pfn to return page fault handling code
This will allow it to return RET_PF_EMULATE for APIC mmio
emulation.

This code is based on a patch from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:20 -04:00
Maxim Levitsky
33a5c0009d KVM: x86/mmu: rename try_async_pf to kvm_faultin_pfn
try_async_pf is a wrong name for this function, since this code
is used when asynchronous page fault is not enabled as well.

This code is based on a patch from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210810205251.424103-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:20 -04:00
Maxim Levitsky
edb298c663 KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range
This together with previous patch, ensures that
kvm_zap_gfn_range doesn't race with page fault
running on another vcpu, and will make this page fault code
retry instead.

This is based on a patch suggested by Sean Christopherson:
https://lkml.org/lkml/2021/7/22/1025

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:19 -04:00
Maxim Levitsky
88f585358b KVM: x86/mmu: add comment explaining arguments to kvm_zap_gfn_range
This comment makes it clear that the range of gfns that this
function receives is non inclusive.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:18 -04:00
Maxim Levitsky
2822da4466 KVM: x86/mmu: fix parameters to kvm_flush_remote_tlbs_with_address
kvm_flush_remote_tlbs_with_address expects (start gfn, number of pages),
and not (start gfn, end gfn)

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:16 -04:00
Sean Christopherson
5a324c24b6 Revert "KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock"
This together with the next patch will fix a future race between
kvm_zap_gfn_range and the page fault handler, which will happen
when AVIC memslot is going to be only partially disabled.

The performance impact is minimal since kvm_zap_gfn_range is only
called by users, update_mtrr() and kvm_post_set_cr0().

Both only use it if the guest has non-coherent DMA, in order to
honor the guest's UC memtype.

MTRR and CD setup only happens at boot, and generally in an area
where the page tables should be small (for CD) or should not
include the affected GFNs at all (for MTRRs).

This is based on a patch suggested by Sean Christopherson:
https://lkml.org/lkml/2021/7/22/1025

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:15 -04:00
Peter Xu
3bcd0662d6 KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file
Use this file to dump rmap statistic information.  The statistic is done by
calculating the rmap count and the result is log-2-based.

An example output of this looks like (idle 6GB guest, right after boot linux):

Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
Level=4K:       3086676 53045   12330   1272    502     121     76      2       0       0       0
Level=2M:       5947    231     0       0       0       0       0       0       0       0       0
Level=1G:       32      0       0       0       0       0       0       0       0       0       0

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-5-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:06:11 -04:00
Peter Xu
4139b1972a KVM: X86: Introduce kvm_mmu_slot_lpages() helpers
Introduce kvm_mmu_slot_lpages() to calculcate lpage_info and rmap array size.
The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather
than fetching from the memslot pointer.  Start to use the latter one in
kvm_alloc_memslot_metadata().

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20 16:04:51 -04:00
Jakub Kicinski
f444fea789 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/ptp/Kconfig:
  55c8fca1da ("ptp_pch: Restore dependency on PCI")
  e5f3155267 ("ethernet: fix PTP_1588_CLOCK dependencies")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-19 18:09:18 -07:00
Alexey Dobriyan
c0891ac15f isystem: ship and use stdarg.h
Ship minimal stdarg.h (1 type, 4 macros) as <linux/stdarg.h>.
stdarg.h is the only userspace header commonly used in the kernel.

GPL 2 version of <stdarg.h> can be extracted from
http://archive.debian.org/debian/pool/main/g/gcc-4.2/gcc-4.2_4.2.4.orig.tar.gz

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-08-19 09:02:55 +09:00
Linus Torvalds
02a3715449 Two nested virtualization fixes for AMD processors.
-----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmEabicUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroObjgf7BjKq8zyDbS1NyTW+FCaPQk5x4zy6
 0EI521qJNoWMU8p9O7B4EUFJsLr4Oq7mIanact6hPSmctdWa2CxGi/FRG5QQTIpQ
 8Tb2UyPPYn98OTLnM1SBfhuix4QnnYX73IRkklzCFE3Prg+XEjUoTzORVhhioC+k
 sD0cdYEJczsEQ9Boic+6LKNVGs7WHqsVWSruPoEPevZhvl5+RKrmbkJMxyB6G+xF
 EVLmxfuiU4BurRzACHBEghlAWbaqQUVHampCI0/ppH2Fb5RMviTJK0Xcbj04B7Gx
 NH6l5VHTBW7LXiAxcF+oNWLZlmhmWzsVSmw4P01ZuFXlohW3rtPm5WjUiw==
 =+7g8
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "Two nested virtualization fixes for AMD processors"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: nSVM: always intercept VMLOAD/VMSAVE when nested (CVE-2021-3656)
  KVM: nSVM: avoid picking up unsupported bits from L2 in int_ctl (CVE-2021-3653)
2021-08-16 06:23:26 -10:00
Masahiro Yamada
4aae683f13 tracing: Refactor TRACE_IRQFLAGS_SUPPORT in Kconfig
Make architectures select TRACE_IRQFLAGS_SUPPORT instead of
having many defines.

Link: https://lkml.kernel.org/r/20210731052233.4703-2-masahiroy@kernel.org

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>   #arch/arc
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:21 -04:00
Maxim Levitsky
c7dfa40099 KVM: nSVM: always intercept VMLOAD/VMSAVE when nested (CVE-2021-3656)
If L1 disables VMLOAD/VMSAVE intercepts, and doesn't enable
Virtual VMLOAD/VMSAVE (currently not supported for the nested hypervisor),
then VMLOAD/VMSAVE must operate on the L1 physical memory, which is only
possible by making L0 intercept these instructions.

Failure to do so allowed the nested guest to run VMLOAD/VMSAVE unintercepted,
and thus read/write portions of the host physical memory.

Fixes: 89c8a4984f ("KVM: SVM: Enable Virtual VMLOAD VMSAVE feature")

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-16 09:48:37 -04:00
Maxim Levitsky
0f923e0712 KVM: nSVM: avoid picking up unsupported bits from L2 in int_ctl (CVE-2021-3653)
* Invert the mask of bits that we pick from L2 in
  nested_vmcb02_prepare_control

* Invert and explicitly use VIRQ related bits bitmask in svm_clear_vintr

This fixes a security issue that allowed a malicious L1 to run L2 with
AVIC enabled, which allowed the L2 to exploit the uninitialized and enabled
AVIC to read/write the host physical memory at some offsets.

Fixes: 3d6368ef58 ("KVM: SVM: Add VMRUN handler")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-16 09:48:27 -04:00
Linus Torvalds
c4f14eac22 A set of fixes for PCI/MSI and x86 interrupt startup:
- Mask all MSI-X entries when enabling MSI-X otherwise stale unmasked
    entries stay around e.g. when a crashkernel is booted.
 
  - Enforce masking of a MSI-X table entry when updating it, which mandatory
    according to speification
 
  - Ensure that writes to MSI[-X} tables are flushed.
 
  - Prevent invalid bits being set in the MSI mask register
 
  - Properly serialize modifications to the mask cache and the mask register
    for multi-MSI.
 
  - Cure the violation of the affinity setting rules on X86 during interrupt
    startup which can cause lost and stale interrupts. Move the initial
    affinity setting ahead of actualy enabling the interrupt.
 
  - Ensure that MSI interrupts are completely torn down before freeing them
    in the error handling case.
 
  - Prevent an array out of bounds access in the irq timings code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEY5bcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaMvD/9KeK2430f4h/x/fQhHmHHIJOv3kqmB
 gRXX4RV+N/DfU9GSflbzPxY9l2SkydgpQjHeGnqpV7DYRIu84nVYAuWcWtPimHHy
 JxapniLlQv2GS+SIy9f1mmChH6VUPS05brHxKSqAQZvQIoZqza8vF3umZlV7eYF4
 uZFd86TCbDFsBxbsKmyV1FtQLo008EeEp8dtZ/1cZ9Fbp0M/mQkuu7aTNqY0qWwZ
 rAoGyE4PjDR+yf87XjE5z7hMs2vfUjiGXg7Kbp30NPKGcRyasb+SlHVKcvZKJIji
 Y0Bk/SOyqoj1Co3U+cEaWolB1MeGff4nP+Xx8xvyNklKxxs1+92Z7L1RElXIc0cL
 kmUehUSf5JuJ83B6ucAYbmnXKNw1XB00PaMy7iSxsYekTXJx+t0b+Rt6o0R3inWB
 xUWbIVmoL2uF1oOAb6mEc3wDNMBVkY33e9l2jD0PUPxKXZ730MVeojWJ8FGFiPOT
 9+aCRLjZHV5slVQAgLnlpcrseJLuUei6HLVwRXxv19Bz5L+HuAXUxWL9h74SRuE9
 14kH63aXSVDlcYyW7c3t8Lh6QjKAf7AIz0iG+u3n09IWyURd4agHuKOl5itileZB
 BK9NuRrNgmr2nEKG461Suc6GojLBXc1ih3ak+MG+O4iaLxnhapTjW3Weqr+OVXr+
 SrIjoxjpEk2ECA==
 =yf3u
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "A set of fixes for PCI/MSI and x86 interrupt startup:

   - Mask all MSI-X entries when enabling MSI-X otherwise stale unmasked
     entries stay around e.g. when a crashkernel is booted.

   - Enforce masking of a MSI-X table entry when updating it, which
     mandatory according to speification

   - Ensure that writes to MSI[-X} tables are flushed.

   - Prevent invalid bits being set in the MSI mask register

   - Properly serialize modifications to the mask cache and the mask
     register for multi-MSI.

   - Cure the violation of the affinity setting rules on X86 during
     interrupt startup which can cause lost and stale interrupts. Move
     the initial affinity setting ahead of actualy enabling the
     interrupt.

   - Ensure that MSI interrupts are completely torn down before freeing
     them in the error handling case.

   - Prevent an array out of bounds access in the irq timings code"

* tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  driver core: Add missing kernel doc for device::msi_lock
  genirq/msi: Ensure deactivation on teardown
  genirq/timings: Prevent potential array overflow in __irq_timings_store()
  x86/msi: Force affinity setup before startup
  x86/ioapic: Force affinity setup before startup
  genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
  PCI/MSI: Protect msi_desc::masked for multi-MSI
  PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
  PCI/MSI: Correct misleading comments
  PCI/MSI: Do not set invalid bits in MSI mask
  PCI/MSI: Enforce MSI[X] entry updates to be visible
  PCI/MSI: Enforce that MSI-X table entry is masked for update
  PCI/MSI: Mask all unused MSI-X entries
  PCI/MSI: Enable and mask MSI-X early
2021-08-15 06:49:40 -10:00
Linus Torvalds
b045b8cc86 - An objdump checker fix to ignore parenthesized strings in the objdump
version
 
 - Fix resctrl default monitoring groups reporting when new subgroups get
   created
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmEYyDsACgkQEsHwGGHe
 VUqYvw/7BhM3XR0xsjGnKAKTIofWNgpupqd/CIgcXwty9WKsJfLD0CMWnrvJXKi2
 NiyrGiJ3TKjgWajd7LAQzpVdq+YNgG4i5YY6Lvxc2VVgoccKQqpD0JfU9vT8m6cC
 kzSWV+dLs1ydhmgb+bxKqedrautaPjM7RN8/EAnv56mBUxlemD8WSx/rEnP9sgwF
 RE9teVSBuutMQj8lO238SJMN9AIF11Ti1ZIaHmuIKwjFTSLIETthE3o+Dhhq17gY
 vaP1uYFPlyh5tTJA0pa7wijoStPvZmdUzn5n2QQ5CJCkoDNXrmNEu7qS5SERbZBA
 U6jag/SNLwTkN2cA4Mmpb6HsA8r7vOhweovC9GgInnsyFiKAgZ1tUT7LbFQOUrhq
 QWQTrsews0xwhHLrv7r92mZf/W4cLoS0iEN9rinHiatb3Nr0/5ugDSgErw8scqLC
 JqjDCqy6Wm3NuRQhXoZfqid+WE/xN8BTfsbrQ7kuAqOV3NSVmm3K6XTSUTLtJ/C0
 x+Fj+W+4Q8UthQoW5WldsfnGLrKM4UjmXBQbM5o9fWW1L4gYIM6FD6uVqBZk5GAs
 bxuT4f1M/R3/5qdm9L69e4WPduyo53/+bJjmwA9DXLaKXvnFqkZikkV3+S3U+9/j
 pKhg+IfRZ4f1ymjH8sEwEsA037cmP3OzrIwF0vrQDIrWErL5h7Q=
 =Gg+d
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "Two fixes:

   - An objdump checker fix to ignore parenthesized strings in the
     objdump version

   - Fix resctrl default monitoring groups reporting when new subgroups
     get created"

* tag 'x86_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Fix default monitoring groups reporting
  x86/tools: Fix objdump version check again
2021-08-15 06:30:24 -10:00
Linus Torvalds
3e763ec791 ARM:
- Plug race between enabling MTE and creating vcpus
 
 - Fix off-by-one bug when checking whether an address range is RAM
 
 x86:
 
 - Fixes for the new MMU, especially a memory leak on hosts with <39
   physical address bits
 
 - Remove bogus EFER.NX checks on 32-bit non-PAE hosts
 
 - WAITPKG fix
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmEWjBwUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPrMgf9EDBsRvD/Kids0kddaoAgM6qICdsH
 tQX/GdsmecUlU16Bkp21XeZif1ZKcJxCmx/dhYmid3woi9HuX5AreFTlLjlJDRxg
 +lJvboqTV0kk7PjaYkOaqd42RSg/BiSLZ+JVPpbW7CqeIr1lGG4yhIC/Nl7fCCto
 sCaY/NoxtraoG5+WZcRRP7XptQmMRckVZ9bimHHh8dKqMkosGx1hcGfj64aKmx4F
 2EVrrjr+an3mpMnwvUIgNw4xEj/jUCFebvGAROVEsrZzNTZ9UrwgT0HeA92XwQVQ
 93z7nqcBUKHH11rnbOvRESEJD9f6I9vCSaiqRROwmoqLY/Xi7jly7XeDcA==
 =Lj8B
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "ARM:

   - Plug race between enabling MTE and creating vcpus

   - Fix off-by-one bug when checking whether an address range is RAM

  x86:

   - Fixes for the new MMU, especially a memory leak on hosts with <39
     physical address bits

   - Remove bogus EFER.NX checks on 32-bit non-PAE hosts

   - WAITPKG fix"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
  KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs
  KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF
  kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault
  KVM: x86: remove dead initialization
  KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels
  KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation
  KVM: arm64: Fix race when enabling KVM_ARM_CAP_MTE
  KVM: arm64: Fix off-by-one in range_is_memory
2021-08-15 06:21:30 -10:00
Jakub Kicinski
f4083a752a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h
  9e26680733 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp")
  9e518f2580 ("bnxt_en: 1PPS functions to configure TSIO pins")
  099fdeda65 ("bnxt_en: Event handler for PPS events")

kernel/bpf/helpers.c
include/linux/bpf-cgroup.h
  a2baf4e8bb ("bpf: Fix potentially incorrect results with bpf_get_local_storage()")
  c7603cfa04 ("bpf: Add ambient BPF runtime context stored in current")

drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
  5957cc557d ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray")
  2d0b41a376 ("net/mlx5: Refcount mlx5_irq with integer")

MAINTAINERS
  7b637cd52f ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo")
  7d901a1e87 ("net: phy: add Maxlinear GPY115/21x/24x driver")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-13 06:41:22 -07:00
Sean Christopherson
f7782bb8d8 KVM: nVMX: Unconditionally clear nested.pi_pending on nested VM-Enter
Clear nested.pi_pending on nested VM-Enter even if L2 will run without
posted interrupts enabled.  If nested.pi_pending is left set from a
previous L2, vmx_complete_nested_posted_interrupt() will pick up the
stale flag and exit to userspace with an "internal emulation error" due
the new L2 not having a valid nested.pi_desc.

Arguably, vmx_complete_nested_posted_interrupt() should first check for
posted interrupts being enabled, but it's also completely reasonable that
KVM wouldn't screw up a fundamental flag.  Not to mention that the mere
existence of nested.pi_pending is a long-standing bug as KVM shouldn't
move the posted interrupt out of the IRR until it's actually processed,
e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
with a "pending" posted interrupt.  Fixing the mess is a future problem.

Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
descriptor as an error, this was a benign bug as the null PI descriptor
effectively served as a check on PI not being enabled.  Even then, the
new flow did not become problematic until KVM started checking the result
of kvm_check_nested_events().

Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Fixes: 966eefb896 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810144526.2662272-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:17 -04:00
Like Xu
c1a527a1de KVM: x86: Clean up redundant ROL16(val, n) macro definition
The ROL16(val, n) macro is repeatedly defined in several vmcs-related
files, and it has never been used outside the KVM context.

Let's move it to vmcs.h without any intended functional changes.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20210809093410.59304-4-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:16 -04:00
Uros Bizjak
65297341d8 KVM: x86: Move declaration of kvm_spurious_fault() to x86.h
Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h,
it should never be called by anything other than low level KVM code.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:16 -04:00
Sean Christopherson
ad0577c375 KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot()
Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
VMX and SVM instructions use asm goto to handle the fault (or in the
case of VMREAD, completely custom logic).  Drop kvm_spurious_fault()'s
asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
flow that invoked it from assembly code.

Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:16 -04:00
Sean Christopherson
2fba4fc155 KVM: VMX: Hide VMCS control calculators in vmx.c
Now that nested VMX pulls KVM's desired VMCS controls from vmcs01 instead
of re-calculating on the fly, bury the helpers that do the calcluations
in vmx.c.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:15 -04:00
Sean Christopherson
b6247686b7 KVM: VMX: Drop caching of KVM's desired sec exec controls for vmcs01
Remove the secondary execution controls cache now that it's effectively
dead code; it is only read immediately after it is written.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:15 -04:00
Sean Christopherson
389ab25216 KVM: nVMX: Pull KVM L0's desired controls directly from vmcs01
When preparing controls for vmcs02, grab KVM's desired controls from
vmcs01's shadow state instead of recalculating the controls from scratch,
or in the secondary execution controls, instead of using the dedicated
cache.  Calculating secondary exec controls is eye-poppingly expensive
due to the guest CPUID checks, hence the dedicated cache, but the other
calculations aren't exactly free either.

Explicitly clear several bits (x2APIC, DESC exiting, and load EFER on
exit) as appropriate as they may be set in vmcs01, whereas the previous
implementation relied on dynamic bits being cleared in the calculator.

Intentionally propagate VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL from
vmcs01 to vmcs02.  Whether or not PERF_GLOBAL_CTRL is loaded depends on
whether or not perf itself is active, so unless perf stops between the
exit from L1 and entry to L2, vmcs01 will hold the desired value.  This
is purely an optimization as atomic_switch_perf_msrs() will set/clear
the control as needed at VM-Enter, i.e. it avoids two extra VMWRITEs in
the case where perf is active (versus starting with the bits clear in
vmcs02, which was the previous behavior).

Cc: Zeng Guang <guang.zeng@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:15 -04:00
Paolo Bonzini
1ccb6f983a KVM: VMX: Reset DR6 only when KVM_DEBUGREG_WONT_EXIT
The commit efdab99281 ("KVM: x86: fix escape of guest dr6 to the host")
fixed a bug by resetting DR6 unconditionally when the vcpu being scheduled out.

But writing to debug registers is slow, and it can be visible in perf results
sometimes, even if neither the host nor the guest activate breakpoints.

Since KVM_DEBUGREG_WONT_EXIT on Intel processors is the only case
where DR6 gets the guest value, and it never happens at all on SVM,
the register can be cleared in vmx.c right after reading it.

Reported-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Paolo Bonzini
375e28ffc0 KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT
Commit c77fb5fe6f ("KVM: x86: Allow the guest to run with dirty debug
registers") allows the guest accessing to DRs without exiting when
KVM_DEBUGREG_WONT_EXIT and we need to ensure that they are synchronized
on entry to the guest---including DR6 that was not synced before the commit.

But the commit sets the hardware DR6 not only when KVM_DEBUGREG_WONT_EXIT,
but also when KVM_DEBUGREG_BP_ENABLED.  The second case is unnecessary
and just leads to a more case which leaks stale DR6 to the host which has
to be resolved by unconditionally reseting DR6 in kvm_arch_vcpu_put().

Even if KVM_DEBUGREG_WONT_EXIT, however, setting the host DR6 only matters
on VMX because SVM always uses the DR6 value from the VMCB.  So move this
line to vmx.c and make it conditional on KVM_DEBUGREG_WONT_EXIT.

Reported-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Lai Jiangshan
34e9f86007 KVM: X86: Remove unneeded KVM_DEBUGREG_RELOAD
Commit ae561edeb4 ("KVM: x86: DR0-DR3 are not clear on reset") added code to
ensure eff_db are updated when they're modified through non-standard paths.

But there is no reason to also update hardware DRs unless hardware breakpoints
are active or DR exiting is disabled, and in those cases updating hardware is
handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED.

KVM_DEBUGREG_RELOAD just causes unnecesarry load of hardware DRs and is better
to be removed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:14 -04:00
Paolo Bonzini
9a63b4517c Merge branch 'kvm-tdpmmu-fixes' into HEAD
Merge topic branch with fixes for 5.14-rc6 and 5.15 merge window.
2021-08-13 03:35:01 -04:00
Paolo Bonzini
6e949ddb0a Merge branch 'kvm-tdpmmu-fixes' into kvm-master
Merge topic branch with fixes for both 5.14-rc6 and 5.15.
2021-08-13 03:33:13 -04:00
Sean Christopherson
ce25681d59 KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.

Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations.  But underflow is the best case scenario.  The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.

Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok.  Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.

For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write.  If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.

Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled.  Holding mmu_lock for read also
prevents the indirect shadow page from being freed.  But as above, keep
it simple and always take the lock.

Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP.  Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).

Alternative #2, the unsync logic could be made thread safe.  In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).

Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:32:14 -04:00
Sean Christopherson
0103098fb4 KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
Set the min_level for the TDP iterator at the root level when zapping all
SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
SPTE will recursively zap all its children, thus there is no need for the
iterator to attempt to step down.  This avoids rereading the top-level
SPTEs after they are zapped by causing try_step_down() to short-circuit.

In most cases, optimizing try_step_down() will be in the noise as the cost
of zapping SPTEs completely dominates the overall time.  The optimization
is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
is zapping in response to multiple memslot updates when userspace is adding
and removing read-only memslots for option ROMs.  In that case, the task
doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
for read and thus can be a noisy neighbor of sorts.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:31:56 -04:00
Sean Christopherson
524a1e4e38 KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs
Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
paging, the "zap all" flows will skip top-level SPTEs whose range extends
beyond shadow_phys_bits and leak their SPs when the VM is destroyed.

Use the current upper bound (based on host.MAXPHYADDR) to detect that the
caller wants to zap all SPTEs, e.g. instead of using the max theoretical
gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.

Add a WARN on kmv->arch.tdp_mmu_pages when the TDP MMU is destroyed to
help future debuggers should KVM decide to leak SPTEs again.

The bug is most easily reproduced by running (and unloading!) KVM in a
VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.

  =============================================================================
  BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------
  Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
  CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   dump_stack_lvl+0x45/0x59
   slab_err+0x95/0xc9
   __kmem_cache_shutdown.cold+0x3c/0x158
   kmem_cache_destroy+0x3d/0xf0
   kvm_mmu_module_exit+0xa/0x30 [kvm]
   kvm_arch_exit+0x5d/0x90 [kvm]
   kvm_exit+0x78/0x90 [kvm]
   vmx_exit+0x1a/0x50 [kvm_intel]
   __x64_sys_delete_module+0x13f/0x220
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:31:46 -04:00
Sean Christopherson
18712c1370 KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF
Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
in L2 or if the VM-Exit should be forwarded to L1.  The current logic fails
to account for the case where #PF is intercepted to handle
guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
L1.  At best, L1 will complain and inject the #PF back into L2.  At
worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
page faults.

Note, while the bug was technically introduced by the commit that added
support for the MAXPHYADDR madness, the shame is all on commit
a0c134347b ("KVM: VMX: introduce vmx_need_pf_intercept").

Fixes: 1dbf5d68af ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
Cc: stable@vger.kernel.org
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812045615.3167686-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:58 -04:00
Junaid Shahid
85aa8889b8 kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault
When a nested EPT violation/misconfig is injected into the guest,
the shadow EPT PTEs associated with that address need to be synced.
This is done by kvm_inject_emulated_page_fault() before it calls
nested_ept_inject_page_fault(). However, that will only sync the
shadow EPT PTE associated with the current L1 EPTP. Since the ASID
is based on EP4TA rather than the full EPTP, so syncing the current
EPTP is not enough. The SPTEs associated with any other L1 EPTPs
in the prev_roots cache with the same EP4TA also need to be synced.

Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20210806222229.1645356-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:58 -04:00
Paolo Bonzini
375d1adebc Merge branch 'kvm-vmx-secctl' into kvm-master
Merge common topic branch for 5.14-rc6 and 5.15 merge window.
2021-08-13 03:20:18 -04:00
Paolo Bonzini
ffbe17cada KVM: x86: remove dead initialization
hv_vcpu is initialized again a dozen lines below, and at this
point vcpu->arch.hyperv is not valid.  Remove the initializer.

Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:18 -04:00
Sean Christopherson
1383279c64 KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels
Remove an ancient restriction that disallowed exposing EFER.NX to the
guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
The motivation of the check, added by commit 2cc51560ae ("KVM: VMX:
Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
the guest with the host's EFER.NX and thus avoid context switching EFER
if the only divergence was the NX bit.

Fast forward to today, and KVM has long since stopped running the guest
with the host's EFER.NX.  Not only does KVM context switch EFER if
host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
guest.EFER.NX=1 when using shadow paging (to emulate SMEP).  Furthermore,
the entire motivation for the restriction was made obsolete over a decade
ago when Intel added dedicated host and guest EFER fields in the VMCS
(Nehalem timeframe), which reduced the overhead of context switching EFER
from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.

In practice, the removed restriction only affects non-PAE 32-bit kernels,
as EFER.NX is set during boot if NX is supported and the kernel will use
PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
will actually use NX itself (mark PTEs non-executable).

Alternatively and/or complementarily, startup_32_smp() in head_32.S could
be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
the scenario where NX is supported but not enabled.  However, that runs
the risk of breaking non-KVM non-PAE kernels (though the risk is very,
very low as there are no known EFER.NX errata), and also eliminates an
easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
across nested virtualization transitions.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210805183804.1221554-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:17 -04:00
Krzysztof Wilczyński
e15ac2080e x86/PCI: Add pci_numachip_init() declaration
numachip.c defines pci_numachip_init(), but neglected to include its
declaration, causing the following sparse and compile time warnings:

  arch/x86/pci/numachip.c:108:12: warning: no previous prototype for function 'pci_numachip_init' [-Wmissing-prototypes]
  arch/x86/pci/numachip.c:108:12: warning: symbol 'pci_numachip_init' was not declared. Should it be static?

Include asm/numachip/numachip.h, which includes the missing declaration.

Link: https://lore.kernel.org/r/20210812171717.1471243-1-kw@linux.com
Signed-off-by: Krzysztof Wilczyński <kw@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-08-12 14:07:31 -05:00
Babu Moger
064855a690 x86/resctrl: Fix default monitoring groups reporting
Creating a new sub monitoring group in the root /sys/fs/resctrl leads to
getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes
on the entire filesystem.

Steps to reproduce:

  1. mount -t resctrl resctrl /sys/fs/resctrl/

  2. cd /sys/fs/resctrl/

  3. cat mon_data/mon_L3_00/mbm_total_bytes
     23189832

  4. Create sub monitor group:
  mkdir mon_groups/test1

  5. cat mon_data/mon_L3_00/mbm_total_bytes
     Unavailable

When a new monitoring group is created, a new RMID is assigned to the
new group. But the RMID is not active yet. When the events are read on
the new RMID, it is expected to report the status as "Unavailable".

When the user reads the events on the default monitoring group with
multiple subgroups, the events on all subgroups are consolidated
together. Currently, if any of the RMID reads report as "Unavailable",
then everything will be reported as "Unavailable".

Fix the issue by discarding the "Unavailable" reads and reporting all
the successful RMID reads. This is not a problem on Intel systems as
Intel reports 0 on Inactive RMIDs.

Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Reported-by: Paweł Szulik <pawel.szulik@intel.com>
Signed-off-by: Babu Moger <Babu.Moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213311
Link: https://lkml.kernel.org/r/162793309296.9224.15871659871696482080.stgit@bmoger-ubuntu
2021-08-12 20:12:20 +02:00
Randy Dunlap
839ad22f75 x86/tools: Fix objdump version check again
Skip (omit) any version string info that is parenthesized.

Warning: objdump version 15) is older than 2.19
Warning: Skipping posttest.

where 'objdump -v' says:
GNU objdump (GNU Binutils; SUSE Linux Enterprise 15) 2.35.1.20201123-7.18

Fixes: 8bee738bb1 ("x86: Fix objdump version check in chkobjdump.awk for different formats.")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210731000146.2720-1-rdunlap@infradead.org
2021-08-12 17:17:25 +02:00
Paul Gortmaker
a729691b54 x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
When this platform was relatively new in November 2011, with early BIOS
revisions, a reboot quirk was added in commit 6be30bb7d7 ("x86/reboot:
Blacklist Dell OptiPlex 990 known to require PCI reboot")

However, this quirk (and several others) are open-ended to all BIOS
versions and left no automatic expiry if/when the system BIOS fixed the
issue, meaning that nobody is likely to come along and re-test.

What is really problematic with using PCI reboot as this quirk does, is
that it causes this platform to do a full power down, wait one second,
and then power back on.  This is less than ideal if one is using it for
boot testing and/or bisecting kernels when legacy rotating hard disks
are installed.

It was only by chance that the quirk was noticed in dmesg - and when
disabled it turned out that it wasn't required anymore (BIOS A24), and a
default reboot would work fine without the "harshness" of power cycling the
machine (and disks) down and up like the PCI reboot does.

Doing a bit more research, it seems that the "newest" BIOS for which the
issue was reported[1] was version A06, however Dell[2] seemed to suggest
only up to and including version A05, with the A06 having a large number of
fixes[3] listed.

As is typical with a new platform, the initial BIOS updates come frequently
and then taper off (and in this case, with a revival for CPU CVEs); a
search for O990-A<ver>.exe reveals the following dates:

        A02     16 Mar 2011
        A03     11 May 2011
        A06     14 Sep 2011
        A07     24 Oct 2011
        A10     08 Dec 2011
        A14     06 Sep 2012
        A16     15 Oct 2012
        A18     30 Sep 2013
        A19     23 Sep 2015
        A20     02 Jun 2017
        A23     07 Mar 2018
        A24     21 Aug 2018

While it's overkill to flash and test each of the above, it would seem
likely that the issue was contained within A0x BIOS versions, given the
dates above and the dates of issue reports[4] from distros.  So rather than
just throw out the quirk entirely, limit the scope to just those early BIOS
versions, in case people are still running systems from 2011 with the
original as-shipped early A0x BIOS versions.

[1] https://lore.kernel.org/lkml/1320373471-3942-1-git-send-email-trenn@suse.de/
[2] https://www.dell.com/support/kbdoc/en-ca/000131908/linux-based-operating-systems-stall-upon-reboot-on-optiplex-390-790-990-systems
[3] https://www.dell.com/support/home/en-ca/drivers/driversdetails?driverid=85j10
[4] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/768039

Fixes: 6be30bb7d7 ("x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210530162447.996461-4-paul.gortmaker@windriver.com
2021-08-12 12:06:58 +02:00
Baokun Li
afc880cbb2 x86/power: Fix kernel-doc warnings in cpu.c
Fixes the following kernel-doc warnings:

 arch/x86/power/cpu.c:76: warning: Function parameter or
  member 'ctxt' not described in '__save_processor_state'
 arch/x86/power/cpu.c:192: warning: Function parameter or
  member 'ctxt' not described in '__restore_processor_state'

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210618022421.1595596-1-libaokun1@huawei.com
2021-08-12 10:15:40 +02:00
James Morse
111136e69c x86/resctrl: Make resctrl_arch_get_config() return its value
resctrl_arch_get_config() has no return, but does pass a single value
back via one of its arguments.

Return the value instead.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210811163831.14917-1-james.morse@arm.com
2021-08-11 18:42:53 +02:00
James Morse
5c3b63cdba x86/resctrl: Merge the CDP resources
resctrl uses struct rdt_resource to describe the available hardware
resources. The domains of the CDP aliases share a single ctrl_val[]
array. The only differences between the struct rdt_hw_resource aliases
is the name and conf_type.

The name from struct rdt_hw_resource is visible to user-space. To
support another architecture, as many user-visible details should be
handled in the filesystem parts of the code that is common to all
architectures. The name and conf_type go together.

Remove conf_type and the CDP aliases. When CDP is supported and enabled,
schemata_list_create() can create two schemata using the single
resource, generating the CODE/DATA suffix to the schema name itself.

This allows the alloc_ctrlval_array() and complications around free()ing
the ctrl_val arrays to be removed.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-25-james.morse@arm.com
2021-08-11 18:39:42 +02:00
James Morse
327364d5b6 x86/resctrl: Expand resctrl_arch_update_domains()'s msr_param range
resctrl_arch_update_domains() specifies the one closid that has been
modified and needs copying to the hardware.

resctrl_arch_update_domains() takes a struct rdt_resource and a closid
as arguments, but copies all the staged configurations for that closid
into the ctrl_val[] array.

resctrl_arch_update_domains() is called once per schema, but once the
resources and domains are merged, the second call of a L2CODE/L2DATA
pair will find no staged configurations, as they were previously
applied. The msr_param of the first call only has one index, so would
only have update the hardware for the last staged configuration.

To avoid a second round of IPIs when changing L2CODE and L2DATA in one
go, expand the range of the msr_param if multiple staged configurations
are found.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-24-james.morse@arm.com
2021-08-11 18:37:02 +02:00
James Morse
fbc06c6980 x86/resctrl: Remove rdt_cdp_peer_get()
When CDP is enabled, rdt_cdp_peer_get() finds the alternative
CODE/DATA resource and returns the alternative domain. This is used
to determine if bitmaps overlap when there are aliased entries
in the two struct rdt_hw_resources.

Now that the ctrl_val[] used by the CODE/DATA resources is the same,
the search for an alternate resource/domain is not needed.

Replace rdt_cdp_peer_get() with resctrl_peer_type(), which returns
the alternative type. This can be passed to resctrl_arch_get_config()
with the same resource and domain.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-23-james.morse@arm.com
2021-08-11 18:33:48 +02:00
James Morse
43ac1dbf61 x86/resctrl: Merge the ctrl_val arrays
Each struct rdt_hw_resource has its own ctrl_val[] array. When CDP is
enabled, two resources are in use, each with its own ctrl_val[] array
that holds half of the configuration used by hardware. One uses the odd
slots, the other the even. rdt_cdp_peer_get() is the helper to find the
alternate resource, its domain, and corresponding entry in the other
ctrl_val[] array.

Once the CDP resources are merged there will be one struct
rdt_hw_resource and one ctrl_val[] array for each hardware resource.
This will include changes to rdt_cdp_peer_get(), making it hard to
bisect any issue.

Merge the ctrl_val[] arrays for three CODE/DATA/NONE resources first.
Doing this before merging the resources temporarily complicates
allocating and freeing the ctrl_val arrays. Add a helper to allocate
the ctrl_val array, that returns the value on the L2 or L3 resource if
it already exists. This gets removed once the resources are merged, and
there really is only one ctrl_val[] array.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-22-james.morse@arm.com
2021-08-11 18:31:04 +02:00
James Morse
2b8dd4ab65 x86/resctrl: Calculate the index from the configuration type
resctrl uses cbm_idx() to map a closid to an index in the configuration
array. This is based on a multiplier and offset that are held in the
resource.

To merge the resources, the resctrl arch code needs to calculate the
index from something else, as there will only be one resource.

Decide based on the staged configuration type. This makes the static
mult and offset parameters redundant.

 [ bp: Remove superfluous brackets in get_config_index() ]

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-21-james.morse@arm.com
2021-08-11 18:19:06 +02:00
James Morse
2e7df368fc x86/resctrl: Apply offset correction when config is staged
When resctrl comes to copy the CAT MSR values from the ctrl_val[] array
into hardware, it applies an offset adjustment based on the type of
the resource. CODE and DATA resources have their closid mapped into an
odd/even range. This mapping is based on a property of the resource.

This happens once the new control value has been written to the ctrl_val[]
array. Once the CDP resources are merged, there will only be a single
property that needs to cover both odd/even mappings to the single
ctrl_val[] array. The offset adjustment must be applied before the new
value is written to the array.

Move the logic from cat_wrmsr() to resctrl_arch_update_domains(). The
value provided to apply_config() is now an index in the array, not the
closid. The parameters provided via struct msr_param are now indexes
too. As resctrl's use of closid is a u32, struct msr_param's type is
changed to match.

With this, the CODE and DATA resources only use the odd or even
indexes in the array. This allows the temporary num_closid/2 fixes in
domain_setup_ctrlval() and reset_all_ctrls() to be removed.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-20-james.morse@arm.com
2021-08-11 18:03:28 +02:00
James Morse
141739aa73 x86/resctrl: Make ctrlval arrays the same size
The CODE and DATA resources report a num_closid that is half the actual
size supported by the hardware. This behaviour is visible to user-space
when CDP is enabled.

The CODE and DATA resources have their own ctrlval arrays which are
half the size of the underlying hardware because num_closid was already
adjusted. One holds the odd configurations values, the other even.

Before the CDP resources can be merged, the 'half the closids' behaviour
needs to be implemented by schemata_list_create(), but this causes the
ctrl_val[] array to be full sized.

Remove the logic from the architecture specific rdt_get_cdp_config()
setup, and add it to schemata_list_create(). Functions that walk all the
configurations, such as domain_setup_ctrlval() and reset_all_ctrls(),
take num_closid directly from struct rdt_hw_resource also have
to halve num_closid as only the lower half of each array is in
use. domain_setup_ctrlval() and reset_all_ctrls() both copy struct
rdt_hw_resource's num_closid to a struct msr_param. Correct the value
here.

This is temporary as a subsequent patch will merge all three ctrl_val[]
arrays such that when CDP is in use, the CODA/DATA layout in the array
matches the hardware. reset_all_ctrls()'s loop over the whole of
ctrl_val[] is not touched as this is harmless, and will be required as
it is once the resources are merged.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-19-james.morse@arm.com
2021-08-11 17:58:33 +02:00
James Morse
fa8f711d2f x86/resctrl: Pass configuration type to resctrl_arch_get_config()
The ctrl_val[] array for a struct rdt_hw_resource only holds
configurations of one type. The type is implicit.

Once the CDP resources are merged, the ctrl_val[] array will hold all
the configurations for the hardware resource. When a particular type of
configuration is needed, it must be specified explicitly.

Pass the expected type from the schema into resctrl_arch_get_config().
Nothing uses this yet, but once a single ctrl_val[] array is used for
the three struct rdt_hw_resources that share hardware, the type will be
used to return the correct configuration value from the shared array.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-18-james.morse@arm.com
2021-08-11 17:53:53 +02:00
James Morse
f07e9d0250 x86/resctrl: Add a helper to read a closid's configuration
Functions like show_doms() reach into the architecture's private
structure to retrieve the configuration from the struct rdt_hw_resource.

The hardware configuration may look completely different to the
values resctrl gets from user-space. The staged configuration and
resctrl_arch_update_domains() allow the architecture to convert or
translate these values.

Resctrl shouldn't read or write the ctrl_val[] values directly. Add
a helper to read the current configuration. This will allow another
architecture to scale the bitmaps if necessary, and possibly use
controls that don't take the user-space control format at all.

Of the remaining functions that access ctrl_val[] directly,
apply_config() is part of the architecture-specific code, and is
called via resctrl_arch_update_domains(). reset_all_ctrls() will be an
architecture specific helper.

update_mba_bw() manipulates both ctrl_val[], mbps_val[] and the
hardware. The mbps_val[] that matches the mba_sc state of the resource
is changed, but the other is left unchanged. Abstracting this is the
subject of later patches that affect set_mba_sc() too.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-17-james.morse@arm.com
2021-08-11 17:46:34 +02:00
James Morse
2e6678195d x86/resctrl: Rename update_domains() to resctrl_arch_update_domains()
update_domains() merges the staged configuration changes into the arch
codes configuration array. Rename to make it clear it is part of the
arch code interface to resctrl.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-16-james.morse@arm.com
2021-08-11 16:36:58 +02:00
James Morse
75408e4350 x86/resctrl: Allow different CODE/DATA configurations to be staged
Before the CDP resources can be merged, struct rdt_domain will need an
array of struct resctrl_staged_config, one per type of configuration.

Use the type as an index to the array to ensure that a schema
configuration string can't specify the same domain twice. This will
allow two schemata to apply configuration changes to one resource.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-15-james.morse@arm.com
2021-08-11 16:33:42 +02:00
James Morse
e8f7282552 x86/resctrl: Group staged configuration into a separate struct
When configuration changes are made, the new value is written to struct
rdt_domain's new_ctrl field and the have_new_ctrl flag is set. Later
new_ctrl is copied to hardware by a call to update_domains().

Once the CDP resources are merged, there will be one new_ctrl field in
use by two struct resctrl_schema requiring a per-schema IPI to copy the
value to hardware.

Move new_ctrl and have_new_ctrl into a new struct resctrl_staged_config.
Before the CDP resources can be merged, struct rdt_domain will need an
array of these, one per type of configuration. Using the type as an
index to the array will ensure that a schema configuration string can't
specify the same domain twice.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-14-james.morse@arm.com
2021-08-11 16:32:32 +02:00
James Morse
e198fde3fe x86/resctrl: Move the schemata names into struct resctrl_schema
resctrl 'info' directories and schema parsing use the schema name.
This lives in the struct rdt_resource, and is specified by the
architecture code.

Once the CDP resources are merged, there will only be one resource (and
one name) in use by two schemata. To allow the CDP CODE/DATA property to
be the type of configuration the schema uses, the name should also be
per-schema.

Add a name field to struct resctrl_schema, and use this wherever
the schema name is exposed (or read from) user-space. Calculating
max_name_width for padding the schemata file also moves as this is
visible to user-space. As the names in struct rdt_resource already
include the CDP information, schemata_list_create() copies them.

schemata_list_create() includes the length of the CDP suffix when
calculating max_name_width in preparation for CDP resources being
merged.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-13-james.morse@arm.com
2021-08-11 16:21:35 +02:00
James Morse
c091e90721 x86/resctrl: Add a helper to read/set the CDP configuration
Whether CDP is enabled for a hardware resource like the L3 cache can be
found by inspecting the alloc_enabled flags of the L3CODE/L3DATA struct
rdt_hw_resources, even if they aren't in use.

Once these resources are merged, the flags can't be compared. Whether
CDP is enabled needs tracking explicitly. If another architecture is
emulating CDP the behaviour may not be per-resource. 'cdp_capable' needs
to be visible to resctrl, even if its not in use, as this affects the
padding of the schemata table visible to user-space.

Add cdp_enabled to struct rdt_hw_resource and cdp_capable to struct
rdt_resource. Add resctrl_arch_set_cdp_enabled() to let resctrl enable
or disable CDP on a resource. resctrl_arch_get_cdp_enabled() lets it
read the current state.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-12-james.morse@arm.com
2021-08-11 15:54:26 +02:00
James Morse
32150edd3f x86/resctrl: Swizzle rdt_resource and resctrl_schema in pseudo_lock_region
struct pseudo_lock_region points to the rdt_resource.

Once the resources are merged, this won't be unique. The resource name
is moving into the schema, so that the filesystem portions of resctrl can
generate it.

Swap pseudo_lock_region's rdt_resource pointer for a schema pointer.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-11-james.morse@arm.com
2021-08-11 15:51:45 +02:00
James Morse
1c290682c0 x86/resctrl: Pass the schema to resctrl filesystem functions
Once the CDP resources are merged, there will be two struct
resctrl_schema for one struct rdt_resource. CDP becomes a type of
configuration that belongs to the schema.

Helpers like rdtgroup_cbm_overlaps() need access to the schema to query
the configuration (or configurations) based on schema properties.

Change these functions to take a struct schema instead of the struct
rdt_resource. All the modified functions are part of the filesystem code
that will move to /fs/resctrl once it is possible to support a second
architecture.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-10-james.morse@arm.com
2021-08-11 15:43:54 +02:00
James Morse
eb6f318769 x86/resctrl: Add resctrl_arch_get_num_closid()
To initialise struct resctrl_schema's num_closid, schemata_list_create()
reaches into the architectures private structure to retrieve num_closid
from the struct rdt_hw_resource. The 'half the closids' behaviour should
be part of the filesystem parts of resctrl that are the same on any
architecture. struct resctrl_schema's num_closid should include any
correction for CDP.

Having two properties called num_closid is likely to be confusing when
they have different values.

Add a helper to read the resource's num_closid from the arch code.
This should return the number of closid that the resource supports,
regardless of whether CDP is in use. Once the CDP resources are merged,
schemata_list_create() can apply the correction itself.

Using a type with an obvious size for the arch helper means changing the
type of num_closid to u32, which matches the type already used by struct
rdtgroup.

reset_all_ctrls() does not use resctrl_arch_get_num_closid(), even
though it sets up a structure for modifying the hardware. This function
will be part of the architecture code, the maximum closid should be the
maximum value the hardware has, regardless of the way resctrl is using
it. All the uses of num_closid in core.c are naturally part of the
architecture specific code.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-9-james.morse@arm.com
2021-08-11 15:35:42 +02:00
James Morse
3183e87c1b x86/resctrl: Store the effective num_closid in the schema
Struct resctrl_schema holds properties that vary with the style of
configuration that resctrl applies to a resource. There are already
two values for the hardware's num_closid, depending on whether the
architecture presents the L3 or L3CODE/L3DATA resources.

As the way CDP changes the number of control groups that resctrl can
create is part of the user-space interface, it should be managed by the
filesystem parts of resctrl. This allows the architecture code to only
describe the value the hardware supports.

Add num_closid to resctrl_schema. This is the value seen by the
filesystem, which may be different to the maximum value described by the
arch code when CDP is enabled.

These functions operate on the num_closid value that is exposed to
user-space:

  * rdtgroup_parse_resource()
  * rdtgroup_schemata_show()
  * rdt_num_closids_show()
  * closid_init()

Change them to use the schema value instead. schemata_list_create() sets
this value, and reaches into the architecture-specific structure to get
the value. This will eventually be replaced with a helper.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-8-james.morse@arm.com
2021-08-11 15:24:27 +02:00
James Morse
331ebe4c43 x86/resctrl: Walk the resctrl schema list instead of an arch list
When parsing a schema configuration value from user-space, resctrl walks
the architectures rdt_resources_all[] array to find a matching struct
rdt_resource.

Once the CDP resources are merged there will be one resource in use
by two schemata. Anything walking rdt_resources_all[] on behalf of a
user-space request should walk the list of struct resctrl_schema
instead.

Change the users of for_each_alloc_enabled_rdt_resource() to walk the
schema instead. Schemata were only created for alloc_enabled resources
so these two lists are currently equivalent.

schemata_list_create() and rdt_kill_sb() are ignored. The first
creates the schema list, and will eventually loop over the resource
indexes using an arch helper to retrieve the resource. rdt_kill_sb()
will eventually make use of an arch 'reset everything' helper.

After the filesystem code is moved, rdtgroup_pseudo_locked_in_hierarchy()
remains part of the x86 specific hooks to support pseudo lock. This
code walks each domain, and still does this after the separate resources
are merged.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-7-james.morse@arm.com
2021-08-11 13:20:43 +02:00
James Morse
208ab16847 x86/resctrl: Label the resources with their configuration type
The names of resources are used for the schema name presented to
user-space. The name used is rooted in a structure provided by the
architecture code because the names are different when CDP is enabled.
x86 implements this by swapping between two sets of resource structures
based on their alloc_enabled flag. The type of configuration in-use is
encoded in the name (and cbm_idx_offset).

Once the CDP behaviour is moved into the parts of resctrl that will
move to /fs/, there will be two struct resctrl_schema for one struct
rdt_resource. The schema describes the type of configuration being
applied to the resource. The name of the schema should be generated
by resctrl, base on the type of configuration. To do this struct
resctrl_schema needs to store the type of configuration in use for a
schema.

Create an enum resctrl_conf_type describing the options, and add it to
struct resctrl_schema. The underlying resources are still separate, as
cbm_idx_offset is still in use.

Temporarily label all the entries in rdt_resources_all[] and copy that
value to struct resctrl_schema. Copying the value ensures there is no
mismatch while the filesystem parts of resctrl are modified to use the
schema. Once the resources are merged, the filesystem code can assign
this value based on the schema being created.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-6-james.morse@arm.com
2021-08-11 13:13:18 +02:00
James Morse
f259449230 x86/resctrl: Pass the schema in info dir's private pointer
Many of resctrl's per-schema files return a value from struct
rdt_resource, which they take as their 'priv' pointer.

Moving properties that resctrl exposes to user-space into the core 'fs'
code, (e.g. the name of the schema), means some of the functions that
back the filesystem need the schema struct (to where the properties are
moved), but currently take struct rdt_resource. For example, once the
CDP resources are merged, struct rdt_resource no longer reflects all the
properties of the schema.

For the info dirs that represent a control, the information needed
will be accessed via struct resctrl_schema, as this is how the resource
is being used. For the monitors, its still struct rdt_resource as the
monitors aren't described as schema.

This difference means the type of the private pointers varies between
control and monitor info dirs.

Change the 'priv' pointer to point to struct resctrl_schema for
the per-schema files that represent a control. The type can be
determined from the fflags field. If the flags are RF_MON_INFO, its
a struct rdt_resource. If the flags are RF_CTRL_INFO, its a struct
resctrl_schema. No entry in res_common_files[] has both flags.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-5-james.morse@arm.com
2021-08-11 12:41:19 +02:00
James Morse
cdb9ebc917 x86/resctrl: Add a separate schema list for resctrl
Resctrl exposes schemata to user-space, which allow the control values
to be specified for a group of tasks.

User-visible properties of the interface, (such as the schemata names
and how the values are parsed) are rooted in a struct provided by the
architecture code. (struct rdt_hw_resource). Once a second architecture
uses resctrl, this would allow user-visible properties to diverge
between architectures.

These properties should come from the resctrl code that will be common
to all architectures. Resctrl has no per-schema structure, only struct
rdt_{hw_,}resource. Create a struct resctrl_schema to hold the
rdt_resource. Before a second architecture can be supported, this
structure will also need to hold the schema name visible to user-space
and the type of configuration values for resctrl.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-4-james.morse@arm.com
2021-08-11 12:28:01 +02:00
James Morse
792e0f6f78 x86/resctrl: Split struct rdt_domain
resctrl is the defacto Linux ABI for SoC resource partitioning features.

To support it on another architecture, it needs to be abstracted from
the features provided by Intel RDT and AMD PQoS, and moved to /fs/.
struct rdt_domain contains a mix of architecture private details and
properties of the filesystem interface user-space uses.

Continue by splitting struct rdt_domain, into an architecture private
'hw' struct, which contains the common resctrl structure that would be
used by any architecture. The hardware values in ctrl_val and mbps_val
need to be accessed via helpers to allow another architecture to convert
these into a different format if necessary. After this split, filesystem
code paths touching a 'hw' struct indicates where an abstraction is
needed.

Splitting this structure only moves types around, and should not lead
to any change in behaviour.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-3-james.morse@arm.com
2021-08-11 12:00:43 +02:00
James Morse
63c8b12319 x86/resctrl: Split struct rdt_resource
resctrl is the defacto Linux ABI for SoC resource partitioning features.

To support it on another architecture, it needs to be abstracted from
the features provided by Intel RDT and AMD PQoS, and moved to /fs/.
struct rdt_resource contains a mix of architecture private details
and properties of the filesystem interface user-space uses.

Start by splitting struct rdt_resource, into an architecture private
'hw' struct, which contains the common resctrl structure that would be
used by any architecture. The foreach helpers are most commonly used by
the filesystem code, and should return the common resctrl structure.
for_each_rdt_resource() is changed to walk the common structure in its
parent arch private structure.

Move as much of the structure as possible into the common structure
in the core code's header file. The x86 hardware accessors remain
part of the architecture private code, as do num_closid, mon_scale
and mbm_width.

mon_scale and mbm_width are used to detect overflow of the hardware
counters, and convert them from their native size to bytes. Any
cross-architecture abstraction should be in terms of bytes, making
these properties private.

The hardware's num_closid is kept in the private structure to force the
filesystem code to use a helper to access it. MPAM would return a single
value for the system, regardless of the resource. Using the helper
prevents this field from being confused with the version of num_closid
that is being exposed to user-space (added in a later patch).

After this split, filesystem code touching a 'hw' struct indicates
where an abstraction is needed.

Splitting this structure only moves types around, and should not lead
to any change in behaviour.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-2-james.morse@arm.com
2021-08-11 11:51:34 +02:00
Maciej W. Rozycki
34739a2809 x86: Fix typo s/ECLR/ELCR/ for the PIC register
The proper spelling for the acronym referring to the Edge/Level Control 
Register is ELCR rather than ECLR.  Adjust references accordingly.  No 
functional change.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107200251080.9461@angie.orcam.me.uk
2021-08-10 23:31:44 +02:00
Maciej W. Rozycki
d253166168 x86: Avoid magic number with ELCR register accesses
Define PIC_ELCR1 and PIC_ELCR2 macros for accesses to the ELCR registers 
implemented by many chipsets in their embedded 8259A PIC cores, avoiding 
magic numbers that are difficult to handle, and complementing the macros 
we already have for registers originally defined with discrete 8259A PIC 
implementations.  No functional change.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107200237300.9461@angie.orcam.me.uk
2021-08-10 23:31:43 +02:00
Maciej W. Rozycki
0e8c6f56fa x86/PCI: Add support for the Intel 82426EX PIRQ router
The Intel 82426EX ISA Bridge (IB), a part of the Intel 82420EX PCIset, 
implements PCI interrupt steering with a PIRQ router in the form of two 
PIRQ Route Control registers, available in the PCI configuration space 
at locations 0x66 and 0x67 for the PIRQ0# and PIRQ1# lines respectively.

The semantics is the same as with the PIIX router, however it is not
clear if BIOSes use register indices or line numbers as the cookie to
identify PCI interrupts in their routing tables and therefore support
either scheme.

The IB is directly attached to the Intel 82425EX PCI System Controller 
(PSC) component of the chipset via a dedicated PSC/IB Link interface 
rather than the host bus or PCI.  Therefore it does not itself appear in 
the PCI configuration space even though it responds to configuration 
cycles addressing registers it implements.  Use 82425EX's identification 
then for determining the presence of the IB.

References:

[1] "82420EX PCIset Data Sheet, 82425EX PCI System Controller (PSC) and 
    82426EX ISA Bridge (IB)", Intel Corporation, Order Number: 
    290488-004, December 1995, Section 3.3.18 "PIRQ1RC/PIRQ0RC--PIRQ 
    Route Control Registers", p. 61

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107200213490.9461@angie.orcam.me.uk
2021-08-10 23:31:43 +02:00
Maciej W. Rozycki
6b79164f60 x86/PCI: Add support for the Intel 82374EB/82374SB (ESC) PIRQ router
The Intel 82374EB/82374SB EISA System Component (ESC) devices implement 
PCI interrupt steering with a PIRQ router[1] in the form of four PIRQ 
Route Control registers, available in the port I/O space accessible 
indirectly via the index/data register pair at 0x22/0x23, located at 
indices 0x60/0x61/0x62/0x63 for the PIRQ0/1/2/3# lines respectively.  

The semantics is the same as with the PIIX router, however it is not 
clear if BIOSes use register indices or line numbers as the cookie to 
identify PCI interrupts in their routing tables and therefore support 
either scheme.

Accesses to the port I/O space concerned here need to be unlocked by 
writing the value of 0x0f to the ESC ID Register at index 0x02 
beforehand[2].  Do so then and then lock access after use for safety. 

This locking could possibly interfere with accesses to the Intel MP spec 
IMCR register, implemented by the 82374SB variant of the ESC only as the 
PCI/APIC Control Register at index 0x70[3], for which leaving access to 
the configuration space concerned unlocked may have been a requirement 
for the BIOS to remain compliant with the MP spec.  However we only poke 
at the IMCR register if the APIC mode is used, in which case the PIRQ 
router is not, so this arrangement is not going to interfere with IMCR 
access code.

The ESC is implemented as a part of the combined southbridge also made 
of 82375EB/82375SB PCI-EISA Bridge (PCEB) and does itself appear in the 
PCI configuration space.  Use the PCEB's device identification then for
determining the presence of the ESC.

References:

[1] "82374EB/82374SB EISA System Component (ESC)", Intel Corporation, 
    Order Number: 290476-004, March 1996, Section 3.1.12 
    "PIRQ[0:3]#--PIRQ Route Control Registers", pp. 44-45

[2] same, Section 3.1.1 "ESCID--ESC ID Register", p. 36

[3] same, Section 3.1.17 "PAC--PCI/APIC Control Register", p. 47

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107192023450.9461@angie.orcam.me.uk
2021-08-10 23:31:43 +02:00
Maciej W. Rozycki
1ce849c755 x86/PCI: Add support for the ALi M1487 (IBC) PIRQ router
The ALi M1487 ISA Bus Controller (IBC), a part of the ALi FinALi 486 
chipset, implements PCI interrupt steering with a PIRQ router[1] in the 
form of four 4-bit mappings, spread across two PCI INTx Routing Table 
Mapping Registers, available in the port I/O space accessible indirectly 
via the index/data register pair at 0x22/0x23, located at indices 0x42 
and 0x43 for the INT1/INT2 and INT3/INT4 lines respectively.

Additionally there is a separate PCI INTx Sensitivity Register at index 
0x44 in the same port I/O space, whose bits 3:0 select the trigger mode 
for INT[4:1] lines respectively[2].  Manufacturer's documentation says 
that this register has to be set consistently with the relevant ELCR 
register[3].  Add a router-specific hook then and use it to handle this 
register.

Accesses to the port I/O space concerned here need to be unlocked by 
writing the value of 0xc5 to the Lock Register at index 0x03 
beforehand[4].  Do so then and then lock access after use for safety.

The IBC is implemented as a peer bridge on the host bus rather than a 
southbridge on PCI and therefore it does not itself appear in the PCI 
configuration space.  It is complemented by the M1489 Cache-Memory PCI 
Controller (CMP) host-to-PCI bridge, so use that device's identification 
for determining the presence of the IBC.

References:

[1] "M1489/M1487: 486 PCI Chip Set", Version 1.2, Acer Laboratories 
    Inc., July 1997, Section 4: "Configuration Registers", pp. 76-77

[2] same, p. 77

[3] same, Section 5: "M1489/M1487 Software Programming Guide", pp. 
    99-100

[4] same, Section 4: "Configuration Registers", p. 37

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107191702020.9461@angie.orcam.me.uk
2021-08-10 23:31:43 +02:00
Maciej W. Rozycki
fb6a0408ea x86: Add support for 0x22/0x23 port I/O configuration space
Define macros and accessors for the configuration space addressed 
indirectly with an index register and a data register at the port I/O 
locations of 0x22 and 0x23 respectively.

This space is defined by the Intel MultiProcessor Specification for the 
IMCR register used to switch between the PIC and the APIC mode[1], by 
Cyrix processors for their configuration[2][3], and also some chipsets.

Given the lack of atomicity with the indirect addressing a spinlock is 
required to protect accesses, although for Cyrix processors it is enough 
if accesses are executed with interrupts locally disabled, because the 
registers are local to the accessing CPU, and IMCR is only ever poked at 
by the BSP and early enough for interrupts not to have been configured 
yet.  Therefore existing code does not have to change or use the new 
spinlock and neither it does.

Put the spinlock in a library file then, so that it does not get pulled 
unnecessarily for configurations that do not refer it.

Convert Cyrix accessors to wrappers so as to retain the brevity and 
clarity of the `getCx86' and `setCx86' calls.

References:

[1] "MultiProcessor Specification", Version 1.4, Intel Corporation, 
    Order Number: 242016-006, May 1997, Section 3.6.2.1 "PIC Mode", pp. 
    3-7, 3-8

[2] "5x86 Microprocessor", Cyrix Corporation, Order Number: 94192-00, 
    July 1995, Section 2.3.2.4 "Configuration Registers", p. 2-23

[3] "6x86 Processor", Cyrix Corporation, Order Number: 94175-01, March 
    1996, Section 2.4.4 "6x86 Configuration Registers", p. 2-23

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107182353140.9461@angie.orcam.me.uk
2021-08-10 23:31:43 +02:00
Paolo Bonzini
c3e9434c98 Merge branch 'kvm-vmx-secctl' into HEAD
Merge common topic branch for 5.14-rc6 and 5.15 merge window.
2021-08-10 13:45:26 -04:00
Sean Christopherson
7b9cae027b KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation
Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to
effectively get the controls for the current VMCS, as opposed to using
vmx->secondary_exec_controls, which is the cached value of KVM's desired
controls for vmcs01 and truly not reflective of any particular VMCS.

While the waitpkg control is not dynamic, i.e. vmcs01 will always hold
the same waitpkg configuration as vmx->secondary_exec_controls, the same
does not hold true for vmcs02 if the L1 VMM hides the feature from L2.
If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL,
L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP.

Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-10 13:32:09 -04:00
Sebastian Andrzej Siewior
8ae9e3f638 x86/mce/inject: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-10-bigeasy@linutronix.de
2021-08-10 14:46:27 +02:00
Sebastian Andrzej Siewior
2089f34f8c x86/microcode: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-9-bigeasy@linutronix.de
2021-08-10 14:46:27 +02:00
Sebastian Andrzej Siewior
1a351eefd4 x86/mtrr: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-8-bigeasy@linutronix.de
2021-08-10 14:46:27 +02:00
Sebastian Andrzej Siewior
77ad320cfb x86/mmiotrace: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210803141621.780504-7-bigeasy@linutronix.de
2021-08-10 14:46:27 +02:00
Thomas Gleixner
ff363f480e x86/msi: Force affinity setup before startup
The X86 MSI mechanism cannot handle interrupt affinity changes safely after
startup other than from an interrupt handler, unless interrupt remapping is
enabled. The startup sequence in the generic interrupt code violates that
assumption.

Mark the irq chips with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
the default interrupt setting happens before the interrupt is started up
for the first time.

While the interrupt remapping MSI chip does not require this, there is no
point in treating it differently as this might spare an interrupt to a CPU
which is not in the default affinity mask.

For the non-remapping case go to the direct write path when the interrupt
is not yet started similar to the not yet activated case.

Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.886722080@linutronix.de
2021-08-10 10:59:21 +02:00
Thomas Gleixner
0c0e37dc11 x86/ioapic: Force affinity setup before startup
The IO/APIC cannot handle interrupt affinity changes safely after startup
other than from an interrupt handler. The startup sequence in the generic
interrupt code violates that assumption.

Mark the irq chip with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
the default interrupt setting happens before the interrupt is started up
for the first time.

Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.832143400@linutronix.de
2021-08-10 10:59:21 +02:00
Masahiro Yamada
d828563955 kbuild: do not require sub-make for separate output tree builds
As explained in commit 3204a7fb98 ("kbuild: prefix $(srctree)/ to some
included Makefiles"), I want to stop using --include-dir some day.

I already fixed up the top Makefile, but some arch Makefiles (mips, um,
x86) still include check-in Makefiles without $(srctree)/.

Fix them up so 'need-sub-make := 1' can go away for this case.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-08-10 08:23:39 +09:00
Logan Gunthorpe
183dc86335 x86/amd_gart: don't set failed sg dma_address to DMA_MAPPING_ERROR
Setting the ->dma_address to DMA_MAPPING_ERROR is not part of
the ->map_sg calling convention, so remove it.

Link: https://lore.kernel.org/linux-mips/20210716063241.GC13345@lst.de/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-09 17:13:06 +02:00
Martin Oliveira
fcacc8a614 x86/amd_gart: return error code from gart_map_sg()
The .map_sg() op now expects an error code instead of zero on failure.

So make __dma_map_cont() return a valid errno (which is then propagated
to gart_map_sg() via dma_map_cont()) and return it in case of failure.

Also, return -EINVAL in case of invalid nents.

Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-09 17:13:06 +02:00
Linus Torvalds
74eedeba45 A set of perf fixes:
- Correct the permission checks for perf event which send SIGTRAP to a
    different process and clean up that code to be more readable.
 
  - Prevent an out of bound MSR access in the x86 perf code which happened
    due to an incomplete limiting to the actually available hardware
    counters.
 
  - Prevent access to the AMD64_EVENTSEL_HOSTONLY bit when running inside a
    guest.
 
  - Handle small core counter re-enabling correctly by issuing an ACK right
    before reenabling it to prevent a stale PEBS record being kept around.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEPv6UTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYob8hD/wMmRLAoc/uvJIIICJ+IQVnnU8WToIS
 Qy1dAPpQMz6pQpRQor1AGpcP89IMnLVhZn84lsd+kw0/Lv630JbWsXvQ8jB2GPHn
 17XewPp4l4PDUgKaGEKIjPSjsmnZmzOLTYIy5gWOfA/h5EG/1D+ozvcRGDMaXWUw
 +65Pinaf2QKfjYZV11SVJMLF5zLYUxMc6vRag00WrcPxd+JO4eVeV36g0LTmhABW
 fOSDcBOSVrT2w9MYDpNmPvMh3dN2vlfhrEk10NBKslx8uk4t8sV/Jbs+48WhydKa
 zmdqthtjIekRUSxhiHJve70D9ngveCBSKQDp0Us2BWWxdnM0+HV6ozjuxO0julCH
 5tW4413fz2AoZJhWkTn3PE4nPG3apRCnL2B+jTFHHqCjKSkkrNDRJDOEUwasXjV5
 jn25DLhOq5ltkMrLFDTV/h2RZqU0fAMV2iwNSkjD3lVLgKt6B3/uSnvE9SXmaJjs
 njk/1LzeWwY+sk7YYXouPQ2STEDCKvOJGYZSS5pFA03mVaQgfuJxpyHKH+7nj9tV
 k0FLDLMmSucYIWBq0iapa8cR69e0ZIE48hSNR3AOIIOVh3LusmA4HkogOAQG7kdZ
 P2nKQUdN+SR8rL9KQRauP63J508fg0kkXNgSAm1lFWBDnFKt6shkkHGcL+5PzxJW
 1Bjx2wc52Ww84A==
 =hhv+
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "A set of perf fixes:

   - Correct the permission checks for perf event which send SIGTRAP to
     a different process and clean up that code to be more readable.

   - Prevent an out of bound MSR access in the x86 perf code which
     happened due to an incomplete limiting to the actually available
     hardware counters.

   - Prevent access to the AMD64_EVENTSEL_HOSTONLY bit when running
     inside a guest.

   - Handle small core counter re-enabling correctly by issuing an ACK
     right before reenabling it to prevent a stale PEBS record being
     kept around"

* tag 'perf-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Apply mid ACK for small core
  perf/x86/amd: Don't touch the AMD64_EVENTSEL_HOSTONLY bit inside the guest
  perf/x86: Fix out of bound MSR access
  perf: Refactor permissions check into perf_check_permission()
  perf: Fix required permissions if sigtrap is requested
2021-08-08 11:46:13 -07:00
Linus Torvalds
4972bb90c3 Kbuild fixes for v5.14 (2nd)
- Correct the Extended Regular Expressions in tools
 
  - Adjust scripts/checkversion.pl for the current Kbuild
 
  - Unset sub_make_done for 'make install' to make DKMS working again
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmEOl2QVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8ikP/1LjJPbB3PjFtVU24TD78z4ztfHq
 xCtwOWzcP5FMXcb5sQWGc0UjwlJq3+meIm8rcRqJfKSUSRxrtUyKm9llwK0sFezF
 GfC84AGKNwJwCAAoxZ7bpqmlQw7HnIGsrk9mzkw/NWa19nUMm3D4Oaek3KMdumdV
 BYPmm4AzTuyXah4a1ZZxmR/47WRty37jIBELAkpQyqhgFrxz420weewEMUiL53cv
 ipaXSluD1v+9ezJi5VBtsedC5TTUYPsqPfmwGaI6QNX4rT/kNNxj1e478JnAtkPM
 CKAbR/0pswWOvlYiDpdVTVmmyigYznCRsuwOwBp0DVYLlshVnCItKiv1rrhUHpED
 1m5jExu5NFoFNhHYTPoOxAj34AlQ7PQAN+M14cklvv1DNrtunR5QbVPEj7PMEd0W
 O5orQag4OqQQ2hqz/q57+FhX3QjijcGHwutjfxb/wfY84+q+e4QjaVITzpD92Yvq
 6j/FhDBE9Z0ZaznF1zgxghM72995n8HW6ZkCGDg6etfUi2aSeeNxFle1OYAtRtmp
 ZCefhAnPsUVjTvwzOZ/43ukUjW20o4uR/I/25MFVdQbFGDnCYpbC/RJDyJK2VxqY
 yznpY6sI9LbpbzxwXUzB5DyAosaExOi1iUJ0NK2YZ47lp7RVAxpvWBEr1VT7m37W
 WF7TETMdF5IV9lFn
 =M+2q
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Correct the Extended Regular Expressions in tools

 - Adjust scripts/checkversion.pl for the current Kbuild

 - Unset sub_make_done for 'make install' to make DKMS work again

* tag 'kbuild-fixes-v5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: cancel sub_make_done for the install target to fix DKMS
  scripts: checkversion: modernize linux/version.h search strings
  mips: Fix non-POSIX regexp
  x86/tools/relocs: Fix non-POSIX regexp
2021-08-07 10:03:02 -07:00
Kan Liang
acade63799 perf/x86/intel: Apply mid ACK for small core
A warning as below may be occasionally triggered in an ADL machine when
these conditions occur:

 - Two perf record commands run one by one. Both record a PEBS event.
 - Both runs on small cores.
 - They have different adaptive PEBS configuration (PEBS_DATA_CFG).

  [ ] WARNING: CPU: 4 PID: 9874 at arch/x86/events/intel/ds.c:1743 setup_pebs_adaptive_sample_data+0x55e/0x5b0
  [ ] RIP: 0010:setup_pebs_adaptive_sample_data+0x55e/0x5b0
  [ ] Call Trace:
  [ ]  <NMI>
  [ ]  intel_pmu_drain_pebs_icl+0x48b/0x810
  [ ]  perf_event_nmi_handler+0x41/0x80
  [ ]  </NMI>
  [ ]  __perf_event_task_sched_in+0x2c2/0x3a0

Different from the big core, the small core requires the ACK right
before re-enabling counters in the NMI handler, otherwise a stale PEBS
record may be dumped into the later NMI handler, which trigger the
warning.

Add a new mid_ack flag to track the case. Add all PMI handler bits in
the struct x86_hybrid_pmu to track the bits for different types of
PMUs.  Apply mid ACK for the small cores on an Alder Lake machine.

The existing hybrid() macro has a compile error when taking address of
a bit-field variable. Add a new macro hybrid_bit() to get the
bit-field value of a given PMU.

Fixes: f83d2f91d2 ("perf/x86/intel: Add Alder Lake Hybrid support")
Reported-by: Ammy Yi <ammy.yi@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Ammy Yi <ammy.yi@intel.com>
Link: https://lkml.kernel.org/r/1627997128-57891-1-git-send-email-kan.liang@linux.intel.com
2021-08-06 14:25:15 +02:00
David Matlack
93e083d4f4 KVM: x86/mmu: Rename __gfn_to_rmap to gfn_to_rmap
gfn_to_rmap was removed in the previous patch so there is no need to
retain the double underscore on __gfn_to_rmap.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06 07:52:58 -04:00
David Matlack
601f8af01e KVM: x86/mmu: Leverage vcpu->last_used_slot for rmap_add and rmap_recycle
rmap_add() and rmap_recycle() both run in the context of the vCPU and
thus we can use kvm_vcpu_gfn_to_memslot() to look up the memslot. This
enables rmap_add() and rmap_recycle() to take advantage of
vcpu->last_used_slot and avoid expensive memslot searching.

This change improves the performance of "Populate memory time" in
dirty_log_perf_test with tdp_mmu=N. In addition to improving the
performance, "Populate memory time" no longer scales with the number
of memslots in the VM.

Command                         | Before           | After
------------------------------- | ---------------- | -------------
./dirty_log_perf_test -v64 -x1  | 15.18001570s     | 14.99469366s
./dirty_log_perf_test -v64 -x64 | 18.71336392s     | 14.98675076s

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-6-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06 07:52:29 -04:00
David Matlack
081de470f1 KVM: x86/mmu: Leverage vcpu->last_used_slot in tdp_mmu_map_handle_target_level
The existing TDP MMU methods to handle dirty logging are vcpu-agnostic
since they can be driven by MMU notifiers and other non-vcpu-specific
events in addition to page faults. However this means that the TDP MMU
is not benefiting from the new vcpu->last_used_slot. Fix that by
introducing a tdp_mmu_map_set_spte_atomic() which is only called during
a TDP page fault and has access to the kvm_vcpu for fast slot lookups.

This improves "Populate memory time" in dirty_log_perf_test by 5%:

Command                         | Before           | After
------------------------------- | ---------------- | -------------
./dirty_log_perf_test -v64 -x64 | 5.472321072s     | 5.169832886s

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210804222844.1419481-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06 07:52:29 -04:00
Jakub Kicinski
0ca8d3ca45 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Build failure in drivers/net/wwan/mhi_wwan_mbim.c:
add missing parameter (0, assuming we don't want buffer pre-alloc).

Conflict in drivers/net/dsa/sja1105/sja1105_main.c between:
  589918df93 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
  0fac6aa098 ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode")

Follow the instructions from the commit message of the former commit
- removed the if conditions. When looking at commit 589918df93 ("net:
dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
note that the mask_iotag fields get removed by the following patch.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-05 15:08:47 -07:00
Linus Torvalds
97fcc07be8 Mostly bugfixes; plus, support for XMM arguments to Hyper-V hypercalls
now obeys KVM_CAP_HYPERV_ENFORCE_CPUID.  Both the XMM arguments feature
 and KVM_CAP_HYPERV_ENFORCE_CPUID are new in 5.14, and each did not know
 of the other.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmELlMoUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOIigf+O1JWYyfVR16qpVnkI9voa0j7aPKd
 0v3Qd0ybEbBqOl2QjYFe+aJa42d8bhiMGRG7UNNxYPmW6MukxM3rWZ0wd8IvlktW
 KMR/+2jPw8v8B+M44ty9y30cdW65EGxY8PXPrzSXSF9wT5v0fV8s14Yzxkc1edKM
 X3OVJk1AhlRaGN9PyGl3MJxdvQG7Ci9Ep500i5vhAsdb2Azu9TzfQRVUUJ8c2BWj
 PEI31Z+E0//G0oU/MiHZ5bsYCboiciiHHJxpdFpxVYN7rQ4sOxJswEsO+PpD00K3
 HsZXuTXCRQyZfjbI5QlwNWpU6mZrAb3T8GVNBOP+0g3+fDNyrZzJBiCzbQ==
 =bI5h
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Mostly bugfixes; plus, support for XMM arguments to Hyper-V hypercalls
  now obeys KVM_CAP_HYPERV_ENFORCE_CPUID.

  Both the XMM arguments feature and KVM_CAP_HYPERV_ENFORCE_CPUID are
  new in 5.14, and each did not know of the other"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds
  KVM: selftests: fix hyperv_clock test
  KVM: SVM: improve the code readability for ASID management
  KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB
  KVM: Do not leak memory for duplicate debugfs directories
  KVM: selftests: Test access to XMM fast hypercalls
  KVM: x86: hyper-v: Check if guest is allowed to use XMM registers for hypercall input
  KVM: x86: Introduce trace_kvm_hv_hypercall_done()
  KVM: x86: hyper-v: Check access to hypercall before reading XMM registers
  KVM: x86: accept userspace interrupt only if no event is injected
2021-08-05 11:23:09 -07:00
H. Nikolaus Schaller
fa953adfad x86/tools/relocs: Fix non-POSIX regexp
Trying to run a cross-compiled x86 relocs tool on a BSD based
HOSTCC leads to errors like

  VOFFSET arch/x86/boot/compressed/../voffset.h - due to: vmlinux
  CC      arch/x86/boot/compressed/misc.o - due to: arch/x86/boot/compressed/../voffset.h
  OBJCOPY arch/x86/boot/compressed/vmlinux.bin - due to: vmlinux
  RELOCS  arch/x86/boot/compressed/vmlinux.relocs - due to: vmlinux
empty (sub)expressionarch/x86/boot/compressed/Makefile:118: recipe for target 'arch/x86/boot/compressed/vmlinux.relocs' failed
make[3]: *** [arch/x86/boot/compressed/vmlinux.relocs] Error 1

It turns out that relocs.c uses patterns like

	"something(|_end)"

This is not valid syntax or gives undefined results according
to POSIX 9.5.3 ERE Grammar

	https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

It seems to be silently accepted by the Linux regexp() implementation
while a BSD host complains.

Such patterns can be replaced by a transformation like

	"(|p1|p2)" -> "(p1|p2)?"

Fixes: fd95281530 ("x86-32, relocs: Whitelist more symbols for ld bug workaround")
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-08-05 20:51:11 +09:00
Sean Christopherson
d5aaad6f83 KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds
Take a signed 'long' instead of an 'unsigned long' for the number of
pages to add/subtract to the total number of pages used by the MMU.  This
fixes a zero-extension bug on 32-bit kernels that effectively corrupts
the per-cpu counter used by the shrinker.

Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
an unsigned 32-bit value on 32-bit kernels.  As a result, the value used
to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
4294967295 and effectively corrupts the counter.

This was found by a staggering amount of sheer dumb luck when running
kvm-unit-tests on a 32-bit KVM build.  The shrinker just happened to kick
in while running tests and do_shrink_slab() logged an error about trying
to free a negative number of objects.  The truly lucky part is that the
kernel just happened to be a slightly stale build, as the shrinker no
longer yells about negative objects as of commit 18bb473e50 ("mm:
vmscan: shrink deferred objects proportional to priority").

 vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460

Fixes: bc8a3d8925 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210804214609.1096003-1-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-05 03:33:56 -04:00
Paolo Bonzini
319afe6856 KVM: xen: do not use struct gfn_to_hva_cache
gfn_to_hva_cache is not thread-safe, so it is usually used only within
a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
implementation has such a cache in kvm->arch, but it is not really
used except to store the location of the shared info page.  Replace
shinfo_set and shinfo_cache with just the value that is passed via
KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
initialization value is not zero anymore and therefore kvm_xen_init_vm
needs to be introduced.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-05 03:31:40 -04:00
Mingwei Zhang
bb2baeb214 KVM: SVM: improve the code readability for ASID management
KVM SEV code uses bitmaps to manage ASID states. ASID 0 was always skipped
because it is never used by VM. Thus, in existing code, ASID value and its
bitmap postion always has an 'offset-by-1' relationship.

Both SEV and SEV-ES shares the ASID space, thus KVM uses a dynamic range
[min_asid, max_asid] to handle SEV and SEV-ES ASIDs separately.

Existing code mixes the usage of ASID value and its bitmap position by
using the same variable called 'min_asid'.

Fix the min_asid usage: ensure that its usage is consistent with its name;
allocate extra size for ASID 0 to ensure that each ASID has the same value
with its bitmap position. Add comments on ASID bitmap allocation to clarify
the size change.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Marc Orr <marcorr@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Alper Gun <alpergun@google.com>
Cc: Dionna Glaze <dionnaglaze@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vipin Sharma <vipinsh@google.com>
Cc: Peter Gonda <pgonda@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Message-Id: <20210802180903.159381-1-mizhang@google.com>
[Fix up sev_asid_free to also index by ASID, as suggested by Sean
 Christopherson, and use nr_asids in sev_cpu_init. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 09:43:03 -04:00
Like Xu
df51fe7ea1 perf/x86/amd: Don't touch the AMD64_EVENTSEL_HOSTONLY bit inside the guest
If we use "perf record" in an AMD Milan guest, dmesg reports a #GP
warning from an unchecked MSR access error on MSR_F15H_PERF_CTLx:

  [] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000110076) at rIP: 0xffffffff8106ddb4 (native_write_msr+0x4/0x20)
  [] Call Trace:
  []  amd_pmu_disable_event+0x22/0x90
  []  x86_pmu_stop+0x4c/0xa0
  []  x86_pmu_del+0x3a/0x140

The AMD64_EVENTSEL_HOSTONLY bit is defined and used on the host,
while the guest perf driver should avoid such use.

Fixes: 1018faa6cf ("perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM disabled")
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://lkml.kernel.org/r/20210802070850.35295-1-likexu@tencent.com
2021-08-04 15:16:34 +02:00
Peter Zijlstra
f4b4b45652 perf/x86: Fix out of bound MSR access
On Wed, Jul 28, 2021 at 12:49:43PM -0400, Vince Weaver wrote:
> [32694.087403] unchecked MSR access error: WRMSR to 0x318 (tried to write 0x0000000000000000) at rIP: 0xffffffff8106f854 (native_write_msr+0x4/0x20)
> [32694.101374] Call Trace:
> [32694.103974]  perf_clear_dirty_counters+0x86/0x100

The problem being that it doesn't filter out all fake counters, in
specific the above (erroneously) tries to use FIXED_BTS. Limit the
fixed counters indexes to the hardware supplied number.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Like Xu <likexu@tencent.com>
Link: https://lkml.kernel.org/r/YQJxka3dxgdIdebG@hirez.programming.kicks-ass.net
2021-08-04 15:16:33 +02:00
Praveen Kumar
e5d9b714fe x86/hyperv: fix root partition faults when writing to VP assist page MSR
For root partition the VP assist pages are pre-determined by the
hypervisor. The root kernel is not allowed to change them to
different locations. And thus, we are getting below stack as in
current implementation root is trying to perform write to specific
MSR.

    [ 2.778197] unchecked MSR access error: WRMSR to 0x40000073 (tried to write 0x0000000145ac5001) at rIP: 0xffffffff810c1084 (native_write_msr+0x4/0x30)
    [ 2.784867] Call Trace:
    [ 2.791507] hv_cpu_init+0xf1/0x1c0
    [ 2.798144] ? hyperv_report_panic+0xd0/0xd0
    [ 2.804806] cpuhp_invoke_callback+0x11a/0x440
    [ 2.811465] ? hv_resume+0x90/0x90
    [ 2.818137] cpuhp_issue_call+0x126/0x130
    [ 2.824782] __cpuhp_setup_state_cpuslocked+0x102/0x2b0
    [ 2.831427] ? hyperv_report_panic+0xd0/0xd0
    [ 2.838075] ? hyperv_report_panic+0xd0/0xd0
    [ 2.844723] ? hv_resume+0x90/0x90
    [ 2.851375] __cpuhp_setup_state+0x3d/0x90
    [ 2.858030] hyperv_init+0x14e/0x410
    [ 2.864689] ? enable_IR_x2apic+0x190/0x1a0
    [ 2.871349] apic_intr_mode_init+0x8b/0x100
    [ 2.878017] x86_late_time_init+0x20/0x30
    [ 2.884675] start_kernel+0x459/0x4fb
    [ 2.891329] secondary_startup_64_no_verify+0xb0/0xbb

Since the hypervisor already provides the VP assist pages for root
partition, we need to memremap the memory from hypervisor for root
kernel to use. The mapping is done in hv_cpu_init during bringup and is
unmapped in hv_cpu_die during teardown.

Signed-off-by: Praveen Kumar <kumarpraveen@linux.microsoft.com>
Reviewed-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Link: https://lore.kernel.org/r/20210731120519.17154-1-kumarpraveen@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-08-04 11:56:53 +00:00
Sean Christopherson
179c6c27bf KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB
Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
freeing an SEV ASID.  The consumer, pre_sev_run(), indexes the array by
the raw ASID, thus KVM could get a false negative when checking for a
different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
a new VM.

Note, this cannot cause a functional issue _in the current code_, as
pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
guaranteed to mismatch on the first VMRUN.  However, prior to commit
8a14fe4f0c ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
last_cpu variable.  Thus it's theoretically possible that older versions
of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
and VMCB exactly match those of a prior VM.

Fixes: 70cd94e60c ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 06:02:09 -04:00
Like Xu
e79f49c37c KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces
Based on our observations, after any vm-exit associated with vPMU, there
are at least two or more perf interfaces to be called for guest counter
emulation, such as perf_event_{pause, read_value, period}(), and each one
will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes
more severe when guest use counters in a multiplexed manner.

Holding a lock once and completing the KVM request operations in the perf
context would introduce a set of impractical new interfaces. So we can
further optimize the vPMU implementation by avoiding repeated calls to
these interfaces in the KVM context for at least one pattern:

After we call perf_event_pause() once, the event will be disabled and its
internal count will be reset to 0. So there is no need to pause it again
or read its value. Once the event is paused, event period will not be
updated until the next time it's resumed or reprogrammed. And there is
also no need to call perf_event_period twice for a non-running counter,
considering the perf_event for a running counter is never paused.

Based on this implementation, for the following common usage of
sampling 4 events using perf on a 4u8g guest:

  echo 0 > /proc/sys/kernel/watchdog
  echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
  echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
  echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
  for i in `seq 1 1 10`
  do
  taskset -c 0 perf record \
  -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
  /root/br_instr a
  done

the average latency of the guest NMI handler is reduced from
37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
Also, in addition to collecting more samples, no loss of sampling
accuracy was observed compared to before the optimization.

Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20210728120705.6855-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2021-08-04 05:55:56 -04:00
Peter Xu
a75b540451 KVM: X86: Optimize zapping rmap
Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
slow.  The easy way is to travers the rmap list, collecting the a/d bits and
free the slots along the way.

Provide a pte_list_destroy() and do exactly that.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220605.26377-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 05:55:56 -04:00
Peter Xu
13236e25eb KVM: X86: Optimize pte_list_desc with per-array counter
Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
logic.  E.g., we don't need to loop over the array any more for most reasons.

This will make more sense after we've switched the array size to be larger
otherwise the counter will be a waste.

Initially I wanted to store a tail pointer at the head of the array list so we
don't need to traverse the list at least for pushing new ones (if without the
counter we traverse both the list and the array).  However that'll need
slightly more change without a huge lot benefit, e.g., after we grow entry
numbers per array the list traversing is not so expensive.

So let's be simple but still try to get as much benefit as we can with just
these extra few lines of changes (not to mention the code looks easier too
without looping over arrays).

I used the same a test case to fork 500 child and recycle them ("./rmap_fork
500" [1]), this patch further speeds up the total fork time of about 4%, which
is a total of 33% of vanilla kernel:

        Vanilla:      473.90 (+-5.93%)
        3->15 slots:  366.10 (+-4.94%)
        Add counter:  351.00 (+-3.70%)

[1] 825436f825

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220602.26327-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 05:55:56 -04:00
Peter Xu
dc1cff9691 KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger
Currently rmap array element only contains 3 entries.  However for EPT=N there
could have a lot of guest pages that got tens of even hundreds of rmap entry.

A normal distribution of a 6G guest (even if idle) shows this with rmap count
statistics:

Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
Level=1G:       32      0       0       0       0       0       0       0       0       0       0

If we do some more fork some pages will grow even larger rmap counts.

This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general
use case of EPT=N as we do list reference less and the loops over PTE_LIST_EXT
will be slightly more efficient; but still not too large so less waste when
array not full.

It should not affecting EPT=Y since EPT normally only has zero or one rmap
entry for each page, so no array is even allocated.

With a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]),
this patch speeds up fork time of about 29%.

    Before: 473.90 (+-5.93%)
    After:  366.10 (+-4.94%)

[1] 825436f825

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-6-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04 05:55:56 -04:00
Vitaly Kuznetsov
4e62aa96d6 KVM: x86: hyper-v: Check if guest is allowed to use XMM registers for hypercall input
TLFS states that "Availability of the XMM fast hypercall interface is
indicated via the “Hypervisor Feature Identification” CPUID Leaf
(0x40000003, see section 2.4.4) ... Any attempt to use this interface
when the hypervisor does not indicate availability will result in a #UD
fault."

Implement the check for 'strict' mode (KVM_CAP_HYPERV_ENFORCE_CPUID).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210730122625.112848-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03 06:16:40 -04:00
Vitaly Kuznetsov
f5714bbb5b KVM: x86: Introduce trace_kvm_hv_hypercall_done()
Hypercall failures are unusual with potentially far going consequences
so it would be useful to see their results when tracing.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210730122625.112848-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03 06:16:40 -04:00
Vitaly Kuznetsov
2e2f1e8d04 KVM: x86: hyper-v: Check access to hypercall before reading XMM registers
In case guest doesn't have access to the particular hypercall we can avoid
reading XMM registers.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210730122625.112848-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03 06:16:40 -04:00
Hamza Mahfooz
269e9552d2 KVM: const-ify all relevant uses of struct kvm_memory_slot
As alluded to in commit f36f3f2846 ("KVM: add "new" argument to
kvm_arch_commit_memory_region"), a bunch of other places where struct
kvm_memory_slot is used, needs to be refactored to preserve the
"const"ness of struct kvm_memory_slot across-the-board.

Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
[Do not touch body of slot_rmap_walk_init. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03 06:04:24 -04:00
Paolo Bonzini
db105fab8d KVM: nSVM: remove useless kvm_clear_*_queue
For an event to be in injected state when nested_svm_vmrun executes,
it must have come from exitintinfo when svm_complete_interrupts ran:

  vcpu_enter_guest
   static_call(kvm_x86_run) -> svm_vcpu_run
    svm_complete_interrupts
     // now the event went from "exitintinfo" to "injected"
   static_call(kvm_x86_handle_exit) -> handle_exit
    svm_invoke_exit_handler
      vmrun_interception
       nested_svm_vmrun

However, no event could have been in exitintinfo before a VMRUN
vmexit.  The code in svm.c is a bit more permissive than the one
in vmx.c:

        if (is_external_interrupt(svm->vmcb->control.exit_int_info) &&
            exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR &&
            exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH &&
            exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI)

but in any case, a VMRUN instruction would not even start to execute
during an attempted event delivery.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:02:00 -04:00
Sean Christopherson
4c72ab5aa6 KVM: x86: Preserve guest's CR0.CD/NW on INIT
Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
defined by both Intel's SDM and AMD's APM.

Note, current versions of Intel's SDM are very poorly written with
respect to INIT behavior.  Table 9-1. "IA-32 and Intel 64 Processor
States Following Power-up, Reset, or INIT" quite clearly lists power-up,
RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1.  But the SDM
then attempts to qualify CD/NW behavior in a footnote:

  2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
     are cleared.

Presumably that footnote is only meant for INIT, as the RESET case and
especially the power-up case are rather non-sensical.  Another footnote
all but confirms that:

  6. Internal caches are invalid after power-up and RESET, but left
     unchanged with an INIT.

Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
else can hack their BIOS to check RESET and power-up :-D).

Reported-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-47-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:02:00 -04:00
Sean Christopherson
46f4898b20 KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESET
Drop redundant clears of vcpu->arch.hflags in init_vmcb() since
kvm_vcpu_reset() always clears hflags, and it is also always
zero at vCPU creation time.  And of course, the second clearing
in init_vmcb() was always redundant.

Suggested-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-46-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:59 -04:00
Sean Christopherson
265e43530c KVM: SVM: Emulate #INIT in response to triple fault shutdown
Emulate a full #INIT instead of simply initializing the VMCB if the
guest hits a shutdown.  Initializing the VMCB but not other vCPU state,
much of which is mirrored by the VMCB, results in incoherent and broken
vCPU state.

Ideally, KVM would not automatically init anything on shutdown, and
instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force
userspace to explicitly INIT or RESET the vCPU.  Even better would be to
add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown
(and SMI on Intel CPUs).

But, that ship has sailed, and emulating #INIT is the next best thing as
that has at least some connection with reality since there exist bare
metal platforms that automatically INIT the CPU if it hits shutdown.

Fixes: 46fe4ddd9d ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-45-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:59 -04:00
Sean Christopherson
e54949408a KVM: VMX: Move RESET-only VMWRITE sequences to init_vmcs()
Move VMWRITE sequences in vmx_vcpu_reset() guarded by !init_event into
init_vmcs() to make it more obvious that they're, uh, initializing the
VMCS.

No meaningful functional change intended (though the order of VMWRITEs
and whatnot is different).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-44-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:59 -04:00
Sean Christopherson
7aa13fc3d8 KVM: VMX: Remove redundant write to set vCPU as active at RESET/INIT
Drop a call to vmx_clear_hlt() during vCPU INIT, the guest's activity
state is unconditionally set to "active" a few lines earlier in
vmx_vcpu_reset().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-43-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:59 -04:00
Sean Christopherson
84ec8d2d53 KVM: VMX: Smush x2APIC MSR bitmap adjustments into single function
Consolidate all of the dynamic MSR bitmap adjustments into
vmx_update_msr_bitmap_x2apic(), and rename the mode tracker to reflect
that it is x2APIC specific.  If KVM gains more cases of dynamic MSR
pass-through, odds are very good that those new cases will be better off
with their own logic, e.g. see Intel PT MSRs and MSR_IA32_SPEC_CTRL.

Attempting to handle all updates in a common helper did more harm than
good, as KVM ended up collecting a large number of useless "updates".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-42-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:58 -04:00
Sean Christopherson
e7c701dd7a KVM: VMX: Remove unnecessary initialization of msr_bitmap_mode
Don't bother initializing msr_bitmap_mode to 0, all of struct vcpu_vmx is
zero initialized.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-41-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:58 -04:00
Sean Christopherson
002f87a41e KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is changed
Drop an explicit call to update the x2APIC MSRs when the userspace MSR
filter is modified.  The x2APIC MSRs are deliberately exempt from
userspace filtering.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-40-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:58 -04:00
Sean Christopherson
284036c644 KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions
Drop unnecessary MSR bitmap updates during nested transitions, as L1's
APIC_BASE MSR is not modified by the standard VM-Enter/VM-Exit flows,
and L2's MSR bitmap is managed separately.  In the unlikely event that L1
is pathological and loads APIC_BASE via the VM-Exit load list, KVM will
handle updating the bitmap in its normal WRMSR flows.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-39-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:58 -04:00
Sean Christopherson
9e4784e19d KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INIT
Remove an unnecessary MSR bitmap refresh during vCPU RESET/INIT.  In both
cases, the MSR bitmap already has the desired values and state.

At RESET, the vCPU is guaranteed to be running with x2APIC disabled, the
x2APIC MSRs are guaranteed to be intercepted due to the MSR bitmap being
initialized to all ones by alloc_loaded_vmcs(), and vmx->msr_bitmap_mode
is guaranteed to be zero, i.e. reflecting x2APIC disabled.

At INIT, the APIC_BASE MSR is not modified, thus there can't be any
change in x2APIC state.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-38-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:57 -04:00
Sean Christopherson
f39e805ee1 KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86
Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to
common x86.  VMX and SVM now have near-identical sequences, the only
difference being that VMX updates the exception bitmap.  Updating the
bitmap on SVM is unnecessary, but benign.  Unfortunately it can't be left
behind in VMX due to the need to update exception intercepts after the
control registers are set.

Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-37-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:57 -04:00
Sean Christopherson
c5c9f920f7 KVM: VMX: Don't _explicitly_ reconfigure user return MSRs on vCPU INIT
When emulating vCPU INIT, do not unconditionally refresh the list of user
return MSRs that need to be loaded into hardware when running the guest.
Unconditionally refreshing the list is confusing, as the vast majority of
MSRs are not modified on INIT.  The real motivation is to handle the case
where an INIT during long mode obviates the need to load the SYSCALL MSRs,
and that is handled as needed by vmx_set_efer().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-36-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:57 -04:00