* pm-cpufreq: (24 commits)
cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
cpufreq: intel_pstate: Rearrange the storing of new EPP values
cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported
cpufreq: intel_pstate: Clean up aperf_mperf_shift description
cpufreq: powernv: Make some symbols static
cpufreq: amd_freq_sensitivity: Mark sometimes used ID structs as __maybe_unused
cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift()
cpufreq: pcc-cpufreq: Mark sometimes used ID structs as __maybe_unused
cpufreq: powernow-k8: Mark 'hi' and 'lo' dummy variables as __always_unused
cpufreq: acpi-cpufreq: Mark sometimes used ID structs as __maybe_unused
cpufreq: acpi-cpufreq: Mark 'dummy' variable as __always_unused
cpufreq: powernv-cpufreq: Fix a bunch of kerneldoc related issues
cpufreq: pasemi: Include header file for {check,restore}_astate prototypes
cpufreq: cpufreq_governor: Demote store_sampling_rate() header to standard comment block
cpufreq: cpufreq: Demote lots of function headers unworthy of kerneldoc status
cpufreq: freq_table: Demote obvious misuse of kerneldoc to standard comment blocks
cpufreq: Replace HTTP links with HTTPS ones
cpufreq: intel_pstate: Fix static checker warning for epp variable
cpufreq: Remove the weakly defined cpufreq_default_governor()
cpufreq: Specify default governor on command line
...
Conflicts:
arch/arm/include/asm/percpu.h
As Stephen Rothwell noted, there's a conflict between this commit
in locking/core:
a21ee6055c ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables")
and this fresh upstream commit:
aa54ea903a ("ARM: percpu.h: fix build error")
a21ee6055c is a simpler solution to the dependency problem and doesn't
further increase header hell - so this conflict resolution effectively
reverts aa54ea903a and uses the a21ee6055c solution.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Add support for zstd compressed kernel
- Define __DISABLE_EXPORTS in Makefile
- Remove __DISABLE_EXPORTS definition from kaslr.c
- Bump the heap size for zstd.
- Update the documentation.
Integrates the ZSTD decompression code to the x86 pre-boot code.
Zstandard requires slightly more memory during the kernel decompression
on x86 (192 KB vs 64 KB), and the memory usage is independent of the
window size.
__DISABLE_EXPORTS is now defined in the Makefile, which covers both
the existing use in kaslr.c, and the use needed by the zstd decompressor
in misc.c.
This patch has been boot tested with both a zstd and gzip compressed
kernel on i386 and x86_64 using buildroot and QEMU.
Additionally, this has been tested in production on x86_64 devices.
We saw a 2 second boot time reduction by switching kernel compression
from xz to zstd.
Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200730190841.2071656-7-nickrterrell@gmail.com
Capture the max TDP level during kvm_configure_mmu() instead of using a
kvm_x86_ops hook to do it at every vCPU creation.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calculate the desired TDP level on the fly using the max TDP level and
MAXPHYADDR instead of doing the same when CPUID is updated. This avoids
the hidden dependency on cpuid_maxphyaddr() in vmx_get_tdp_level() and
also standardizes the "use 5-level paging iff MAXPHYADDR > 48" behavior
across x86.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the shadow_root_level from the current MMU as the root level for the
PGD, i.e. for VMX's EPTP. This eliminates the weird dependency between
VMX and the MMU where both must independently calculate the same root
level for things to work correctly. Temporarily keep VMX's calculation
of the level and use it to WARN if the incoming level diverges.
Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell,
and rename a 'cr3' param in the load_mmu_pgd prototype that managed to
survive the cr3 purge.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch moves ATOMIC_INIT from asm/atomic.h into linux/types.h.
This allows users of atomic_t to use ATOMIC_INIT without having to
include atomic.h as that way may lead to header loops.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20200729123105.GB7047@gondor.apana.org.au
All instances of ->get() in arch/x86 switched; that might or might
not be worth splitting up. Notes:
* for xstateregs_get() the amount we want to store is determined at
the boot time; see init_xstate_size() and update_regset_xstate_info() for
details. task->thread.fpu.state.xsave ends with a flexible array member and
the amount of data in it depends upon the FPU features supported/enabled.
* fpregs_get() writes slightly less than full ->thread.fpu.state.fsave
(the last word is not copied); we pass the full size of state.fsave and let
membuf_write() trim to the amount declared by regset - __regset_get() will
make sure that the space in buffer is no more than that.
* copy_xstate_to_user() and its helpers are gone now.
* fpregs_soft_get() was getting user_regset_copyout() arguments
wrong. Since "x86: x86 user_regset math_emu" back in 2008... I really
doubt that it's worth splitting out for -stable, though - you need
a 486SX box for that to trigger...
[Kevin's braino fix for copy_xstate_to_kernel() essentially duplicated here]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
been defined on any architecture since 2010
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of having #ifdef/#endif blocks inside sync_core() for X86_64 and
X86_32, implement the new function iret_to_self() with two versions.
In this manner, avoid having to use even more more #ifdef/#endif blocks
when adding support for SERIALIZE in sync_core().
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200727043132.15082-4-ricardo.neri-calderon@linux.intel.com
Having sync_core() in processor.h is problematic since it is not possible
to check for hardware capabilities via the *cpu_has() family of macros.
The latter needs the definitions in processor.h.
It also looks more intuitive to relocate the function to sync_core.h.
This changeset does not make changes in functionality.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200727043132.15082-3-ricardo.neri-calderon@linux.intel.com
The Intel architecture defines a set of Serializing Instructions (a
detailed definition can be found in Vol.3 Section 8.3 of the Intel "main"
manual, SDM). However, these instructions do more than what is required,
have side effects and/or may be rather invasive. Furthermore, some of
these instructions are only available in kernel mode or may cause VMExits.
Thus, software using these instructions only to serialize execution (as
defined in the manual) must handle the undesired side effects.
As indicated in the name, SERIALIZE is a new Intel architecture
Serializing Instruction. Crucially, it does not have any of the mentioned
side effects. Also, it does not cause VMExit and can be used in user mode.
This new instruction is currently documented in the latest "extensions"
manual (ISE). It will appear in the "main" manual in the future.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20200727043132.15082-2-ricardo.neri-calderon@linux.intel.com
The function is only called from within init_64.c and can be static.
Also remove it from pgtable_64.h.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/20200721095953.6218-4-joro@8bytes.org
Remove the code to sync the vmalloc and ioremap ranges for x86-64. The
page-table pages are all pre-allocated now so that synchronization is
no longer necessary.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/20200721095953.6218-3-joro@8bytes.org
Resolve conflicts with ongoing lockdep work that fixed the NMI entry code.
Conflicts:
arch/x86/entry/common.c
arch/x86/include/asm/idtentry.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
AFAICS the last uses of directly 'making' kernel PGDs was removed 7 years ago:
8b78c21d72: ("x86, 64bit, mm: hibernate use generic mapping_init")
Where the explicit PGD walking loop was replaced with kernel_ident_mapping_init()
calls. This was then (unnecessarily) carried over through the 5-level paging conversion.
Also clean up the 'level' comments a bit, to convey the original, meanwhile somewhat
bit-rotten notion, that these are empty comment blocks with no methods to handle any
of the levels except the PTE level.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200724114418.629021-4-mingo@kernel.org
Last use of them was removed 13 years ago, when the code was converted
to use CYC2NS_SCALE_FACTOR:
53d517cdba: ("x86: scale cyc_2_nsec according to CPU frequency")
The current TSC code uses the 'struct cyc2ns_data' scaling abstraction,
the old fixed scaling approach is long gone.
This cleanup also removes the 'arbitralrily' typo from the comment,
so win-win. ;-)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200724114418.629021-3-mingo@kernel.org
Replace the x86 variant with the generic version. Provide the relevant
architecture specific helper functions and defines.
Use a temporary define for idtentry_exit_user which will be cleaned up
seperately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.494648601@linutronix.de
Replace the syscall entry work handling with the generic version. Provide
the necessary helper inlines to handle the real architecture specific
parts, e.g. ptrace.
Use a temporary define for idtentry_enter_user which will be cleaned up
seperately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.376213694@linutronix.de
As a preparatory step for moving the syscall and interrupt entry/exit
handling into generic code, provide a pt_regs helper which retrieves the
interrupt state from pt_regs. This is required to check whether interrupts
are reenabled by return from interrupt/exception.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220520.258511584@linutronix.de
Linus pointed out that compiler.h - which is a key header that gets included in every
single one of the 28,000+ kernel files during a kernel build - was bloated in:
6553896666: ("vmlinux.lds.h: Create section for protection against instrumentation")
Linus noted:
> I have pulled this, but do we really want to add this to a header file
> that is _so_ core that it gets included for basically every single
> file built?
>
> I don't even see those instrumentation_begin/end() things used
> anywhere right now.
>
> It seems excessive. That 53 lines is maybe not a lot, but it pushed
> that header file to over 12kB, and while it's mostly comments, it's
> extra IO and parsing basically for _every_ single file compiled in the
> kernel.
>
> For what appears to be absolutely zero upside right now, and I really
> don't see why this should be in such a core header file!
Move these primitives into a new header: <linux/instrumentation.h>, and include that
header in the headers that make use of it.
Unfortunately one of these headers is asm-generic/bug.h, which does get included
in a lot of places, similarly to compiler.h. So the de-bloating effect isn't as
good as we'd like it to be - but at least the interfaces are defined separately.
No change to functionality intended.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200604071921.GA1361070@gmail.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
The macro is not used anywhere, and has an incorrect value (going by the
comment) on x86_64 since commit c898faf91b ("x86: 46 bit physical address
support on 64 bits")
To avoid confusion, just remove the definition.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200723231544.17274-2-nivedita@alum.mit.edu
Clang fails to compile __get_user_size() on 32-bit for the following code:
long long val;
__get_user(val, usrptr);
with: error: invalid output size for constraint '=q'
GCC compiles the same code without complaints.
The reason is that GCC and Clang are architecturally different, which leads
to subtle issues for code that's invalid but clearly dead, i.e. with code
that emulates polymorphism with the preprocessor and sizeof.
GCC will perform semantic analysis after early inlining and dead code
elimination, so it will not warn on invalid code that's dead. Clang
strictly performs optimizations after semantic analysis, so it will warn
for dead code.
Neither Clang nor GCC like this very much with -m32:
long long ret;
asm ("movb $5, %0" : "=q" (ret));
However, GCC can tolerate this variant:
long long ret;
switch (sizeof(ret)) {
case 1:
asm ("movb $5, %0" : "=q" (ret));
break;
case 8:;
}
Clang, on the other hand, won't accept that because it validates the inline
asm for the '1' case before the optimisation phase where it realises that
it wouldn't have to emit it anyway.
If LLVM (Clang's "back end") fails such as during instruction selection or
register allocation, it cannot provide accurate diagnostics (warnings /
errors) that contain line information, as the AST has been discarded from
memory at that point.
While there have been early discussions about having C/C++ specific
language optimizations in Clang via the use of MLIR, which would enable
such earlier optimizations, such work is not scoped and likely a multi-year
endeavor.
It was discussed to change the asm output constraint for the one byte case
from "=q" to "=r". While it works for 64-bit, it fails on 32-bit. With '=r'
the compiler could fail to chose a register accessible as high/low which is
required for the byte operation. If that happens the assembly will fail.
Use a local temporary variable of type 'unsigned char' as output for the
byte copy inline asm and then assign it to the real output variable. This
prevents Clang from failing the semantic analysis in the above case.
The resulting code for the actual one byte copy is not affected as the
temporary variable is optimized out.
[ tglx: Amended changelog ]
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: David Woodhouse <dwmw2@infradead.org>
Reported-by: Dmitry Golovin <dima@golovin.in>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://bugs.llvm.org/show_bug.cgi?id=33587
Link: https://github.com/ClangBuiltLinux/linux/issues/3
Link: https://github.com/ClangBuiltLinux/linux/issues/194
Link: https://github.com/ClangBuiltLinux/linux/issues/781
Link: https://lore.kernel.org/lkml/20180209161833.4605-1-dwmw2@infradead.org/
Link: https://lore.kernel.org/lkml/CAK8P3a1EBaWdbAEzirFDSgHVJMtWjuNt2HGG8z+vpXeNHwETFQ@mail.gmail.com/
Link: https://lkml.kernel.org/r/20200720204925.3654302-12-ndesaulniers@google.com
Also remove now unused __percpu_mov_op.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-11-ndesaulniers@google.com
Use __pcpu_size_call_return() to simplify this_cpu_read_stable().
Also remove __bad_percpu_size() which is now unused.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-10-ndesaulniers@google.com
The core percpu macros already have a switch on the data size, so the switch
in the x86 code is redundant and produces more dead code.
Also use appropriate types for the width of the instructions. This avoids
errors when compiling with Clang.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-9-ndesaulniers@google.com
The core percpu macros already have a switch on the data size, so the switch
in the x86 code is redundant and produces more dead code.
Also use appropriate types for the width of the instructions. This avoids
errors when compiling with Clang.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-8-ndesaulniers@google.com
The core percpu macros already have a switch on the data size, so the switch
in the x86 code is redundant and produces more dead code.
Also use appropriate types for the width of the instructions. This avoids
errors when compiling with Clang.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-7-ndesaulniers@google.com
The "e" constraint represents a constant, but the XADD instruction doesn't
accept immediate operands.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-6-ndesaulniers@google.com
The core percpu macros already have a switch on the data size, so the switch
in the x86 code is redundant and produces more dead code.
Also use appropriate types for the width of the instructions. This avoids
errors when compiling with Clang.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-5-ndesaulniers@google.com
The core percpu macros already have a switch on the data size, so the switch
in the x86 code is redundant and produces more dead code.
Also use appropriate types for the width of the instructions. This avoids
errors when compiling with Clang.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-4-ndesaulniers@google.com
The core percpu macros already have a switch on the data size, so the switch
in the x86 code is redundant and produces more dead code.
Also use appropriate types for the width of the instructions. This avoids
errors when compiling with Clang.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-3-ndesaulniers@google.com
In preparation for cleaning up the percpu operations, define macros for
abstraction based on the width of the operation.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Link: https://lkml.kernel.org/r/20200720204925.3654302-2-ndesaulniers@google.com
show_trace_log_lvl() provides x86 platform-specific way to unwind
backtrace with a given log level. Unfortunately, registers dump(s) are
not printed with the same log level - instead, KERN_DEFAULT is always
used.
Arista's switches uses quite common setup with rsyslog, where only
urgent messages goes to console (console_log_level=KERN_ERR), everything
else goes into /var/log/ as the console baud-rate often is indecently
slow (9600 bps).
Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where registers dump shouldn't appear with
higher log level).
Add log_lvl parameter to __show_regs().
Keep the used log level intact to separate visible change.
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://lkml.kernel.org/r/20200629144847.492794-3-dima@arista.com
show_trace_log_lvl() provides x86 platform-specific way to unwind
backtrace with a given log level. Unfortunately, registers dump(s) are
not printed with the same log level - instead, KERN_DEFAULT is always
used.
Arista's switches uses quite common setup with rsyslog, where only
urgent messages goes to console (console_log_level=KERN_ERR), everything
else goes into /var/log/ as the console baud-rate often is indecently
slow (9600 bps).
Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where registers dump shouldn't appear with
higher log level).
Add log_lvl parameter to show_iret_regs() as a preparation to add it
to __show_regs() and show_regs_if_on_stack().
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://lkml.kernel.org/r/20200629144847.492794-2-dima@arista.com
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl8VlckACgkQCF8+vY7k
4RWg9RAAhmGk/mYYAIAy6o2G8CPgcNnLrUuhXD+SyYrM25QlkinMjHXVhWiPo0F8
DKq21vXg0qlOe7kwn0vSEQ7tClAuJhK/ERnLnDFWe36lO6QF5ydllzEg9XS+xVq2
4Ur+jdkllaS8j7sjQrEdQF17HW52Sr0WB8KegoeUXfeKcGaIBmB+kfNstK2fWMa+
yuVITK/YNpr53zwwsHlMvmIndui0DJHKGGWuTknZsoQfWJtO62YW50FmkizjqXIt
+4DCRiq9Juc/MPYotP2pArwBFiqSl/uP0MeWqn4YzPXvhhhoGPTAhskBWDmZD8jT
XI9Q9Uj5x5kCtTLx4CI3mG2X2XrZ+GqFKZhVD0WKnZN2skdHRn/3qn6tHm7O/Rtr
s27MVKae5zSQs1XrCXwlEz03T5wZ1y1FCX3dlXGkQqbOQLQGbZTsE8wybPElzkLR
18e8xTA1Ss7tUYf7qcrEmMk242byXP8F43cStaWCwsbJgddBpV2jFFfxan3/j6q9
Kl6Y/ssRn6OrIy1UNpdCClRweLi0U53sFQqM/ZmyUbBXMbT/y+rL6h8Wel8EvFlR
PzisYBI/G0AWHahbdHhS9mzQFAZb9etlAboMvI31bo4Bjpbv9UVkX5dMLxabcbEr
B70EZ7nigg5EMqeBkemz0kD4W68qvQRKHK0d+TYcYSnq0VMmErI=
=/5rQ
-----END PGP SIGNATURE-----
Merge tag 'media/v5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into master
Pull media fixes from Mauro Carvalho Chehab:
"A series of fixes for the upcoming atomisp driver. They solve issues
when probing atomisp on devices with multiple cameras and get rid of
warnings when built with W=1.
The diffstat is a bit long, as this driver has several abstractions.
The patches that solved the issues with W=1 had to get rid of some
duplicated code (there used to have 2 versions of the same code, one
for ISP2401 and another one for ISP2400).
As this driver is not in 5.7, such changes won't cause regressions"
* tag 'media/v5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (38 commits)
Revert "media: atomisp: keep the ISP powered on when setting it"
media: atomisp: fix mask and shift operation on ISPSSPM0
media: atomisp: move system_local consts into a C file
media: atomisp: get rid of version-specific system_local.h
media: atomisp: move global stuff into a common header
media: atomisp: remove non-used 32-bits consts at system_local
media: atomisp: get rid of some unused static vars
media: atomisp: Fix error code in ov5693_probe()
media: atomisp: Replace trace_printk by pr_info
media: atomisp: Fix __func__ style warnings
media: atomisp: fix help message for ISP2401 selection
media: atomisp: i2c: atomisp-ov2680.c: fixed a brace coding style issue.
media: atomisp: make const arrays static, makes object smaller
media: atomisp: Clean up non-existing folders from Makefile
media: atomisp: Get rid of ACPI specifics in gmin_subdev_add()
media: atomisp: Provide Gmin subdev as parameter to gmin_subdev_add()
media: atomisp: Use temporary variable for device in gmin_subdev_add()
media: atomisp: Refactor PMIC detection to a separate function
media: atomisp: Deduplicate return ret in gmin_i2c_write()
media: atomisp: Make pointer to PMIC client global
...
- Fix the I/O bitmap invalidation on XEN PV, which was overlooked in the
recent ioperm/iopl rework. This caused the TSS and XEN's I/O bitmap to
get out of sync.
- Use the proper vectors for HYPERV.
- Make disabling of stack protector for the entry code work with GCC
builds which enable stack protector by default. Removing the option is
not sufficient, it needs an explicit -fno-stack-protector to shut it
off.
- Mark check_user_regs() noinstr as it is called from noinstr code. The
missing annotation causes it to be placed in the text section which
makes it instrumentable.
- Add the missing interrupt disable in exc_alignment_check()
- Fixup a XEN_PV build dependency in the 32bit entry code
- A few fixes to make the Clang integrated assembler happy
- Move EFI stub build to the right place for out of tree builds
- Make prepare_exit_to_usermode() static. It's not longer called from ASM
code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8UR+MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQCUD/4/9W5FFvdZvQPwmXsHPaVnW9hUsXxG
0tjc34xqDcgEl1U3khu+6jj+oHx+JM+4wGP/V49Wqx6xkrJ33/a8uYErAgI7+Pmp
s3T2gXMWkgJtYFlDQdAMbeuuM2cOFZJw4BxxvTMth5EixQvk1EkX6QyBjLaSGo8y
78sWtZ6Oh5Ql9ua/9TOilewLsCsQSFIFn0o/hawwwPUMrwGvD29scha0XHom+AO7
uwejfU8klq2HJJaLaaiUaiNBkFz0TNGJtY+3mQpw8BPjCuuBQhYygrS0X4uQzo01
4XJzhDnOVbAYWqi0/T+mAEmuJ9NBZJwYiYrwBYCkZgELwJKLzhzO2GOgP9xEsFY4
VUNgqHFhKrQp10k2k4L/A5tmr+0GntiCQhdZi+/gty6oO/t3ni57pRcAhA9qBNOb
8ZqumBwgaaAIqcmdtoyXAIveWOHnzwKEg6wmIGFbyCwHjeLJKJG7KhpXIpEuX+j2
DC7EfYvRB+jllAk1CBypBvzD0DHfMZ0myPxCcZiW2wHTVAlkpY7hiIyPHqocjE9L
OjOQ7FS6E2/p24lYVcLUFWcESxGFvQjjxwXk7htjpGUIZsQOhz/LOW+CIPCsfbqE
HoEsHmNltksYYV9FDfevXRp5sbxpx3wQSLOgqNqiOpy4cTCG8boalUqHQ0OsN8Oa
EgU067yF77ymRg==
=QAeH
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull x86 fixes from Thomas Gleixner:
"A pile of fixes for x86:
- Fix the I/O bitmap invalidation on XEN PV, which was overlooked in
the recent ioperm/iopl rework. This caused the TSS and XEN's I/O
bitmap to get out of sync.
- Use the proper vectors for HYPERV.
- Make disabling of stack protector for the entry code work with GCC
builds which enable stack protector by default. Removing the option
is not sufficient, it needs an explicit -fno-stack-protector to
shut it off.
- Mark check_user_regs() noinstr as it is called from noinstr code.
The missing annotation causes it to be placed in the text section
which makes it instrumentable.
- Add the missing interrupt disable in exc_alignment_check()
- Fixup a XEN_PV build dependency in the 32bit entry code
- A few fixes to make the Clang integrated assembler happy
- Move EFI stub build to the right place for out of tree builds
- Make prepare_exit_to_usermode() static. It's not longer called from
ASM code"
* tag 'x86-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Don't add the EFI stub to targets
x86/entry: Actually disable stack protector
x86/ioperm: Fix io bitmap invalidation on Xen PV
x86: math-emu: Fix up 'cmp' insn for clang ias
x86/entry: Fix vectors to IDTENTRY_SYSVEC for CONFIG_HYPERV
x86/entry: Add compatibility with IAS
x86/entry/common: Make prepare_exit_to_usermode() static
x86/entry: Mark check_user_regs() noinstr
x86/traps: Disable interrupts in exc_aligment_check()
x86/entry/32: Fix XEN_PV build dependency
tss_invalidate_io_bitmap() wasn't wired up properly through the pvop
machinery, so the TSS and Xen's io bitmap would get out of sync
whenever disabling a valid io bitmap.
Add a new pvop for tss_invalidate_io_bitmap() to fix it.
This is XSA-329.
Fixes: 22fe5b0439 ("x86/ioperm: Move TSS bitmap update to exit to user work")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/d53075590e1f91c19f8af705059d3ff99424c020.1595030016.git.luto@kernel.org
IOSF MBI header contains a lot of definitions, such as
end point addresses of IPs. Move CCK address from AtomISP driver
to generic header.
While here, drop unused one.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
With UV1 support removed, EFI_UV1_MEMMAP is no longer used.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lkml.kernel.org/r/20200713212956.019149227@hpe.com
When assembling with Clang via `make LLVM_IAS=1` and CONFIG_HYPERV enabled,
we observe the following error:
<instantiation>:9:6: error: expected absolute expression
.if HYPERVISOR_REENLIGHTENMENT_VECTOR == 3
^
<instantiation>:1:1: note: while in macro instantiation
idtentry HYPERVISOR_REENLIGHTENMENT_VECTOR asm_sysvec_hyperv_reenlightenment sysvec_hyperv_reenlightenment has_error_code=0
^
./arch/x86/include/asm/idtentry.h:627:1: note: while in macro instantiation
idtentry_sysvec HYPERVISOR_REENLIGHTENMENT_VECTOR sysvec_hyperv_reenlightenment;
^
<instantiation>:9:6: error: expected absolute expression
.if HYPERVISOR_STIMER0_VECTOR == 3
^
<instantiation>:1:1: note: while in macro instantiation
idtentry HYPERVISOR_STIMER0_VECTOR asm_sysvec_hyperv_stimer0 sysvec_hyperv_stimer0 has_error_code=0
^
./arch/x86/include/asm/idtentry.h:628:1: note: while in macro instantiation
idtentry_sysvec HYPERVISOR_STIMER0_VECTOR sysvec_hyperv_stimer0;
This is caused by typos in arch/x86/include/asm/idtentry.h:
HYPERVISOR_REENLIGHTENMENT_VECTOR -> HYPERV_REENLIGHTENMENT_VECTOR
HYPERVISOR_STIMER0_VECTOR -> HYPERV_STIMER0_VECTOR
For more details see ClangBuiltLinux issue #1088.
Fixes: a16be368dd ("x86/entry: Convert various hypervisor vectors to IDTENTRY_SYSVEC")
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1088
Link: https://github.com/ClangBuiltLinux/linux/issues/1043
Link: https://lore.kernel.org/patchwork/patch/1272115/
Link: https://lkml.kernel.org/r/20200714194740.4548-1-sedat.dilek@gmail.com
Clang's integrated assembler does not allow symbols with non-absolute
values to be reassigned. Modify the interrupt entry loop macro to be
compatible with IAS by using a label and an offset.
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Brian Gerst <brgerst@gmail.com>
Suggested-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Jian Cai <caij2003@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> #
Link: https://github.com/ClangBuiltLinux/linux/issues/1043
Link: https://lkml.kernel.org/r/20200714233024.1789985-1-caij2003@gmail.com
Current minimum required version of binutils is 2.23,
which supports PSHUFB, PCLMULQDQ, PEXTRD, AESKEYGENASSIST,
AESIMC, AESENC, AESENCLAST, AESDEC, AESDECLAST and MOVQ
instruction mnemonics.
Substitute macros from include/asm/inst.h with a proper
instruction mnemonics in various assmbly files from
x86/crypto directory, and remove now unneeded file.
The patch was tested by calculating and comparing sha256sum
hashes of stripped object files before and after the patch,
to be sure that executable code didn't change.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: "David S. Miller" <davem@davemloft.net>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stack switching for interrupt handlers happens in C now for both 64 and
32bit. Remove the stale comment which claims the contrary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which
allows userspace to query if the underlying architecture would
support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly
(e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X)
The complications in this patch are due to unexpected (but documented)
behaviour we see with NPF vmexit handling in AMD processor. If
SVM is modified to add guest physical address checks in the NPF
and guest #PF paths, we see the followning error multiple times in
the 'access' test in kvm-unit-tests:
test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001
Dump mapping: address: 0x123400000000
------L4: 24c3027
------L3: 24c4027
------L2: 24c5021
------L1: 1002000021
This is because the PTE's accessed bit is set by the CPU hardware before
the NPF vmexit. This is handled completely by hardware and cannot be fixed
in software.
Therefore, availability of the new capability depends on a boolean variable
allow_smaller_maxphyaddr which is set individually by VMX and SVM init
routines. On VMX it's always set to true, on SVM it's only set to true
when NPT is not enabled.
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Babu Moger <babu.moger@amd.com>
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Message-Id: <20200710154811.418214-10-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We would like to introduce a callback to update the #PF intercept
when CPUID changes. Just reuse update_bp_intercept since VMX is
already using update_exception_bitmap instead of a bespoke function.
While at it, remove an unnecessary assignment in the SVM version,
which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug)
and has nothing to do with the exception bitmap.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Also no point of it being inline since it's always called through
function pointers. So remove that.
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200710154811.418214-3-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl8IQ20UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNzxQf+KziWiVgLYnRmGVJ4xztRGv8Cjt+o
g1YRyJgpST4UEdHyO+T1JvO8HVzTxtDwZlRNHB9UtIaAsAuybSdpUaHeK9lvcZvi
vd59ItmyHKFOtItfG6Qpj6MJKN9tbN1y2F9Vc+TXNLL9BLHwPTbvF3l5ffdtBJ6F
zurQBec7hoCarXZJS2GfzBQ+16WxZmm7RLDpYtEqAayp+UHNw+Z1eMrMV6TwdxAA
3LgDn3l+A+BNuIIUKFF9Y5Ef3T4zBWGbdsV7FR9mH1fq2DiT1Vz4IT674L+6rEom
/KvyyjVPIfQF33sMZmVpLCpoe2IEmOtc7cu3zffqNL6gNw3YHZwMK6cgsg==
=abJa
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull vkm fixes from Paolo Bonzini:
"Two simple but important bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: MIPS: Fix build errors for 32bit kernel
KVM: nVMX: fixes for preemption timer migration
Commit 850448f35a ("KVM: nVMX: Fix VMX preemption timer migration",
2020-06-01) accidentally broke nVMX live migration from older version
by changing the userspace ABI. Restore it and, while at it, ensure
that vmx->nested.has_preemption_timer_deadline is always initialized
according to the KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE flag.
Cc: Makarand Sonare <makarandsonare@google.com>
Fixes: 850448f35a ("KVM: nVMX: Fix VMX preemption timer migration")
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While the nmi_enter() users did
trace_hardirqs_{off_prepare,on_finish}() there was no matching
lockdep_hardirqs_*() calls to complete the picture.
Introduce idtentry_{enter,exit}_nmi() to enable proper IRQ state
tracking across the NMIs.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200623083721.216740948@infradead.org
Move x86's 'struct kvm_mmu_memory_cache' to common code in anticipation
of moving the entire x86 implementation code to common KVM and reusing
it for arm64 and MIPS. Add a new architecture specific asm/kvm_types.h
to control the existence and parameters of the struct. The new header
is needed to avoid a chicken-and-egg problem with asm/kvm_host.h as all
architectures define instances of the struct in their vCPU structs.
Add an asm-generic version of kvm_types.h to avoid having empty files on
PPC and s390 in the long term, and for arm64 and mips in the short term.
Suggested-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703023545.8771-15-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a gfp_zero flag to 'struct kvm_mmu_memory_cache' and use it to
control __GFP_ZERO instead of hardcoding a call to kmem_cache_zalloc().
A future patch needs such a flag for the __get_free_page() path, as
gfn arrays do not need/want the allocator to zero the memory. Convert
the kmem_cache paths to __GFP_ZERO now so as to avoid a weird and
inconsistent API in the future.
No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703023545.8771-11-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use separate caches for allocating shadow pages versus gfn arrays. This
sets the stage for specifying __GFP_ZERO when allocating shadow pages
without incurring extra cost for gfn arrays.
No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703023545.8771-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Track the kmem_cache used for non-page KVM MMU memory caches instead of
passing in the associated kmem_cache when filling the cache. This will
allow consolidating code and other cleanups.
No functional change intended.
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703023545.8771-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the functions which are inside the RCU off region into the
non-instrumentable text section.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200708195322.037311579@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The name of callback cpuid_update() is misleading that it's not about
updating CPUID settings of vcpu but updating the configurations of vcpu
based on the CPUIDs. So rename it to vcpu_after_set_cpuid().
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200709043426.92712-5-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of creating the mask for guest CR4 reserved bits in kvm_valid_cr4(),
do it in kvm_update_cpuid() so that it can be reused instead of creating it
each time kvm_valid_cr4() is called.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <1594168797-29444-2-git-send-email-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are cases where a guest tries to switch spinlocks to bare metal
behavior (e.g. by setting "xen_nopvspin" on XEN platform and
"hv_nopvspin" on HYPER_V).
That feature is missed on KVM, add a new parameter "nopvspin" to disable
PV spinlocks for KVM guest.
The new 'nopvspin' parameter will also replace Xen and Hyper-V specific
parameters in future patches.
Define variable nopvsin as global because it will be used in future
patches as above.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make 'struct kvm_mmu_page' MMU-only, nothing outside of the MMU should
be poking into the gory details of shadow pages.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622202034.15093-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current minimum required version of binutils is 2.23,
which supports VMCALL and VMMCALL instruction mnemonics.
Replace the byte-wise specification of VMCALL and
VMMCALL with these proper mnemonics.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200623183439.5526-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both the vcpu_vmx structure and the vcpu_svm structure have a
'last_cpu' field. Move the common field into the kvm_vcpu_arch
structure. For clarity, rename it to 'last_vmentry_cpu.'
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20200603235623.245638-6-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it
from the non-nested dirty log hooks. And because it's a nested-only
operation.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the LBR call stack mode, LBR information is used to reconstruct a
call stack. To get the complete call stack, perf has to save/restore
all LBR registers during a context switch. Due to a large number of the
LBR registers, this process causes a high CPU overhead. To reduce the
CPU overhead during a context switch, use the XSAVES/XRSTORS
instructions.
Every XSAVE area must follow a canonical format: the legacy region, an
XSAVE header and the extended region. Although the LBR information is
only kept in the extended region, a space for the legacy region and
XSAVE header is still required. Add a new dedicated structure for LBR
XSAVES support.
Before enabling XSAVES support, the size of the LBR state has to be
sanity checked, because:
- the size of the software structure is calculated from the max number
of the LBR depth, which is enumerated by the CPUID leaf for Arch LBR.
The size of the LBR state is enumerated by the CPUID leaf for XSAVE
support of Arch LBR. If the values from the two CPUID leaves are not
consistent, it may trigger a buffer overflow. For example, a hypervisor
may unconsciously set inconsistent values for the two emulated CPUID.
- unlike other state components, the size of an LBR state depends on the
max number of LBRs, which may vary from generation to generation.
Expose the function xfeature_size() for the sanity check.
The LBR XSAVES support will be disabled if the size of the LBR state
enumerated by CPUID doesn't match with the size of the software
structure.
The XSAVE instruction requires 64-byte alignment for state buffers. A
new macro is added to reflect the alignment requirement. A 64-byte
aligned kmem_cache is created for architecture LBR.
Currently, the structure for each state component is maintained in
fpu/types.h. The structure for the new LBR state component should be
maintained in the same place. Move structure lbr_entry to fpu/types.h as
well for broader sharing.
Add dedicated lbr_save/lbr_restore functions for LBR XSAVES support,
which invokes the corresponding xstate helpers to XSAVES/XRSTORS LBR
information at the context switch when the call stack mode is enabled.
Since the XSAVES/XRSTORS instructions will be eventually invoked, the
dedicated functions is named with '_xsaves'/'_xrstors' postfix.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-23-git-send-email-kan.liang@linux.intel.com
The perf subsystem will only need to save/restore the LBR state.
However, the existing helpers save all supported supervisor states to a
kernel buffer, which will be unnecessary. Two helpers are introduced to
only save/restore requested dynamic supervisor states. The supervisor
features in XFEATURE_MASK_SUPERVISOR_SUPPORTED and
XFEATURE_MASK_SUPERVISOR_UNSUPPORTED mask cannot be saved/restored using
these helpers.
The helpers will be used in the following patch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-22-git-send-email-kan.liang@linux.intel.com
Last Branch Records (LBR) registers are used to log taken branches and
other control flows. In perf with call stack mode, LBR information is
used to reconstruct a call stack. To get the complete call stack, perf
has to save/restore all LBR registers during a context switch. Due to
the large number of the LBR registers, e.g., the current platform has
96 LBR registers, this process causes a high CPU overhead. To reduce
the CPU overhead during a context switch, an LBR state component that
contains all the LBR related registers is introduced in hardware. All
LBR registers can be saved/restored together using one XSAVES/XRSTORS
instruction.
However, the kernel should not save/restore the LBR state component at
each context switch, like other state components, because of the
following unique features of LBR:
- The LBR state component only contains valuable information when LBR
is enabled in the perf subsystem, but for most of the time, LBR is
disabled.
- The size of the LBR state component is huge. For the current
platform, it's 808 bytes.
If the kernel saves/restores the LBR state at each context switch, for
most of the time, it is just a waste of space and cycles.
To efficiently support the LBR state component, it is desired to have:
- only context-switch the LBR when the LBR feature is enabled in perf.
- only allocate an LBR-specific XSAVE buffer on demand.
(Besides the LBR state, a legacy region and an XSAVE header have to be
included in the buffer as well. There is a total of (808+576) byte
overhead for the LBR-specific XSAVE buffer. The overhead only happens
when the perf is actively using LBRs. There is still a space-saving,
on average, when it replaces the constant 808 bytes of overhead for
every task, all the time on the systems that support architectural
LBR.)
- be able to use XSAVES/XRSTORS for accessing LBR at run time.
However, the IA32_XSS should not be adjusted at run time.
(The XCR0 | IA32_XSS are used to determine the requested-feature
bitmap (RFBM) of XSAVES.)
A solution, called dynamic supervisor feature, is introduced to address
this issue, which
- does not allocate a buffer in each task->fpu;
- does not save/restore a state component at each context switch;
- sets the bit corresponding to the dynamic supervisor feature in
IA32_XSS at boot time, and avoids setting it at run time.
- dynamically allocates a specific buffer for a state component
on demand, e.g. only allocates LBR-specific XSAVE buffer when LBR is
enabled in perf. (Note: The buffer has to include the LBR state
component, a legacy region and a XSAVE header space.)
(Implemented in a later patch)
- saves/restores a state component on demand, e.g. manually invokes
the XSAVES/XRSTORS instruction to save/restore the LBR state
to/from the buffer when perf is active and a call stack is required.
(Implemented in a later patch)
A new mask XFEATURE_MASK_DYNAMIC and a helper xfeatures_mask_dynamic()
are introduced to indicate the dynamic supervisor feature. For the
systems which support the Architecture LBR, LBR is the only dynamic
supervisor feature for now. For the previous systems, there is no
dynamic supervisor feature available.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-21-git-send-email-kan.liang@linux.intel.com
When saving xstate to a kernel/user XSAVE area with the XSAVE family of
instructions, the current code applies the 'full' instruction mask (-1),
which tries to XSAVE all possible features. This method relies on
hardware to trim 'all possible' down to what is enabled in the
hardware. The code works well for now. However, there will be a
problem, if some features are enabled in hardware, but are not suitable
to be saved into all kernel XSAVE buffers, like task->fpu, due to
performance consideration.
One such example is the Last Branch Records (LBR) state. The LBR state
only contains valuable information when LBR is explicitly enabled by
the perf subsystem, and the size of an LBR state is large (808 bytes
for now). To avoid both CPU overhead and space overhead at each context
switch, the LBR state should not be saved into task->fpu like other
state components. It should be saved/restored on demand when LBR is
enabled in the perf subsystem. Current copy_xregs_to_* will trigger a
buffer overflow for such cases.
Three sites use the '-1' instruction mask which must be updated.
Two are saving/restoring the xstate to/from a kernel-allocated XSAVE
buffer and can use 'xfeatures_mask_all', which will save/restore all of
the features present in a normal task FPU buffer.
The last one saves the register state directly to a user buffer. It
could
also use 'xfeatures_mask_all'. Just as it was with the '-1' argument,
any supervisor states in the mask will be filtered out by the hardware
and not saved to the buffer. But, to be more explicit about what is
expected to be saved, use xfeatures_mask_user() for the instruction
mask.
KVM includes the header file fpu/internal.h. To avoid 'undefined
xfeatures_mask_all' compiling issue, move copy_fpregs_to_fpstate() to
fpu/core.c and export it, because:
- The xfeatures_mask_all is indirectly used via copy_fpregs_to_fpstate()
by KVM. The function which is directly used by other modules should be
exported.
- The copy_fpregs_to_fpstate() is a function, while xfeatures_mask_all
is a variable for the "internal" FPU state. It's safer to export a
function than a variable, which may be implicitly changed by others.
- The copy_fpregs_to_fpstate() is a big function with many checks. The
removal of the inline keyword should not impact the performance.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-20-git-send-email-kan.liang@linux.intel.com
Current LBR information in the structure x86_perf_task_context is stored
in a different format from the PEBS LBR record and Architecture LBR,
which prevents the sharing of the common codes.
Use the format of the PEBS LBR record as a unified format. Use a generic
name lbr_entry to replace pebs_lbr_entry.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-11-git-send-email-kan.liang@linux.intel.com
The LBR capabilities of Architecture LBR are retrieved from the CPUID
enumeration once at boot time. The capabilities have to be saved for
future usage.
Several new fields are added into structure x86_pmu to indicate the
capabilities. The fields will be used in the following patches.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-9-git-send-email-kan.liang@linux.intel.com
CPUID.(EAX=07H, ECX=0):EDX[19] indicates whether an Intel CPU supports
Architectural LBRs.
The "X86_FEATURE_..., word 18" is already mirrored from CPUID
"0x00000007:0 (EDX)". Add X86_FEATURE_ARCH_LBR under the "word 18"
section.
The feature will appear as "arch_lbr" in /proc/cpuinfo.
The Architectural Last Branch Records (LBR) feature enables recording
of software path history by logging taken branches and other control
flows. The feature will be supported in the perf_events subsystem.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-2-git-send-email-kan.liang@linux.intel.com
They were originally called _cond_rcu because they were special versions
with conditional RCU handling. Now they're the standard entry and exit
path, so the _cond_rcu part is just confusing. Drop it.
Also change the signature to make them more extensible and more foolproof.
No functional change -- it's pure refactoring.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/247fc67685263e0b673e1d7f808182d28ff80359.1593795633.git.luto@kernel.org
xenpv_exc_nmi() and xenpv_exc_debug() are only defined on 64-bit kernels,
but they snuck into the 32-bit build via <asm/identry.h>, causing the link
to fail:
ld: arch/x86/entry/entry_32.o: in function `asm_xenpv_exc_nmi':
(.entry.text+0x817): undefined reference to `xenpv_exc_nmi'
ld: arch/x86/entry/entry_32.o: in function `asm_xenpv_exc_debug':
(.entry.text+0x827): undefined reference to `xenpv_exc_debug'
Only use them on 64-bit kernels.
Fixes: f41f082422: ("x86/entry/xen: Route #DB correctly on Xen PV")
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>