The only use of KEXEC_BACKUP_SRC_END is as an argument to
walk_system_ram_res():
int crash_load_segments(struct kimage *image)
{
...
walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
image, determine_backup_region);
walk_system_ram_res() expects "start, end" arguments that are inclusive,
i.e., the range to be walked includes both the start and end addresses.
KEXEC_BACKUP_SRC_END was previously defined as (640 * 1024UL), which is the
first address *past* the desired 0-640KB range.
Define KEXEC_BACKUP_SRC_END as (640 * 1024UL - 1) so the KEXEC_BACKUP_SRC
region is [0-0x9ffff], not [0-0xa0000].
Fixes: dd5f726076 ("kexec: support for kexec on panic using new system call")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: baiyaowei@cmss.chinamobile.com
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/153805811578.1157.6948388946904655969.stgit@bhelgaas-glaptop.roam.corp.google.com
Spurious faults only ever occur in the kernel's address space. They
are also constrained specifically to faults with one of these error codes:
X86_PF_WRITE | X86_PF_PROT
X86_PF_INSTR | X86_PF_PROT
So, it's never even possible to reach spurious_kernel_fault_check() with
X86_PF_PK set.
In addition, the kernel's address space never has pages with user-mode
protections. Protection Keys are only enforced on pages with user-mode
protection.
This gives us lots of reasons to not check for protection keys in our
sprurious kernel fault handling.
But, let's also add some warnings to ensure that these assumptions about
protection keys hold true.
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180928160231.243A0D6A@viggo.jf.intel.com
The vsyscall page is weird. It is in what is traditionally part of
the kernel address space. But, it has user permissions and we handle
faults on it like we would on a user page: interrupts on.
Right now, we handle vsyscall emulation in the "bad_area" code, which
is used for both user-address-space and kernel-address-space faults.
Move the handling to the user-address-space code *only* and ensure we
get there by "excluding" the vsyscall page from the kernel address
space via a check in fault_in_kernel_space().
Since the fault_in_kernel_space() check is used on 32-bit, also add a
64-bit check to make it clear we only use this path on 64-bit. Also
move the unlikely() to be in is_vsyscall_vaddr() itself.
This helps clean up the kernel fault handling path by removing a case
that can happen in normal[1] operation. (Yeah, yeah, we can argue
about the vsyscall page being "normal" or not.) This also makes
sanity checks easier, like the "we never take pkey faults in the
kernel address space" check in the next patch.
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180928160230.6E9336EE@viggo.jf.intel.com
We will shortly be using this check in two locations. Put it in
a helper before we do so.
Let's also insert PAGE_MASK instead of the open-coded ~0xfff.
It is easier to read and also more obviously correct considering
the implicit type conversion that has to happen when it is not
an implicit 'unsigned long'.
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180928160228.C593509B@viggo.jf.intel.com
The comments here are wrong. They are too absolute about where
faults can occur when running in the kernel. The comments are
also a bit hard to match up with the code.
Trim down the comments, and make them more precise.
Also add a comment explaining why we are doing the
bad_area_nosemaphore() path here.
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180928160227.077DDD7A@viggo.jf.intel.com
The SMAP and Reserved checking do not have nice comments. Add
some to clarify and make it match everything else.
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180928160225.FFD44B8D@viggo.jf.intel.com
The last patch broke out kernel address space handing into its own
helper. Now, do the same for user address space handling.
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180928160223.9C4F6440@viggo.jf.intel.com
The page fault handler (__do_page_fault()) basically has two sections:
one for handling faults in the kernel portion of the address space
and another for faults in the user portion of the address space.
But, these two parts don't stick out that well. Let's make that more
clear from code separation and naming. Pull kernel fault
handling into its own helper, and reflect that naming by renaming
spurious_fault() -> spurious_kernel_fault().
Also, rewrite the vmalloc() handling comment a bit. It was a bit
stale and also glossed over the reserved bit handling.
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180928160222.401F4E10@viggo.jf.intel.com
We pass around a variable called "error_code" all around the page
fault code. Sounds simple enough, especially since "error_code" looks
like it exactly matches the values that the hardware gives us on the
stack to report the page fault error code (PFEC in SDM parlance).
But, that's not how it works.
For part of the page fault handler, "error_code" does exactly match
PFEC. But, during later parts, it diverges and starts to mean
something a bit different.
Give it two names for its two jobs.
The place it diverges is also really screwy. It's only in a spot
where the hardware tells us we have kernel-mode access that occurred
while we were in usermode accessing user-controlled address space.
Add a warning in there.
Cc: x86@kernel.org
Cc: Jann Horn <jannh@google.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180928160220.4A2272C9@viggo.jf.intel.com
Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.
Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.
The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.
CPUs in lazy TLB mode remain part of the mm_cpumask(mm), both because
that allows TLB flush IPIs to be sent at page table freeing time, and
because the cache line bouncing on the mm_cpumask(mm) was responsible
for about half the CPU use in switch_mm_irqs_off().
We can change native_flush_tlb_others() without touching other
(paravirt) implementations of flush_tlb_others() because we'll be
flushing less. The existing implementations flush more and are
therefore still correct.
Cc: npiggin@gmail.com
Cc: mingo@kernel.org
Cc: will.deacon@arm.com
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Cc: hpa@zytor.com
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180926035844.1420-8-riel@surriel.com
Move some code that will be needed for the lazy -> !lazy state
transition when a lazy TLB CPU has gotten out of date.
No functional changes, since the if (real_prev == next) branch
always returns.
(cherry picked from commit 61d0beb579)
Cc: npiggin@gmail.com
Cc: efault@gmx.de
Cc: will.deacon@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: songliubraving@fb.com
Cc: kernel-team@fb.com
Cc: hpa@zytor.com
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180716190337.26133-4-riel@surriel.com
On most workloads, the number of context switches far exceeds the
number of TLB flushes sent. Optimizing the context switches, by always
using lazy TLB mode, speeds up those workloads.
This patch results in about a 1% reduction in CPU use on a two socket
Broadwell system running a memcache like workload.
Cc: npiggin@gmail.com
Cc: efault@gmx.de
Cc: will.deacon@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: hpa@zytor.com
Cc: luto@kernel.org
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
(cherry picked from commit 95b0e6357d)
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com
Use the new tlb_get_unmap_shift() to determine the stride of the
INVLPG loop.
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Currently CONFIG_RANDOMIZE_BASE=y is set by default, which makes some of the
old comments above the KERNEL_IMAGE_SIZE definition out of date. Update them
to the current state of affairs.
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: corbet@lwn.net
Cc: linux-doc@vger.kernel.org
Cc: thgarnie@google.com
Link: http://lkml.kernel.org/r/20181006084327.27467-2-bhe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When SME is enabled, the memory is encrypted in the first kernel. In
this case, SME also needs to be enabled in the kdump kernel, and we have
to remap the old memory with the memory encryption mask.
The case of concern here is if SME is active in the first kernel,
and it is active too in the kdump kernel. There are four cases to be
considered:
a. dump vmcore
It is encrypted in the first kernel, and needs be read out in the
kdump kernel.
b. crash notes
When dumping vmcore, the people usually need to read useful
information from notes, and the notes is also encrypted.
c. iommu device table
It's encrypted in the first kernel, kdump kernel needs to access its
content to analyze and get information it needs.
d. mmio of AMD iommu
not encrypted in both kernels
Add a new bool parameter @encrypted to __ioremap_caller(). If set,
memory will be remapped with the SME mask.
Add a new function ioremap_encrypted() to explicitly pass in a true
value for @encrypted. Use ioremap_encrypted() for the above a, b, c
cases.
[ bp: cleanup commit message, extern defs in io.h and drop forgotten
include. ]
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kexec@lists.infradead.org
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: akpm@linux-foundation.org
Cc: dan.j.williams@intel.com
Cc: bhelgaas@google.com
Cc: baiyaowei@cmss.chinamobile.com
Cc: tiwai@suse.de
Cc: brijesh.singh@amd.com
Cc: dyoung@redhat.com
Cc: bhe@redhat.com
Cc: jroedel@suse.de
Link: https://lkml.kernel.org/r/20180927071954.29615-2-lijiang@redhat.com
If we IPI for WBINDV, then we might as well kill the entire TLB too.
But if we don't have to invalidate cache, there is no reason not to
use a range TLB flush.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085948.195633798@infradead.org
The start of cpa_flush_range() and cpa_flush_array() is the same, use
a common function.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085948.138859183@infradead.org
Rather than guarding cpa_flush_array() users with a CLFLUSH test, put
it inside.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085948.087848187@infradead.org
Rather than guarding all cpa_flush_range() uses with a CLFLUSH test,
put it inside.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085948.036195503@infradead.org
Both cpa_flush_range() and cpa_flush_array() have a well specified
range, use that to do a range based TLB invalidate.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085947.985193217@infradead.org
CAT has happened, WBINDV is bad (even before CAT blowing away the
entire cache on a multi-core platform wasn't nice), try not to use it
ever.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085947.933674526@infradead.org
There is an atom errata, where we do a local TLB invalidate right
before we return and then do a global TLB invalidate.
Move the global invalidate up a little bit and avoid the local
invalidate entirely.
This does put the global invalidate under pgd_lock, but that shouldn't
matter.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085947.882287392@infradead.org
Instead of open-coding it..
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180919085947.831102058@infradead.org
The extra loop which tries hard to preserve large pages in case of conflicts
with static protection regions turns out to be not preserving anything, at
least not in the experiments which have been conducted.
There might be corner cases in which the code would be able to preserve a
large page oaccsionally, but it's really not worth the extra code and the
cycles wasted in the common case.
Before:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 541
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 514
4K pages set-checked: 7668
After:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 538
2M pages sameprot: 466
2M pages preserved: 47
4K pages set-checked: 7668
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.589642503@linutronix.de
To avoid excessive 4k wise checks in the common case do a quick check first
whether the requested new page protections conflict with a static
protection area in the large page. If there is no conflict then the
decision whether to preserve or to split the page can be made immediately.
If the requested range covers the full large page, preserve it. Otherwise
split it up. No point in doing a slow crawl in 4k steps.
Before:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 538
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 560642
4K pages set-checked: 7668
After:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 541
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 514
4K pages set-checked: 7668
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.507259989@linutronix.de
When the existing mapping is correct and the new requested page protections
are the same as the existing ones, then further checks can be omitted and the
large page can be preserved. The slow path 4k wise check will not come up with
a different result.
Before:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 540
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 800709
4K pages set-checked: 7668
After:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 538
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 560642
4K pages set-checked: 7668
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.424477581@linutronix.de
With the range check it is possible to do a quick verification that the
current mapping is correct vs. the static protection areas.
In case a incorrect mapping is detected a warning is emitted and the large
page is split up. If the large page is a 2M page, then the split code is
forced to check the static protections for the PTE entries to fix up the
incorrectness. For 1G pages this can't be done easily because that would
require to either find the offending 2M areas before the split or
afterwards. For now just warn about that case and revisit it when reported.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.331408643@linutronix.de
The large page preservation mechanism is just magic and provides no
information at all. Add optional statistic output in debugfs so the magic can
be evaluated. Defaults is off.
Output:
1G pages checked: 2
1G pages sameprot: 0
1G pages preserved: 0
2M pages checked: 540
2M pages sameprot: 466
2M pages preserved: 47
4K pages checked: 800770
4K pages set-checked: 7668
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.160867778@linutronix.de
The whole static protection magic is silently fixing up anything which is
handed in. That's just wrong. The offending call sites need to be fixed.
Add a debug mechanism which emits a warning if a requested mapping needs to be
fixed up. The DETECT debug mechanism is really not meant to be enabled except
for developers, so limit the output hard to the protection fixups.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143546.078998733@linutronix.de
Checking static protections only page by page is slow especially for huge
pages. To allow quick checks over a complete range, add the ability to do
that.
Make the checks inclusive so the ranges can be directly used for debug output
later.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143545.995734490@linutronix.de
static_protections() is pretty unreadable. Split it up into separate checks
for each protection area.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143545.913005317@linutronix.de
Avoid the extra variable and gotos by splitting the function into the
actual algorithm and a callable function which contains the lock
protection.
Rename it to should_split_large_page() while at it so the return values make
actually sense.
Clean up the code flow, comments and general whitespace damage while at it. No
functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143545.830507216@linutronix.de
The sequence of marking text and rodata read-only in 32bit init is:
set_ro(text);
kernel_set_to_readonly = 1;
set_ro(rodata);
When kernel_set_to_readonly is 1 it enables the protection mechanism in CPA
for the read only regions. With the upcoming checks for existing mappings
this consequently triggers the warning about an existing mapping being
incorrect vs. static protections because rodata has not been converted yet.
There is no technical reason to split the two, so just combine the RO
protection to convert text and rodata in one go.
Convert the printks to pr_info while at it.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Bin Yang <bin.yang@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Link: https://lkml.kernel.org/r/20180917143545.731701535@linutronix.de
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW6dGhgAKCRCAXGG7T9hj
vs1UAPwPSDmelfUus+P5ndRQZdK/iQWuRgQUe9Gd3RUVTfcQ7AEAljcN4/dSj7SB
hOgRlCp5WB1s5/vFF7z4jc2wtqvOPAk=
=8P9c
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Juergen writes:
"xen:
Two small fixes for xen drivers."
* tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: issue warning message when out of grant maptrack entries
xen/x86/vpmu: Zero struct pt_regs before calling into sample handling code
Thomas writes:
"A set of fixes for x86:
- Resolve the kvmclock regression on AMD systems with memory
encryption enabled. The rework of the kvmclock memory allocation
during early boot results in encrypted storage, which is not
shareable with the hypervisor. Create a new section for this data
which is mapped unencrypted and take care that the later
allocations for shared kvmclock memory is unencrypted as well.
- Fix the build regression in the paravirt code introduced by the
recent spectre v2 updates.
- Ensure that the initial static page tables cover the fixmap space
correctly so early console always works. This worked so far by
chance, but recent modifications to the fixmap layout can -
depending on kernel configuration - move the relevant entries to a
different place which is not covered by the initial static page
tables.
- Address the regressions and issues which got introduced with the
recent extensions to the Intel Recource Director Technology code.
- Update maintainer entries to document reality"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Expand static page table for fixmap space
MAINTAINERS: Add X86 MM entry
x86/intel_rdt: Add Reinette as co-maintainer for RDT
MAINTAINERS: Add Borislav to the x86 maintainers
x86/paravirt: Fix some warning messages
x86/intel_rdt: Fix incorrect loop end condition
x86/intel_rdt: Fix exclusive mode handling of MBA resource
x86/intel_rdt: Fix incorrect loop end condition
x86/intel_rdt: Do not allow pseudo-locking of MBA resource
x86/intel_rdt: Fix unchecked MSR access
x86/intel_rdt: Fix invalid mode warning when multiple resources are managed
x86/intel_rdt: Global closid helper to support future fixes
x86/intel_rdt: Fix size reporting of MBA resource
x86/intel_rdt: Fix data type in parsing callbacks
x86/kvm: Use __bss_decrypted attribute in shared variables
x86/mm: Add .bss..decrypted section to hold shared variables
I swear I would have sent it the same to Linus! The main cause for
this is that I was on vacation until two weeks ago and it took a while
to sort all the pending patches between 4.19 and 4.20, test them and
so on.
It's mostly small bugfixes and cleanups, mostly around x86 nested
virtualization. One important change, not related to nested
virtualization, is that the ability for the guest kernel to trap CPUID
instructions (in Linux that's the ARCH_SET_CPUID arch_prctl) is now
masked by default. This is because the feature is detected through an
MSR; a very bad idea that Intel seems to like more and more. Some
applications choke if the other fields of that MSR are not initialized
as on real hardware, hence we have to disable the whole MSR by default,
as was the case before Linux 4.12.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJbpPo1AAoJEL/70l94x66DdxgH/is0qe6ZBtzb6Qc0W+8mHHD7
nxIkWAs2V5NsouJ750YwRQ+0Ym407+wlNt30acdBUEoXhrnA5/TvyGq999XvCL96
upWEIxpIgbvTMX/e2nLhe4wQdhsboUK4r0/B9IFgVFYrdCt5uRXjB2G4ewxcqxL/
GxxqrAKhaRsbQG9Xv0Fw5Vohh/Ls6fQDJcyuY1EBnbMpVenq2QDLI6cOAPXncyFb
uLN6ov4GNCWIPckwxejri5XhZesUOsafrmn48sApShh4T6TrisrdtSYdzl+DGza+
j5vhIEwdFO5kulZ3viuhqKJOnS2+F6wvfZ75IKT0tEKeU2bi+ifGDyGRefSF6Q0=
=YXLw
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Paolo writes:
"It's mostly small bugfixes and cleanups, mostly around x86 nested
virtualization. One important change, not related to nested
virtualization, is that the ability for the guest kernel to trap
CPUID instructions (in Linux that's the ARCH_SET_CPUID arch_prctl) is
now masked by default. This is because the feature is detected
through an MSR; a very bad idea that Intel seems to like more and
more. Some applications choke if the other fields of that MSR are
not initialized as on real hardware, hence we have to disable the
whole MSR by default, as was the case before Linux 4.12."
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (23 commits)
KVM: nVMX: Fix bad cleanup on error of get/set nested state IOCTLs
kvm: selftests: Add platform_info_test
KVM: x86: Control guest reads of MSR_PLATFORM_INFO
KVM: x86: Turbo bits in MSR_PLATFORM_INFO
nVMX x86: Check VPID value on vmentry of L2 guests
nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2
KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv
KVM: VMX: check nested state and CR4.VMXE against SMM
kvm: x86: make kvm_{load|put}_guest_fpu() static
x86/hyper-v: rename ipi_arg_{ex,non_ex} structures
KVM: VMX: use preemption timer to force immediate VMExit
KVM: VMX: modify preemption timer bit only when arming timer
KVM: VMX: immediately mark preemption timer expired only for zero value
KVM: SVM: Switch to bitmap_zalloc()
KVM/MMU: Fix comment in walk_shadow_page_lockless_end()
kvm: selftests: use -pthread instead of -lpthread
KVM: x86: don't reset root in kvm_mmu_setup()
kvm: mmu: Don't read PDPTEs when paging is not enabled
x86/kvm/lapic: always disable MMIO interface in x2APIC mode
KVM: s390: Make huge pages unavailable in ucontrol VMs
...
We met a kernel panic when enabling earlycon, which is due to the fixmap
address of earlycon is not statically setup.
Currently the static fixmap setup in head_64.S only covers 2M virtual
address space, while it actually could be in 4M space with different
kernel configurations, e.g. when VSYSCALL emulation is disabled.
So increase the static space to 4M for now by defining FIXMAP_PMD_NUM to 2,
and add a build time check to ensure that the fixmap is covered by the
initial static page tables.
Fixes: 1ad83c858c ("x86_64,vsyscall: Make vsyscall emulation configurable")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com> (Xen parts)
Cc: H Peter Anvin <hpa@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180920025828.23699-1-feng.tang@intel.com
The handlers of IOCTLs in kvm_arch_vcpu_ioctl() are expected to set
their return value in "r" local var and break out of switch block
when they encounter some error.
This is because vcpu_load() is called before the switch block which
have a proper cleanup of vcpu_put() afterwards.
However, KVM_{GET,SET}_NESTED_STATE IOCTLs handlers just return
immediately on error without performing above mentioned cleanup.
Thus, change these handlers to behave as expected.
Fixes: 8fcc4b5923 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Patrick Colp <patrick.colp@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
to reads of MSR_PLATFORM_INFO.
Disabling access to reads of this MSR gives userspace the control to "expose"
this platform-dependent information to guests in a clear way. As it exists
today, guests that read this MSR would get unpopulated information if userspace
hadn't already set it (and prior to this patch series, only the CPUID faulting
information could have been populated). This existing interface could be
confusing if guests don't handle the potential for incorrect/incomplete
information gracefully (e.g. zero reported for base frequency).
Signed-off-by: Drew Schmitt <dasch@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously, only
the CPUID faulting bit was settable. But now any bit in
MSR_PLATFORM_INFO would be settable. This can be used, for example, to
convey frequency information about the platform on which the guest is
running.
Signed-off-by: Drew Schmitt <dasch@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
following check needs to be enforced on vmentry of L2 guests:
If the 'enable VPID' VM-execution control is 1, the value of the
of the VPID VM-execution control field must not be 0000H.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "Checks on VMX Controls" in Intel SDM vol 3C,
the following check needs to be enforced on vmentry of L2 guests:
- Bits 5:0 of the posted-interrupt descriptor address are all 0.
- The posted-interrupt descriptor address does not set any bits
beyond the processor's physical-address width.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In case L1 do not intercept L2 HLT or enter L2 in HLT activity-state,
it is possible for a vCPU to be blocked while it is in guest-mode.
According to Intel SDM 26.6.5 Interrupt-Window Exiting and
Virtual-Interrupt Delivery: "These events wake the logical processor
if it just entered the HLT state because of a VM entry".
Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
should be waken from the HLT state and injected with the interrupt.
In addition, if while the vCPU is blocked (while it is in guest-mode),
it receives a nested posted-interrupt, then the vCPU should also be
waken and injected with the posted interrupt.
To handle these cases, this patch enhances kvm_vcpu_has_events() to also
check if there is a pending interrupt in L2 virtual APICv provided by
L1. That is, it evaluates if there is a pending virtual interrupt for L2
by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
Evaluation of Pending Interrupts.
Note that this also handles the case of nested posted-interrupt by the
fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
kvm_vcpu_running() -> vmx_check_nested_events() ->
vmx_complete_nested_posted_interrupt().
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX cannot be enabled under SMM, check it when CR4 is set and when nested
virtualization state is restored.
This should fix some WARNs reported by syzkaller, mostly around
alloc_shadow_vmcs.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The functions
kvm_load_guest_fpu()
kvm_put_guest_fpu()
are only used locally, make them static. This requires also that both
functions are moved because they are used before their implementation.
Those functions were exported (via EXPORT_SYMBOL) before commit
e5bb40251a ("KVM: Drop kvm_{load,put}_guest_fpu() exports").
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>