The condition for reading CR4 was wrong: there are some CPUs with
CPUID but not CR4. Rather than trying to make the condition exact,
use __read_cr4_safe().
Fixes: 18bc7bd523 ("x86/boot: Synchronize trampoline_cr4_features and mmu_cr4_features directly")
Reported-by: david@saggiorato.net
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Link: http://lkml.kernel.org/r/8c453a61c4f44ab6ff43c29780ba04835234d2e5.1475178369.git.luto@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Current code can call set_cpu_sibling_map() and invoke sched_set_topology()
more than once (e.g. on CPU hot plug). When this happens after
sched_init_smp() has been called, we lose the NUMA topology extension to
sched_domain_topology in sched_init_numa(). This results in incorrect
topology when the sched domain is rebuilt.
This patch fixes the bug and issues warning if we call sched_set_topology()
after sched_init_smp().
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1474485552-141429-2-git-send-email-srinivas.pandruvada@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJX6H4uAAoJEHm+PkMAQRiG5sMH/3yzrMiUCSokdS+cvY+jgKAG
JS58JmRvBPz2mRaU3MRPBGRDeCz/Nc9LggL2ZcgM+E1ZYirlYyQfIED3lkqk5R07
kIN1wmb+kQhXyU4IY3fEX7joqyKC6zOy4DUChPkBQU0/0+VUmdVmcJvsuPlnMZtf
g95m0BdYTui+eDezASRqOEp3Lb5ONL4c3ao4yBP0LHF033ctj3VJQiyi5uERPZJ0
5e6Mo7Wxn78t9WqJLQAiEH46kTwT2plNlxf3XXqTenfIdbWhqE873HPGeSMa3VQV
VywXTpCpSPQsA8BYg66qIbebdKOhs9MOviHVfqDtwQlvwhjlBDya0gNHfI5fSy4=
=Y/L5
-----END PGP SIGNATURE-----
Merge tag 'v4.8-rc8' into drm-next
Linux 4.8-rc8
There was a lot of fallout in the imx/amdgpu/i915 drivers, so backmerge
it now to avoid troubles.
* tag 'v4.8-rc8': (1442 commits)
Linux 4.8-rc8
fault_in_multipages_readable() throws set-but-unused error
mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing
radix tree: fix sibling entry handling in radix_tree_descend()
radix tree test suite: Test radix_tree_replace_slot() for multiorder entries
fix memory leaks in tracing_buffers_splice_read()
tracing: Move mutex to protect against resetting of seq data
MIPS: Fix delay slot emulation count in debugfs
MIPS: SMP: Fix possibility of deadlock when bringing CPUs online
mm: delete unnecessary and unsafe init_tlb_ubc()
huge tmpfs: fix Committed_AS leak
shmem: fix tmpfs to handle the huge= option properly
blk-mq: skip unmapped queues in blk_mq_alloc_request_hctx
MIPS: Fix pre-r6 emulation FPU initialisation
arm64: kgdb: handle read-only text / modules
arm64: Call numa_store_cpu_info() earlier.
locking/hung_task: Fix typo in CONFIG_DETECT_HUNG_TASK help text
nvme-rdma: only clear queue flags after successful connect
i2c: qup: skip qup_i2c_suspend if the device is already runtime suspended
perf/core: Limit matching exclusive events to one PMU
...
Fix up the silent merge conflict between commit c291b01515 in x86/urgent
and commit f7c28833c2 in x86/apic which both remove num_processors++
from the original location and then add it at two different locations. As a
result num_processors is incremented twice which can cut the number of
available cpus in half.
Remove the one which is added by commit c291b01515.
In hindsight I should have merged x86/urgent into x86/apic _before_ adding
the nodeid bits, but in hindsight we are always smarter.
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: 1e1b37273c ("Merge branch 'x86/urgent' into x86/apic")
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1609261350090.5483@nanos
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Bring in the upstream modifications so we can fixup the silent merge
conflict which is introduced by this merge.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In commit:
ec776ef6bb ("x86/mm: Add support for the non-standard protected e820 type")
Christoph references the original patch I wrote implementing pmem support.
The intent of the 'max_pfn' changes in that commit were to enable persistent
memory ranges to be covered by the struct page memmap by default.
However, that approach was abandoned when Christoph ported the patches [1], and
that functionality has since been replaced by devm_memremap_pages().
In the meantime, this max_pfn manipulation is confusing kdump [2] that
assumes that everything covered by the max_pfn is "System RAM". This
results in kdump hanging or crashing.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-March/000348.html
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1351098
So fix it.
Reported-by: Zhang Yi <yizhan@redhat.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Zhang Yi <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@vger.kernel.org> # v4.1 and later kernels
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-nvdimm@lists.01.org
Fixes: ec776ef6bb ("x86/mm: Add support for the non-standard protected e820 type")
Link: http://lkml.kernel.org/r/147448744538.34910.11287693517367139607.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid.
4. Establish all possible cpuid <-> nodeid mapping.
This patch finishes step 2.
In this patch, we introduce a new static array named cpuid_to_apicid[],
which is large enough to store info for all possible cpus.
And then, we modify the cpuid calculation. In generic_processor_info(),
it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
mapping changes with node hotplug.
After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
mapping will be persistent.
And finally we will use this array to make cpuid <-> nodeid persistent.
cpuid <-> apicid mapping is established at local apic registeration time.
But non-present or disabled cpus are ignored.
In this patch, we establish all possible cpuid <-> apicid mapping when
registering local apic.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: mika.j.penttila@gmail.com
Cc: len.brown@intel.com
Cc: rafael@kernel.org
Cc: rjw@rjwysocki.net
Cc: yasu.isimatu@gmail.com
Cc: linux-mm@kvack.org
Cc: linux-acpi@vger.kernel.org
Cc: isimatu.yasuaki@jp.fujitsu.com
Cc: gongzhaogang@inspur.com
Cc: tj@kernel.org
Cc: izumi.taku@jp.fujitsu.com
Cc: cl@linux.com
Cc: chen.tang@easystack.cn
Cc: akpm@linux-foundation.org
Cc: kamezawa.hiroyu@jp.fujitsu.com
Cc: lenb@kernel.org
Link: http://lkml.kernel.org/r/1472114120-3281-4-git-send-email-douly.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
cpuid <-> nodeid mapping is firstly established at boot time. And workqueue caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.
When doing node online/offline, cpuid <-> nodeid mapping is established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.
So here is the problem:
Assume we have the following cpuid <-> nodeid in the beginning:
Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119
and we hot-remove node2 and node3, it becomes:
Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89
and we hot-add node4 and node5, it becomes:
Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119
But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.
When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.
static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}
Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline node,
which will lead to memory allocation failure:
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656
It happens here:
create_worker(struct worker_pool *pool)
|--> worker = alloc_worker(pool->node);
static struct worker *alloc_worker(int node)
{
struct worker *worker;
worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, useing the wrong node.
......
return worker;
}
[Solution]
There are four mappings in the kernel:
1. nodeid (logical node id) <-> pxm
2. apicid (physical cpu id) <-> nodeid
3. cpuid (logical cpu id) <-> apicid
4. cpuid (logical cpu id) <-> nodeid
1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> pxm
mapping is setup at boot time. This mapping is persistent, won't change.
2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at boot
time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is also
persistent.
3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
allocated, lower ids first, and released at CPU hotremove time, reused for other
hotadded CPUs. So this mapping is not persistent.
4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
cleared at CPU hotremove time. As a result of 3, this mapping is not persistent.
To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
mapping. So the key point is obtaining all cpus' apicid.
apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
This is done by introducing an extra parameter to generic_processor_info to let the
caller control if disabled cpus are ignored.
2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify
the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when
registering local apic. Store the mapping in this array.
3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' apicid.
This is also done by introducing an extra parameter to these apis to let the caller
control if disabled cpus are ignored.
4. Establish all possible cpuid <-> nodeid mapping.
This is done via an additional acpi namespace walk for processors.
This patch finished step 1.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: mika.j.penttila@gmail.com
Cc: len.brown@intel.com
Cc: rafael@kernel.org
Cc: rjw@rjwysocki.net
Cc: yasu.isimatu@gmail.com
Cc: linux-mm@kvack.org
Cc: linux-acpi@vger.kernel.org
Cc: isimatu.yasuaki@jp.fujitsu.com
Cc: gongzhaogang@inspur.com
Cc: tj@kernel.org
Cc: izumi.taku@jp.fujitsu.com
Cc: cl@linux.com
Cc: chen.tang@easystack.cn
Cc: akpm@linux-foundation.org
Cc: kamezawa.hiroyu@jp.fujitsu.com
Cc: lenb@kernel.org
Link: http://lkml.kernel.org/r/1472114120-3281-3-git-send-email-douly.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The maximum size of e820 map array for EFI systems is defined as
E820_X_MAX (E820MAX + 3 * MAX_NUMNODES).
In x86_64 defconfig, this ends up with E820_X_MAX = 320, e820 and e820_saved
are 6404 bytes each.
With larger configs, for example Fedora kernels, E820_X_MAX = 3200, e820
and e820_saved are 64004 bytes each. Most of this space is wasted.
Typical machines have some 20-30 e820 areas at most.
After previous patch, e820 and e820_saved are pointers to e280 maps.
Change them to initially point to maps which are __initdata.
At the very end of kernel init, just before __init[data] sections are freed
in free_initmem(), allocate smaller blocks, copy maps there,
and change pointers.
The late switch makes sure that all functions which can be used to change
e820 maps are no longer accessible (they are all __init functions).
Run-tested.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20160918182125.21000-1-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch turns e820 and e820_saved into pointers to e820 tables,
of the same size as before.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20160917213927.1787-2-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
They are all called only from other __init functions in e820.c
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20160917213927.1787-1-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the following commit:
e18bcccd1a ("x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder")
The task pointer argument to show_stack_log_lvl() in show_stack() was
inadvertently changed to 'current'.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: byungchul.park@lge.com
Cc: fweisbec@gmail.com
Cc: keescook@chromium.org
Cc: linux-tip-commits@vger.kernel.org
Cc: luto@amacapital.net
Cc: nilayvaish@gmail.com
Cc: rostedt@goodmis.org
Cc: tip-bot for Josh Poimboeuf <tipbot@zytor.com>
Fixes: e18bcccd1a ("x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder")
Link: http://lkml.kernel.org/r/20160920155340.yhewlx7vmgmov5fb@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce a function that reads the exact nanoseconds value that is
provided to the guest in kvmclock. This crystallizes the notion of
kvmclock as a thin veneer over a stable TSC, that the guest will
(hopefully) convert with NTP. In other words, kvmclock is *not* a
paravirtualized host-to-guest NTP.
Drop the get_kernel_ns() function, that was used both to get the base
value of the master clock and to get the current value of kvmclock.
The former use is replaced by ktime_get_boot_ns(), the latter is
the purpose of get_kernel_ns().
This also allows KVM to provide a Hyper-V time reference counter that
is synchronized with the time that is computed from the TSC page.
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All previous users of dump_trace() have been converted to use the new
unwind interfaces, so we can remove it and the related
print_context_stack() and print_context_stack_bp() callback functions.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5b97da3572b40b5a4d8e185cf2429308d0987a13.1474045023.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Convert show_trace_log_lvl() to use the new unwinder. dump_trace() has
been deprecated.
show_trace_log_lvl() is special compared to other users of the unwinder.
It's the only place where both reliable *and* unreliable addresses are
needed. With frame pointers enabled, most callers of the unwinder don't
want to know about unreliable addresses. But in this case, when we're
dumping the stack to the console because something presumably went
wrong, the unreliable addresses are useful:
- They show stale data on the stack which can provide useful clues.
- If something goes wrong with the unwinder, or if frame pointers are
corrupt or missing, all the stack addresses still get shown.
So in order to show all addresses on the stack, and at the same time
figure out which addresses are reliable, we have to do the scanning and
the unwinding in parallel.
The scanning is done with the help of get_stack_info() to traverse the
stacks. The unwinding is done separately by the new unwinder.
In theory we could simplify show_trace_log_lvl() by instead pushing some
of this logic into the unwind code. But then we would need some kind of
"fake" frame logic in the unwinder which would add a lot of complexity
and wouldn't be worth it in order to support only one user.
Another benefit of this approach is that once we have a DWARF unwinder,
we should be able to just plug it in with minimal impact to this code.
Another change here is that callers of show_trace_log_lvl() don't need
to provide the 'bp' argument. The unwinder already finds the relevant
frame pointer by unwinding until it reaches the first frame after the
provided stack pointer.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/703b5998604c712a1f801874b43f35d6dac52ede.1474045023.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The x86 stack dump code is a bit of a mess. dump_trace() uses
callbacks, and each user of it seems to have slightly different
requirements, so there are several slightly different callbacks floating
around.
Also there are some upcoming features which will need more changes to
the stack dump code, including the printing of stack pt_regs, reliable
stack detection for live patching, and a DWARF unwinder. Each of those
features would at least need more callbacks and/or callback interfaces,
resulting in a much bigger mess than what we have today.
Before doing all that, we should try to clean things up and replace
dump_trace() with something cleaner and more flexible.
The new unwinder is a simple state machine which was heavily inspired by
a suggestion from Andy Lutomirski:
https://lkml.kernel.org/r/CALCETrUbNTqaM2LRyXGRx=kVLRPeY5A3Pc6k4TtQxF320rUT=w@mail.gmail.com
It's also similar to the libunwind API:
http://www.nongnu.org/libunwind/man/libunwind(3).html
Some if its advantages:
- Simplicity: no more callback sprawl and less code duplication.
- Flexibility: it allows the caller to stop and inspect the stack state
at each step in the unwinding process.
- Modularity: the unwinder code, console stack dump code, and stack
metadata analysis code are all better separated so that changing one
of them shouldn't have much of an impact on any of the others.
Two implementations are added which conform to the new unwind interface:
- The frame pointer unwinder which is used for CONFIG_FRAME_POINTER=y.
- The "guess" unwinder which is used for CONFIG_FRAME_POINTER=n. This
isn't an "unwinder" per se. All it does is scan the stack for kernel
text addresses. But with no frame pointers, guesses are better than
nothing in most cases.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6dc2f909c47533d213d0505f0a113e64585bec82.1474045023.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These files were only including module.h for exception table related
functions. We've now separated that content out into its own file
"extable.h" so now move over to that and avoid all the extra header content
in module.h that we don't really need to compile these files.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/20160919210418.30243-1-paul.gortmaker@windriver.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
commit aa297292d7 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via
CPUID") added code to retrieve the crystal and TSC frequency from CPUID
leaves. If the crystal freqency is enumerated as 0,the resulting TSC
frequency is 0 as well. For CPUs with a known fixed crystal frequency a
quirk list is available to set the frequency,
Kabylake and SkylakeX CPUs are missing in the list of CPUs which need this
quirk. Add them so the TSC frequency can be calculated correctly.
[ tglx: Removed the silly default case as the switch() is only invoked when
cpu_khz is 0. Massaged changelog. ]
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/1474289501-31717-3-git-send-email-prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
asm/intel-family.h contains defines for cpu ids which should be used
instead of hex constants. Convert the switch case in native_calibrate_tsc()
to use the defines before adding more cpu models.
[ tglx: Massaged changelog ]
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/1474289501-31717-2-git-send-email-prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The array has a size of MAX_LOCAL_APIC, which can be as large as 32k, so it
can consume up to 128k.
The array has been there forever and was never used for anything useful
other than a version mismatch check which was introduced in 2009.
There is no reason to store the version in an array. The kernel is not
prepared to handle different APIC versions anyway, so the real important
part is to detect a version mismatch and warn about it, which can be done
with a single variable as well.
[ tglx: Massaged changelog ]
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Borislav Petkov <bp@alien8.de>
CC: Brian Gerst <brgerst@gmail.com>
CC: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/20160913181232.30815-1-dvlasenk@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The prctl code which references vdso_image_x32 is built when CONFIG_X86_X32
is set. This results in the following build failure:
LD init/built-in.o
arch/x86/built-in.o: In function `do_arch_prctl':
(.text+0x27466): undefined reference to `vdso_image_x32'
vdso_image_x32 depends on CONFIG_X86_X32_ABI. So we need to make the prctl
depend on that as well.
[ tglx: Massaged changelog ]
Fixes: 2eefd87896 ("x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*")
Signed-off-by: Vinson Lee <vlee@freedesktop.org>
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/1474073513-6656-1-git-send-email-vlee@freedesktop.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Install the callbacks via the state machine.
CPU_UP_CANCELED_FROZEN() is not preserved: It is only there to free memory in an
error case because it is assumed if the CPU does show up on resume it won't be
seen ever again. As per Borislav:
|IOW, you don't need mc_cpu_dead().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160907164523.46a2xnffha4bv63g@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
show_stack_log_lvl() and friends allow a NULL pointer for the
task_struct to indicate the current task. This creates confusion and
can cause sneaky bugs.
Instead require the caller to pass 'current' directly.
This only changes the internal workings of the dumpstack code. The
dump_trace() and show_stack() interfaces still allow a NULL task
pointer. Those interfaces should also probably be fixed as well.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This will prevent a crash if get_wchan() runs after the task stack
is freed.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/337aeca8614024aa4d8d9c81053bbf8fcffbe4ad.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Specifically, pin the stack in save_stack_trace_tsk() and
show_trace_log_lvl().
This will prevent a crash if the target task dies before or while
dumping its stack once we start freeing task stacks early.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/cf0082cde65d1941a996d026f2b2cdbfaca17bfa.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that most of the thread_info users have been cleaned up,
this is straightforward.
Most of this code was written by Linus.
Originally-from: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a50eab40abeaec9cb9a9e3cbdeafd32190206654.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because sched.h and thread_info.h are a tangled mess, I turned
in_compat_syscall() into a macro. If we had current_thread_struct()
or similar and we could use it from thread_info.h, then this would
be a bit cleaner.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ccc8a1b2f41f9c264a41f771bb4a6539a642ad72.1473801993.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
in_exception_stack() has some recursion checking which makes sure the
stack trace code never traverses a given exception stack more than once.
This prevents an infinite loop if corruption somehow causes a stack's
"next stack" pointer to point to itself (directly or indirectly).
The recursion checking can be useful for other stacks in addition to the
exception stack, so extend it to work for all stacks.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/95de5db4cfe111754845a5cef04e20630d01423f.1473905218.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an interrupt happens in entry code while running on a software IRQ
stack, and the IRQ stack was empty, regs->sp will contain the stack end
address (e.g., irq_stack_ptr). If the regs are passed to dump_trace(),
get_stack_info() will report STACK_TYPE_UNKNOWN, causing dump_trace() to
return prematurely without trying to go to the next stack.
Update the bounds checking for software interrupt stacks so that the
ending address is now considered part of the stack.
This means that it's now possible for the 'walk_stack' callbacks --
print_context_stack() and print_context_stack_bp() -- to be called with
an empty stack. But that's fine; they're already prepared to deal with
that due to their on_stack() checks.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5a5e5de92dcf11e8dc6b6e8e50ad7639d067830b.1473905218.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
valid_stack_ptr() is buggy: it assumes that all stacks are of size
THREAD_SIZE, which is not true for exception stacks. So the
walk_stack() callbacks will need to know the location of the beginning
of the stack as well as the end.
Another issue is that in general the various features of a stack (type,
size, next stack pointer, description string) are scattered around in
various places throughout the stack dump code.
Encapsulate all that information in a single place with a new stack_info
struct and a get_stack_info() interface.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/8164dd0db96b7e6a279fa17ae5e6dc375eecb4a9.1473905218.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
in_exception_stack() does some bad, bad things just so the unwinder can
print different values for different areas of the debug exception stack.
There's no need to clarify where exactly on the stack it is. Just print
"#DB" and be done with it.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e91cb410169dd576678dd427c35efb716fd0cee1.1473905218.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce new flags that defines which ABI to use on creating sigframe.
Those flags kernel will set according to sigaction syscall ABI,
which set handler for the signal being delivered.
So that will drop the dependency on TIF_IA32/TIF_X32 flags on signal deliver.
Those flags will be used only under CONFIG_COMPAT.
Similar way ARM uses sa_flags to differ in which mode deliver signal
for 26-bit applications (look at SA_THIRYTWO).
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: 0x7f454c46@gmail.com
Cc: oleg@redhat.com
Cc: linux-mm@kvack.org
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-7-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As the task isn't executing at the moment of {GET,SET}REGS,
return regset that corresponds to code selector, rather than
value of TIF_IA32 flag.
I.e. if we ptrace i386 elf binary that has just changed it's
code selector to __USER_CS, than GET_REGS will return
full x86_64 register set.
Note, that this will work only if application has changed it's CS.
If the application does 32-bit syscall with __USER_CS, ptrace
will still return 64-bit register set. Which might be still confusing
for tools that expect TS_COMPACT to be exposed [1, 2].
So this this change should make PTRACE_GETREGSET more reliable and
this will be another step to drop TIF_{IA32,X32} flags.
[1]: https://sourceforge.net/p/strace/mailman/message/30471411/
[2]: https://lkml.org/lkml/2012/1/18/320
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: oleg@redhat.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: Pedro Alves <palves@redhat.com>
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-6-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
show_stack_log_lvl() and dump_trace() are already preemption safe:
- If they're running in irq or exception context, preemption is already
disabled and the percpu stack pointers can be trusted.
- If they're running with preemption enabled, they must be running on
the task stack anyway, so it doesn't matter if they're comparing the
stack pointer against a percpu stack pointer from this CPU or another
one: either way it won't match.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a0ca0b1044eca97d4f0ec7c1619cf80b3b65560d.1473371307.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Three fixes:
- AMD microcode loading fix with randomization
- an lguest tooling fix
- and an APIC enumeration boundary condition fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Fix num_processors value in case of failure
tools/lguest: Don't bork the terminal in case of wrong args
x86/microcode/AMD: Fix load of builtin microcode with randomized memory
__show_regs() fails to dump the PKRU state when the debug registers are in
their default state because there is a return statement on the debug
register state.
Change the logic to report PKRU value even when debug registers are in
their default state.
Fixes:c0b17b5bd4b7 ("x86/mm/pkeys: Dump PKRU with other kernel registers")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20160910183045.4618-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The MCA_ADDR registers on Scalable MCA systems contain the ErrorAddr
in bits [55:0] and the least significant bit of the address in bits
[61:56]. We should extract the valid ErrorAddr bits from the MCA_ADDR
register rather than saving the raw value to struct mce.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1473275643-1721-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The MCA_SYND and MCA_IPID registers contain valuable information and
should be included in MCE output. The MCA_SYND register contains
syndrome and other error information, and the MCA_IPID register will
uniquely identify the MCA bank's type without having to rely on system
software.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472680624-34221-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The MCA_IPID register uniquely identifies a bank's type and instance
on Scalable MCA systems. We should save the value of this register
in struct mce along with the other relevant error information. This
ensures that we can decode errors without relying on system software to
correlate the bank to the type.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472680624-34221-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The Deferred Error Interrupt Type is set per bank on Scalable MCA
systems. This is done in a bitfield in the MCA_CONFIG register of each
bank. We should set its type to APIC-based interrupt and not assume BIOS
has set it for us.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472737486-1720-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Define a bank's sysfs filename based on its IP type and InstanceId.
Credits go to Aravind for:
* The general idea and proto- get_name().
* Defining smca_umc_block_names[] and buf_mcatype[].
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
Link: http://lkml.kernel.org/r/1473193490-3291-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Scalable MCA defines a number of IP types. An MCA bank on an SMCA
system is defined as one of these IP types. A bank's type is uniquely
identified by the combination of the HWID and MCATYPE values read from
its MCA_IPID register.
Add the required tables in order to be able to lookup error descriptions
based on a bank's type and the error's extended error code.
[ bp: Align comments, simplify a bit. ]
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472741832-1690-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Scalable MCA systems allow non-core MCA banks to only be accessible by
certain CPUs. The MSRs for these banks are Read-as-Zero on other CPUs.
During allocate_threshold_blocks(), get_block_address() can be scheduled
on CPUs other than the one allocating the block. This causes the MSRs to
be read on the wrong CPU and results in incorrect behavior.
Add a @cpu parameter to get_block_address() and pass this in to ensure
that the MSRs are only read on the CPU that is allocating the block.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1472673994-12235-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Syndrome information is no longer contained in MCA_STATUS for SMCA
systems but in a new register - MCA_SYND.
Add a synd field to struct mce to hold MCA_SYND register value. Add it
to the end of struct mce to maintain compatibility with old versions of
mcelog. Also, add it to the respective tracepoint.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1467633035-32080-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change MSR_IA32_MCx_MISC() macro to msr_ops.misc() because SMCA machines
define a different set of MSRs and msr_ops will give you the correct
MISC register.
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1468269447-8808-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch has no functional change; it is purely cosmetic, though
it does make it a wee bit easier to understand the code. Before, the
count of LAPICs was being stored in the variable 'x2count' and the
count of X2APICs was being stored in the variable 'count'. This
patch swaps that so that the routine acpi_parse_madt_lapic_entries()
will now consistently use x2count to refer to X2APIC info, and count
to refer to LAPIC info.
Signed-off-by: Al Stone <ahs3@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 7b02d53e7852 ("efi: Allow drivers to reserve boot services forever")
introduced a new efi_mem_reserve to reserve the boot services memory
regions forever. This reservation involves allocating a new EFI memory
range descriptor. However, allocation can only succeed if there is memory
available for the allocation. Otherwise, error such as the following may
occur:
esrt: Reserving ESRT space from 0x000000003dd6a000 to 0x000000003dd6a010.
Kernel panic - not syncing: ERROR: Failed to allocate 0x9f0 bytes below \
0x0.
CPU: 0 PID: 0 Comm: swapper Not tainted 4.7.0-rc5+ #503
0000000000000000 ffffffff81e03ce0 ffffffff8131dae8 ffffffff81bb6c50
ffffffff81e03d70 ffffffff81e03d60 ffffffff8111f4df 0000000000000018
ffffffff81e03d70 ffffffff81e03d08 00000000000009f0 00000000000009f0
Call Trace:
[<ffffffff8131dae8>] dump_stack+0x4d/0x65
[<ffffffff8111f4df>] panic+0xc5/0x206
[<ffffffff81f7c6d3>] memblock_alloc_base+0x29/0x2e
[<ffffffff81f7c6e3>] memblock_alloc+0xb/0xd
[<ffffffff81f6c86d>] efi_arch_mem_reserve+0xbc/0x134
[<ffffffff81fa3280>] efi_mem_reserve+0x2c/0x31
[<ffffffff81fa3280>] ? efi_mem_reserve+0x2c/0x31
[<ffffffff81fa40d3>] efi_esrt_init+0x19e/0x1b4
[<ffffffff81f6d2dd>] efi_init+0x398/0x44a
[<ffffffff81f5c782>] setup_arch+0x415/0xc30
[<ffffffff81f55af1>] start_kernel+0x5b/0x3ef
[<ffffffff81f55434>] x86_64_start_reservations+0x2f/0x31
[<ffffffff81f55520>] x86_64_start_kernel+0xea/0xed
---[ end Kernel panic - not syncing: ERROR: Failed to allocate 0x9f0
bytes below 0x0.
An inspection of the memblock configuration reveals that there is no memory
available for the allocation:
MEMBLOCK configuration:
memory size = 0x0 reserved size = 0x4f339c0
memory.cnt = 0x1
memory[0x0] [0x00000000000000-0xffffffffffffffff], 0x0 bytes on node 0\
flags: 0x0
reserved.cnt = 0x4
reserved[0x0] [0x0000000008c000-0x0000000008c9bf], 0x9c0 bytes flags: 0x0
reserved[0x1] [0x0000000009f000-0x000000000fffff], 0x61000 bytes\
flags: 0x0
reserved[0x2] [0x00000002800000-0x0000000394bfff], 0x114c000 bytes\
flags: 0x0
reserved[0x3] [0x000000304e4000-0x00000034269fff], 0x3d86000 bytes\
flags: 0x0
This situation can be avoided if we call efi_esrt_init after memblock has
memory regions for the allocation.
Also, the EFI ESRT driver makes use of early_memremap'pings. Therfore, we
do not want to defer efi_esrt_init for too long. We must call such function
while calls to early_memremap are still valid.
A good place to meet the two aforementioned conditions is right after
memblock_x86_fill, grouped with other EFI-related functions.
Reported-by: Scott Lawson <scott.lawson@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Peter Jones <pjones@redhat.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Both efi_find_mirror() and efi_fake_memmap() really want to know
whether the EFI memory map is available, not just whether the machine
was booted using EFI. efi_fake_memmap() even has a check for
EFI_MEMMAP at the start of the function.
Since we've already got other code that has this dependency, merge
everything under one if() conditional, and remove the now superfluous
check from efi_fake_memmap().
Tested-by: Dave Young <dyoung@redhat.com> [kexec/kdump]
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> [arm]
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
1) There is a single HPET counter shared by all the CPUs.
2) HPET counter reading is a very slow operation.
Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. Something
the performance slowdown can be so severe that the system may crash
because of a NMI watchdog soft lockup, for example.
During the TSC clock calibration process, the default clock source
will be set temporarily to HPET. For systems with many CPUs, it is
possible that NMI watchdog soft lockup may occur occasionally during
that short time period where HPET clocking is active as is shown in
the kernel log below:
[ 71.646504] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[ 71.655313] Switching to clocksource hpet
[ 95.679135] BUG: soft lockup - CPU#144 stuck for 23s! [swapper/144:0]
[ 95.693363] BUG: soft lockup - CPU#145 stuck for 23s! [swapper/145:0]
[ 95.695580] BUG: soft lockup - CPU#582 stuck for 23s! [swapper/582:0]
[ 95.698128] BUG: soft lockup - CPU#357 stuck for 23s! [swapper/357:0]
This patch addresses the above issues by reducing HPET read contention
using the fact that if more than one CPUs are trying to access HPET at
the same time, it will be more efficient when only one CPU in the group
reads the HPET counter and shares it with the rest of the group instead
of each group member trying to read the HPET counter individually.
This is done by using a combination quadword that contains a 32-bit
stored HPET value and a 32-bit spinlock. The CPU that gets the lock
will be responsible for reading the HPET counter and storing it in
the quadword. The others will monitor the change in HPET value and
lock status and grab the latest stored HPET value accordingly. This
change is only enabled on 64-bit SMP configuration.
On a 4-socket Haswell-EX box with 144 threads (HT on), running the
AIM7 compute workload (1500 users) on a 4.8-rc1 kernel (HZ=1000)
with and without the patch has the following performance numbers
(with HPET or TSC as clock source):
TSC = 1042431 jobs/min
HPET w/o patch = 798068 jobs/min
HPET with patch = 1029445 jobs/min
The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 11.19% without patch to 1.24% with patch.
[ tglx: It's really sad that we need to have such hacks just to deal with
the fact that cpu vendors have not managed to fix the TSC wreckage
within 15+ years. Were They Forgetting? ]
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Randy Wright <rwright@hpe.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1473182530-29175-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
PKRU is the register that lets you disallow writes or all access to a given
protection key.
The XSAVE hardware defines an "init state" of 0 for PKRU: its most
permissive state, allowing access/writes to everything. Since we start off
all new processes with the init state, we start all processes off with the
most permissive possible PKRU.
This is unfortunate. If a thread is clone()'d [1] before a program has
time to set PKRU to a restrictive value, that thread will be able to write
to all data, no matter what pkey is set on it. This weakens any integrity
guarantees that we want pkeys to provide.
To fix this, we define a very restrictive PKRU to override the
XSAVE-provided value when we create a new FPU context. We choose a value
that only allows access to pkey 0, which is as restrictive as we can
practically make it.
This does not cause any practical problems with applications using
protection keys because we require them to specify initial permissions for
each key when it is allocated, which override the restrictive default.
In the end, this ensures that threads which do not know how to manage their
own pkey rights can not do damage to data which is pkey-protected.
I would have thought this was a pretty contrived scenario, except that I
heard a bug report from an MPX user who was creating threads in some very
early code before main(). It may be crazy, but folks evidently _do_ it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: mgorman@techsingularity.net
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163021.F3C25D4A@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds two new system calls:
int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
int pkey_free(int pkey);
These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors. The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid. A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect().
These system calls are also very important given the kernel's use
of pkeys to implement execute-only support. These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel. The kernel does not promise to
preserve PKRU (right register) contents except for allocated
pkeys.
The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey. For
instance:
pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
will allocate 'pkey', but also sets the bits in PKRU[1] such that
writing to 'pkey' is already denied.
The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()). It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.
Any piece of userspace calling pkey_alloc() needs to be prepared
for it to fail. Why? pkey_alloc() returns the same error code
(ENOSPC) when there are no pkeys and when pkeys are unsupported.
They can be unsupported for a whole host of reasons, so apps must
be prepared for this. Also, libraries or LD_PRELOADs might steal
keys before an application gets access to them.
This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings). Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.
Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
1. PKRU is the Protection Key Rights User register. It is a
usermode-accessible register that controls whether writes
and/or access to each individual pkey is allowed or denied.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The CPPC registers can also be accessed via functional fixed hardware
addresse(FFH) in X86. Add support by modifying cpc_read and cpc_write to
be able to read/write MSRs on x86 platform on per cpu basis.
Also with this change, acpi_cppc_processor_probe doesn't bail out if
address space id is not equal to PCC or memory address space and FFH
is supported on the system.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When booting a kvm guest on AMD with the latest kernel the following
messages are displayed in the boot log:
tsc: Unable to calibrate against PIT
tsc: HPET/PMTIMER calibration failed
aa297292d7 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID")
introduced a change to account for a difference in cpu and tsc frequencies for
Intel SKL processors. Before this change the native tsc set
x86_platform.calibrate_tsc to native_calibrate_tsc() which is a hardware
calibration of the tsc, and in tsc_init() executed
tsc_khz = x86_platform.calibrate_tsc();
cpu_khz = tsc_khz;
The kvm code changed x86_platform.calibrate_tsc to kvm_get_tsc_khz() and
executed the same tsc_init() function. This meant that KVM guests did not
execute the native hardware calibration function.
After aa297292d7, there are separate native calibrations for cpu_khz and
tsc_khz. The code sets x86_platform.calibrate_tsc to native_calibrate_tsc()
which is now an Intel specific calibration function, and
x86_platform.calibrate_cpu to native_calibrate_cpu() which is the "old"
native_calibrate_tsc() function (ie, the native hardware calibration
function).
tsc_init() now does
cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
if (tsc_khz == 0)
tsc_khz = cpu_khz;
else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
cpu_khz = tsc_khz;
The kvm code should not call the hardware initialization in
native_calibrate_cpu(), as it isn't applicable for kvm and it didn't do that
prior to aa297292d7.
This patch resolves this issue by setting x86_platform.calibrate_cpu to
kvm_get_tsc_khz().
v2: I had originally set x86_platform.calibrate_cpu to
cpu_khz_from_cpuid(), however, pbonzini pointed out that the CPUID leaf
in that function is not available in KVM. I have changed the function
pointer to kvm_get_tsc_khz().
Fixes: aa297292d7 ("x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID")
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Len Brown <len.brown@intel.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The (start, size) tuple represents a range [start, start + size - 1],
which means "start" and "start + size - 1" should be compared to see
whether the range overflows.
For example, a range with (start, size):
(0xffffffff fffffff0, 0x00000000 00000010)
represents
[0xffffffff fffffff0, 0xffffffff ffffffff]
... would be judged overflow in the original code, while actually it is not.
This patch fixes this and makes sure it still works when size is zero.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/1471657213-31817-1-git-send-email-richard.weiyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The various functions involved in dumping the stack all do similar
things with regard to getting the stack pointer and the frame pointer
based on the regs and task arguments. Create helper functions to
do that instead.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f448914885a35f333fe04da1b97a6c2cc1f80974.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change printk_stack_address() to be useful when called by an unwinder
outside the context of dump_trace().
Specifically:
- printk_stack_address()'s 'data' argument is always used as the log
level string. Make that explicit.
- Call touch_nmi_watchdog().
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/9fbe0db05bacf66d337c162edbf61450d0cff1e2.1472057064.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we get a page fault indicating kernel stack overflow, invoke
handle_stack_overflow(). To prevent us from overflowing the stack
again while handling the overflow (because we are likely to have
very little stack space left), call handle_stack_overflow() on the
double-fault stack.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6d6cf96b3fb9b4c9aa303817e1dc4de0c7c36487.1472603235.git.luto@kernel.org
[ Minor edit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the topology package map check of the APIC ID and the CPU is a failure,
we don't generate the processor info for that APIC ID yet we increase
disabled_cpus by one - which is buggy.
Only increase num_processors once we are sure we don't fail.
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1473214893-16481-1-git-send-email-douly.fnst@cn.fujitsu.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As recommended by Ville Syrjala removing .is_mobile field from the
platform struct definition for vlv and hsw+ GPUs as there's no need to
make the distinction in later hardware anymore. Keep it for older GPUs
as it is still needed for ilk-ivb.
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Install the callbacks via the state machine. The online & down callbacks are
invoked on the target CPU so we can avoid using smp_call_function_single().
local_irq_disable() is used because smp_call_function_single() used to invoke
the function with interrupts disabled.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160818125731.27256-15-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We now have a better way to determine if we are running on a cpu that
supports machine check recovery. Free up this feature bit.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Boris Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/d5db39e08d46cf1012d94d3902275d08ba931926.1472754712.git.tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the mcsafe_key defined in the previous patch to make decisions on which
copy function to use. We can't use the FEATURE bit any more because PCI
quirks run too late to affect the patching of code. So we use a static key.
Turn memcpy_mcsafe() into an inline function to make life easier for
callers. The assembly code that actually does the copy is now named
memcpy_mcsafe_unrolled()
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Boris Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/bfde2fc774e94f53d91b70a4321c85a0d33e7118.1472754712.git.tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Each Xeon includes a number of capability registers in PCI space that
describe some features not enumerated by CPUID.
Use these to determine that we are running on a model that can recover from
machine checks. Hooks for Ivybridge ... Skylake provided.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Boris Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/abf331dc4a3e2a2d17444129bc51127437bcf4ba.1472754711.git.tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We do not need to add the randomization offset when the microcode is
built in.
Reported-and-tested-by: Emanuel Czirai <icanrealizeum@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20160904093736.GA11939@pd.tnic
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fix from Thomas Gleixner:
"A single fix for an AMD erratum so machines without a BIOS fix work"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/AMD: Apply erratum 665 on machines without a BIOS fix
AMD F12h machines have an erratum which can cause DIV/IDIV to behave
unpredictably. The workaround is to set MSRC001_1029[31] but sometimes
there is no BIOS update containing that workaround so let's do it
ourselves unconditionally. It is simple enough.
[ Borislav: Wrote commit message. ]
Signed-off-by: Emanuel Czirai <icanrealizeum@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Yaowu Xu <yaowu@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20160902053550.18097-1-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
after enabling function tracer. I asked him to bisect the functions within
available_filter_functions, which he did and it came down to three:
_paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()
It was found that this is only an issue when noreplace-paravirt is added
to the kernel command line.
This means that those functions are most likely called within critical
sections of the funtion tracer, and must not be traced.
In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
longer an issue. But both _paravirt_ident_{32,64}() causes the
following splat when they are traced:
mm/pgtable-generic.c:33: bad pmd ffff8800d2435150(0000000001d00054)
mm/pgtable-generic.c:33: bad pmd ffff8800d3624190(0000000001d00070)
mm/pgtable-generic.c:33: bad pmd ffff8800d36a5110(0000000001d00054)
mm/pgtable-generic.c:33: bad pmd ffff880118eb1450(0000000001d00054)
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
Modules linked in: e1000e
CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
task: ffff880118f740c0 ti: ffff8800d4aec000 task.ti: ffff8800d4aec000
RIP: 0010:[<ffffffff81134148>] [<ffffffff81134148>] queued_spin_lock_slowpath+0x118/0x1a0
RSP: 0018:ffff8800d4aefb90 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011eb16d40
RDX: ffffffff82485760 RSI: 000000001f288820 RDI: ffffea0000008030
RBP: ffff8800d4aefb90 R08: 00000000000c0000 R09: 0000000000000000
R10: ffffffff821c8e0e R11: 0000000000000000 R12: ffff880000200fb8
R13: 00007f7a4e3f7000 R14: ffffea000303f600 R15: ffff8800d4b562e0
FS: 00007f7a4e3d7840(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7a4e3f7000 CR3: 00000000d3e71000 CR4: 00000000001406e0
Call Trace:
_raw_spin_lock+0x27/0x30
handle_pte_fault+0x13db/0x16b0
handle_mm_fault+0x312/0x670
__do_page_fault+0x1b1/0x4e0
do_page_fault+0x22/0x30
page_fault+0x28/0x30
__vfs_read+0x28/0xe0
vfs_read+0x86/0x130
SyS_read+0x46/0xa0
entry_SYSCALL_64_fastpath+0x1e/0xa8
Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
Reported-by: Łukasz Daniluk <lukasz.daniluk@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
thread_saved_pc() was using a completely bogus method to get the return
address. Since switch_to() was previously inlined, there was no sane way
to know where on the stack the return address was stored. Now with the
frame of a sleeping thread well defined, this can be implemented correctly.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-7-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of setting up a fake pt_regs context, put the kernel thread
function pointer and arg into the unused callee-restored registers
of 'struct fork_frame'.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-6-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the low-level context switch code to an out-of-line asm stub instead of
using complex inline asm. This allows constructing a new stack frame for the
child process to make it seamlessly flow to ret_from_fork without an extra
test and branch in __switch_to(). It also improves code generation for
__schedule() by using the C calling convention instead of clobbering all
registers.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-5-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add 'struct inactive_task_frame', which defines the layout of the stack for
a sleeping process. For now, the only defined field is the BP register
(frame pointer).
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-4-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
switch_to() no longer saves EFLAGS, so it's bogus to look for it on the
stack. Set it to zero like 32-bit.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-3-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Match 64-bit and set gdb_regs[GDB_PC] to zero. thread.ip is always the
same point in the scheduler (except for newly forked processes), and will
be removed in a future patch.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-2-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When function graph tracing is enabled, print_context_stack_bp() can
report return_to_handler() as an unreliable address, which is confusing
and misleading: return_to_handler() is really only useful as a hint for
debugging, whereas print_context_stack_bp() users only care about the
actual 'reliable' call path.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c51aef578d8027791b38d2ad9bac0c7f499fde91.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When function graph tracing is enabled for a function, its return
address on the stack is replaced with the address of an ftrace handler
(return_to_handler).
Currently 'return_to_handler' can be reported as reliable. That's not
ideal, and can actually be misleading. When saving or dumping the
stack, you normally only care about what led up to that point (the call
path), rather than what will happen in the future (the return path).
That's especially true in the non-oops stack trace case, which isn't
used for debugging. For example, in a perf profiling operation,
reporting return_to_handler() in the trace would just be confusing.
And in the oops case, where debugging is important, "unreliable" is also
more appropriate there because it serves as a hint that graph tracing
was involved, instead of trying to imply that return_to_handler() was
the real caller.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f8af15749c7d632d3e7f815995831d5b7f82950d.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the more reliable version of ftrace_graph_ret_addr() so we no longer
have to worry about the unwinder getting out of sync with the function
graph ret_stack index, which can happen if the unwinder skips any frames
before calling ftrace_graph_ret_addr().
This fixes this issue (and several others like it):
$ cat /proc/self/stack
[<ffffffff810489a2>] save_stack_trace_tsk+0x22/0x40
[<ffffffff81311a89>] proc_pid_stack+0xb9/0x110
[<ffffffff813127c4>] proc_single_show+0x54/0x80
[<ffffffff812be088>] seq_read+0x108/0x3e0
[<ffffffff812923d7>] __vfs_read+0x37/0x140
[<ffffffff812929d9>] vfs_read+0x99/0x140
[<ffffffff81293f28>] SyS_read+0x58/0xc0
[<ffffffff818af97c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
[<ffffffffffffffff>] 0xffffffffffffffff
$ echo function_graph > /sys/kernel/debug/tracing/current_tracer
$ cat /proc/self/stack
[<ffffffff818b2428>] return_to_handler+0x0/0x27
[<ffffffff810394cc>] print_context_stack+0xfc/0x100
[<ffffffff818b2428>] return_to_handler+0x0/0x27
[<ffffffff8103891b>] dump_trace+0x12b/0x350
[<ffffffff818b2428>] return_to_handler+0x0/0x27
[<ffffffff810489a2>] save_stack_trace_tsk+0x22/0x40
[<ffffffff818b2428>] return_to_handler+0x0/0x27
[<ffffffff81311a89>] proc_pid_stack+0xb9/0x110
[<ffffffff818b2428>] return_to_handler+0x0/0x27
[<ffffffff813127c4>] proc_single_show+0x54/0x80
[<ffffffff818b2428>] return_to_handler+0x0/0x27
[<ffffffff812be088>] seq_read+0x108/0x3e0
[<ffffffff818b2428>] return_to_handler+0x0/0x27
[<ffffffff812923d7>] __vfs_read+0x37/0x140
[<ffffffff818b2428>] return_to_handler+0x0/0x27
[<ffffffff812929d9>] vfs_read+0x99/0x140
[<ffffffffffffffff>] 0xffffffffffffffff
Enabling function graph tracing causes the stack trace to change in two
ways:
First, the real call addresses are confusingly interspersed with
'return_to_handler' addresses. This issue will be fixed by the next
patch.
Second, the stack trace is offset by two frames, because the unwinder
skipped the first two frames and got out of sync with the ret_stack
index. This patch fixes this issue.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a6d623e36f8d08f9a17bd74d804d201177a23afd.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Storing this value will help prevent unwinders from getting out of sync
with the function graph tracer ret_stack. Now instead of needing a
stateful iterator, they can compare the return address pointer to find
the right ret_stack entry.
Note that an array of 50 ftrace_ret_stack structs is allocated for every
task. So when an arch implements this, it will add either 200 or 400
bytes of memory usage per task (depending on whether it's a 32-bit or
64-bit platform).
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a95cfcc39e8f26b89a430c56926af0bb217bc0a1.1471607358.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This allows x86_64 kernels to enable vmapped stacks by setting
HAVE_ARCH_VMAP_STACK=y - which enables the CONFIG_VMAP_STACK=y
high level Kconfig option.
There are a couple of interesting bits:
First, x86 lazily faults in top-level paging entries for the vmalloc
area. This won't work if we get a page fault while trying to access
the stack: the CPU will promote it to a double-fault and we'll die.
To avoid this problem, probe the new stack when switching stacks and
forcibly populate the pgd entry for the stack when switching mms.
Second, once we have guard pages around the stack, we'll want to
detect and handle stack overflow.
I didn't enable it on x86_32. We'd need to rework the double-fault
code a bit and I'm concerned about running out of vmalloc virtual
addresses under some workloads.
This patch, by itself, will behave somewhat erratically when the
stack overflows while RSP is still more than a few tens of bytes
above the bottom of the stack. Specifically, we'll get #PF and make
it to no_context and them oops without reliably triggering a
double-fault, and no_context doesn't know about stack overflows.
The next patch will improve that case.
Thank you to Nadav and Brian for helping me pay enough attention to
the SDM to hopefully get this right.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c88f3e2920b18e6cc621d772a04a62c06869037e.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is not a bugfix, but code optimization.
If the BSP's APIC ID in local APIC is unexpected,
a kernel panic will occur and the system will halt.
That means no need to enable APIC mode, and no reason
to set up the default routing for APIC.
The combination of default_setup_apic_routing() and
apic_bsp_setup() are used to enable APIC mode.
They two should be kept together, rather than being
separated by the codes of checking APIC ID.
Just like their usage in APIC_init_uniprocessor().
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/1471576957-12961-1-git-send-email-weijg.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment probably meant some old AMD64 incarnation which most likely
never saw the light of day. STAR and LSTAR are two different registers
and STAR sets CS/SS(DS) selectors for *all* modes, not only 32-bit.
So simply remove that comment.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160823172356.15879-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>