An older PVHVM guest (v3.0 based) crashed during vCPU hot-plug with:
kernel BUG at drivers/xen/events.c:1328!
RCU has detected that a CPU has not entered a quiescent state within the
grace period. It needs to send the CPU a reschedule IPI if it is not
offline. rcu_implicit_offline_qs() does this check:
/*
* If the CPU is offline, it is in a quiescent state. We can
* trust its state not to change because interrupts are disabled.
*/
if (cpu_is_offline(rdp->cpu)) {
rdp->offline_fqs++;
return 1;
}
Else the CPU is online. Send it a reschedule IPI.
The CPU is in the middle of being hot-plugged and has been marked online
(!cpu_is_offline()). See start_secondary():
set_cpu_online(smp_processor_id(), true);
...
per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
start_secondary() then waits for the CPU bringing up the hot-plugged CPU to
mark it as active:
/*
* Wait until the cpu which brought this one up marked it
* online before enabling interrupts. If we don't do that then
* we can end up waking up the softirq thread before this cpu
* reached the active state, which makes the scheduler unhappy
* and schedule the softirq thread on the wrong cpu. This is
* only observable with forced threaded interrupts, but in
* theory it could also happen w/o them. It's just way harder
* to achieve.
*/
while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
cpu_relax();
/* enable local interrupts */
local_irq_enable();
The CPU being hot-plugged will be marked active after it has been fully
initialized by the CPU managing the hot-plug. In the Xen PVHVM case
xen_smp_intr_init() is called to set up the hot-plugged vCPU's
XEN_RESCHEDULE_VECTOR.
The hot-plugging CPU is marked online, not marked active and does not have
its IPI vectors set up. rcu_implicit_offline_qs() sees the hot-plugging
cpu is !cpu_is_offline() and tries to send it a reschedule IPI:
This will lead to:
kernel BUG at drivers/xen/events.c:1328!
xen_send_IPI_one()
xen_smp_send_reschedule()
rcu_implicit_offline_qs()
rcu_implicit_dynticks_qs()
force_qs_rnp()
force_quiescent_state()
__rcu_process_callbacks()
rcu_process_callbacks()
__do_softirq()
call_softirq()
do_softirq()
irq_exit()
xen_evtchn_do_upcall()
because xen_send_IPI_one() will attempt to use an uninitialized IRQ for
the XEN_RESCHEDULE_VECTOR.
There is at least one other place that has caused the same crash:
xen_smp_send_reschedule()
wake_up_idle_cpu()
add_timer_on()
clocksource_watchdog()
call_timer_fn()
run_timer_softirq()
__do_softirq()
call_softirq()
do_softirq()
irq_exit()
xen_evtchn_do_upcall()
xen_hvm_callback_vector()
clocksource_watchdog() uses cpu_online_mask to pick the next CPU to handle
a watchdog timer:
/*
* Cycle through CPUs to check if the CPUs stay synchronized
* to each other.
*/
next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
next_cpu = cpumask_first(cpu_online_mask);
watchdog_timer.expires += WATCHDOG_INTERVAL;
add_timer_on(&watchdog_timer, next_cpu);
This resulted in an attempt to send an IPI to a hot-plugging CPU that
had not initialized its reschedule vector. One option would be to make
the RCU code check to not check for CPU offline but for CPU active.
As becoming active is done after a CPU is online (in older kernels).
But Srivatsa pointed out that "the cpu_active vs cpu_online ordering has been
completely reworked - in the online path, cpu_active is set *before* cpu_online,
and also, in the cpu offline path, the cpu_active bit is reset in the CPU_DYING
notification instead of CPU_DOWN_PREPARE." Drilling in this the bring-up
path: "[brought up CPU].. send out a CPU_STARTING notification, and in response
to that, the scheduler sets the CPU in the cpu_active_mask. Again, this mask
is better left to the scheduler alone, since it has the intelligence to use it
judiciously."
The conclusion was that:
"
1. At the IPI sender side:
It is incorrect to send an IPI to an offline CPU (cpu not present in
the cpu_online_mask). There are numerous places where we check this
and warn/complain.
2. At the IPI receiver side:
It is incorrect to let the world know of our presence (by setting
ourselves in global bitmasks) until our initialization steps are complete
to such an extent that we can handle the consequences (such as
receiving interrupts without crashing the sender etc.)
" (from Srivatsa)
As the native code enables the interrupts at some point we need to be
able to service them. In other words a CPU must have valid IPI vectors
if it has been marked online.
It doesn't need to handle the IPI (interrupts may be disabled) but needs
to have valid IPI vectors because another CPU may find it in cpu_online_mask
and attempt to send it an IPI.
This patch will change the order of the Xen vCPU bring-up functions so that
Xen vectors have been set up before start_secondary() is called.
It also will not continue to bring up a Xen vCPU if xen_smp_intr_init() fails
to initialize it.
Orabug 13823853
Signed-off-by Chuck Anderson <chuck.anderson@oracle.com>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If there are UNUSABLE regions in the machine memory map, dom0 will
attempt to map them 1:1 which is not permitted by Xen and the kernel
will crash.
There isn't anything interesting in the UNUSABLE region that the dom0
kernel needs access to so we can avoid making the 1:1 mapping and
treat it as RAM.
We only do this for dom0, as that is where tboot case shows up.
A PV domU could have an UNUSABLE region in its pseudo-physical map
and would need to be handled in another patch.
This fixes a boot failure on hosts with tboot.
tboot marks a region in the e820 map as unusable and the dom0 kernel
would attempt to map this region and Xen does not permit unusable
regions to be mapped by guests.
(XEN) 0000000000000000 - 0000000000060000 (usable)
(XEN) 0000000000060000 - 0000000000068000 (reserved)
(XEN) 0000000000068000 - 000000000009e000 (usable)
(XEN) 0000000000100000 - 0000000000800000 (usable)
(XEN) 0000000000800000 - 0000000000972000 (unusable)
tboot marked this region as unusable.
(XEN) 0000000000972000 - 00000000cf200000 (usable)
(XEN) 00000000cf200000 - 00000000cf38f000 (reserved)
(XEN) 00000000cf38f000 - 00000000cf3ce000 (ACPI data)
(XEN) 00000000cf3ce000 - 00000000d0000000 (reserved)
(XEN) 00000000e0000000 - 00000000f0000000 (reserved)
(XEN) 00000000fe000000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000000630000000 (usable)
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
[v1: Altered the patch and description with domU's with UNUSABLE regions]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
... because the "clock_event_device framework" already accounts for idle
time through the "event_handler" function pointer in
xen_timer_interrupt().
The patch is intended as the completion of [1]. It should fix the double
idle times seen in PV guests' /proc/stat [2]. It should be orthogonal to
stolen time accounting (the removed code seems to be isolated).
The approach may be completely misguided.
[1] https://lkml.org/lkml/2011/10/6/10
[2] http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01068.html
John took the time to retest this patch on top of v3.10 and reported:
"idle time is correctly incremented for pv and hvm for the normal
case, nohz=off and nohz=idle." so lets put this patch in.
CC: stable@vger.kernel.org
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If the per-cpu time data structure has been onlined already and
we are trying to online it again, then free the previous copy
before blindly over-writting it.
A developer naturally should not call this function multiple times
but just in case.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We don't check whether the per_cpu data structure has actually
been freed in the past. This checks it and if it has been freed
in the past then just continues on without double-freeing.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We don't do any code movement. We just encapsulate the struct clock_event_device
in a new structure which contains said structure and a pointer to
a char *name. The 'name' will be used in 'xen/time: Don't leak interrupt
name when offlining'.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When the user does:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
kmemleak reports:
kmemleak: 7 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
unreferenced object 0xffff88003fa51260 (size 32):
comm "swapper/0", pid 1, jiffies 4294667339 (age 1027.789s)
hex dump (first 32 bytes):
73 70 69 6e 6c 6f 63 6b 31 00 00 00 00 00 00 00 spinlock1.......
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff81660721>] kmemleak_alloc+0x21/0x50
[<ffffffff81190aac>] __kmalloc_track_caller+0xec/0x2a0
[<ffffffff812fe1bb>] kvasprintf+0x5b/0x90
[<ffffffff812fe228>] kasprintf+0x38/0x40
[<ffffffff81663789>] xen_init_lock_cpu+0x61/0xbe
[<ffffffff816633a6>] xen_cpu_up+0x66/0x3e8
[<ffffffff8166bbf5>] _cpu_up+0xd1/0x14b
[<ffffffff8166bd48>] cpu_up+0xd9/0xec
[<ffffffff81ae6e4a>] smp_init+0x4b/0xa3
[<ffffffff81ac4981>] kernel_init_freeable+0xdb/0x1e6
[<ffffffff8165ce39>] kernel_init+0x9/0xf0
[<ffffffff8167edfc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
Instead of doing it like the "xen/smp: Don't leak interrupt name when offlining"
patch did (which has a per-cpu structure which contains both the
IRQ number and char*) we use a per-cpu pointers to a *char.
The reason is that the "__this_cpu_read(lock_kicker_irq);" macro
blows up with "__bad_size_call_parameter()" as the size of the
returned structure is not within the parameters of what it expects
and optimizes for.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When we free it we want to make sure to set it to a default
value of -1 so that we don't double-free it (in case somebody
calls us twice).
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch adds a new structure to contain the common two things
that each of the per-cpu interrupts need:
- an interrupt number,
- and the name of the interrupt (to be added in 'xen/smp: Don't leak
interrupt name when offlining').
This allows us to carry the tuple of the per-cpu interrupt data structure
and expand it as we need in the future.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
There are two functions that do a bunch of 'free_irq' on
the per_cpu IRQ. Instead of having duplicate code just move
it to one function.
This is just code movement.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Commit f447d56d36 introduced the
implementation of the PV apic ipi interface. But there were some
odd things (it seems none of which cause really any issue but
maybe they should be cleaned up anyway):
- xen_send_IPI_mask_allbutself (and by that xen_send_IPI_allbutself)
ignore the passed in vector and only use the CALL_FUNCTION_SINGLE
vector. While xen_send_IPI_all and xen_send_IPI_mask use the vector.
- physflat_send_IPI_allbutself is declared unnecessarily. It is never
used.
This patch tries to clean up those things.
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
- More fixes in the vCPU PVHVM hotplug path.
- Add more documentation.
- Fix various ARM related issues in the Xen generic drivers.
- Updates in the xen-pciback driver per Bjorn's updates.
- Mask the x2APIC feature for PV guests.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQEcBAABAgAGBQJRjPL5AAoJEFjIrFwIi8fJdlIIANXawH+B+aFbqsFSKOOh76XN
smgICU78SVzKpW9WAPYK7YFqSdNN4AleGC2Mn2lSkiaqgciRyDb9Yt+OSMMts2Xn
ZVbFkGhEKR+DtZfTKo9YgsGatul/McTiVEkuuli+aN5dql3WXDLAaA+/b9bO3ohh
TCWtWNuSCGmlfDoJET2je+J6CgKvCErH3fvzKNxgYxytcGhAvxoVK/lC4d3pnq/m
wQUAIcF8XYENqC2m1WDR0OGveAB0Me0j9g+UkQS+TzqA8GPmxC4aptjkroFYhOz6
8nZp+LanimmTI6olVioWEXkCr5+dxb058jQKwncQfonFpl58RS0qUrz5zoe3etU=
=h9SX
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.10-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
- More fixes in the vCPU PVHVM hotplug path.
- Add more documentation.
- Fix various ARM related issues in the Xen generic drivers.
- Updates in the xen-pciback driver per Bjorn's updates.
- Mask the x2APIC feature for PV guests.
* tag 'stable/for-linus-3.10-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/pci: Used cached MSI-X capability offset
xen/pci: Use PCI_MSIX_TABLE_BIR, not PCI_MSIX_FLAGS_BIRMASK
xen: clear IRQ_NOAUTOEN and IRQ_NOREQUEST
xen: mask x2APIC feature in PV
xen: SWIOTLB is only used on x86
xen/spinlock: Fix check from greater than to be also be greater or equal to.
xen/smp/pvhvm: Don't point per_cpu(xen_vpcu, 33 and larger) to shared_info
xen/vcpu: Document the xen_vcpu_info and xen_vcpu
xen/vcpu/pvhvm: Fix vcpu hotplugging hanging.
During review of git commit cb9c6f15f3
("xen/spinlock: Check against default value of -1 for IRQ line.")
Stefano pointed out a bug in the patch. Unfortunatly due to vacation
timing the fix was not applied and this patch fixes it up.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
As it will point to some data, but not event channel data (the
shared_info has an array limited to 32).
This means that for PVHVM guests with more than 32 VCPUs without
the usage of VCPUOP_register_info any interrupts to VCPUs
larger than 32 would have gone unnoticed during early bootup.
That is OK, as during early bootup, in smp_init we end up calling
the hotplug mechanism (xen_hvm_cpu_notify) which makes the
VCPUOP_register_vcpu_info call for all VCPUs and we can receive
interrupts on VCPUs 33 and further.
This is just a cleanup.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
They are important structures and it is not clear at first
look what they are for.
The xen_vcpu is a pointer. By default it points to the shared_info
structure (at the CPU offset location). However if the
VCPUOP_register_vcpu_info hypercall is implemented we can make the
xen_vcpu pointer point to a per-CPU location.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v1: Added comments from Ian Campbell]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If a user did:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
we would (this a build with DEBUG enabled) get to:
smpboot: ++++++++++++++++++++=_---CPU UP 1
.. snip..
smpboot: Stack at about ffff880074c0ff44
smpboot: CPU1: has booted.
and hang. The RCU mechanism would kick in an try to IPI the CPU1
but the IPIs (and all other interrupts) would never arrive at the
CPU1. At first glance at least. A bit digging in the hypervisor
trace shows that (using xenanalyze):
[vla] d4v1 vec 243 injecting
0.043163027 --|x d4v1 intr_window vec 243 src 5(vector) intr f3
] 0.043163639 --|x d4v1 vmentry cycles 1468
] 0.043164913 --|x d4v1 vmexit exit_reason PENDING_INTERRUPT eip ffffffff81673254
0.043164913 --|x d4v1 inj_virq vec 243 real
[vla] d4v1 vec 243 injecting
0.043164913 --|x d4v1 intr_window vec 243 src 5(vector) intr f3
] 0.043165526 --|x d4v1 vmentry cycles 1472
] 0.043166800 --|x d4v1 vmexit exit_reason PENDING_INTERRUPT eip ffffffff81673254
0.043166800 --|x d4v1 inj_virq vec 243 real
[vla] d4v1 vec 243 injecting
there is a pending event (subsequent debugging shows it is the IPI
from the VCPU0 when smpboot.c on VCPU1 has done
"set_cpu_online(smp_processor_id(), true)") and the guest VCPU1 is
interrupted with the callback IPI (0xf3 aka 243) which ends up calling
__xen_evtchn_do_upcall.
The __xen_evtchn_do_upcall seems to do *something* but not acknowledge
the pending events. And the moment the guest does a 'cli' (that is the
ffffffff81673254 in the log above) the hypervisor is invoked again to
inject the IPI (0xf3) to tell the guest it has pending interrupts.
This repeats itself forever.
The culprit was the per_cpu(xen_vcpu, cpu) pointer. At the bootup
we set each per_cpu(xen_vcpu, cpu) to point to the
shared_info->vcpu_info[vcpu] but later on use the VCPUOP_register_vcpu_info
to register per-CPU structures (xen_vcpu_setup).
This is used to allow events for more than 32 VCPUs and for performance
optimizations reasons.
When the user performs the VCPU hotplug we end up calling the
the xen_vcpu_setup once more. We make the hypercall which returns
-EINVAL as it does not allow multiple registration calls (and
already has re-assigned where the events are being set). We pick
the fallback case and set per_cpu(xen_vcpu, cpu) to point to the
shared_info->vcpu_info[vcpu] (which is a good fallback during bootup).
However the hypervisor is still setting events in the register
per-cpu structure (per_cpu(xen_vcpu_info, cpu)).
As such when the events are set by the hypervisor (such as timer one),
and when we iterate in __xen_evtchn_do_upcall we end up reading stale
events from the shared_info->vcpu_info[vcpu] instead of the
per_cpu(xen_vcpu_info, cpu) structures. Hence we never acknowledge the
events that the hypervisor has set and the hypervisor keeps on reminding
us to ack the events which we never do.
The fix is simple. Don't on the second time when xen_vcpu_setup is
called over-write the per_cpu(xen_vcpu, cpu) if it points to
per_cpu(xen_vcpu_info).
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull x86 paravirt update from Ingo Molnar:
"Various paravirtualization related changes - the biggest one makes
guest support optional via CONFIG_HYPERVISOR_GUEST"
* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, wakeup, sleep: Use pvops functions for changing GDT entries
x86, xen, gdt: Remove the pvops variant of store_gdt.
x86-32, gdt: Store/load GDT for ACPI S3 or hibernation/resume path is not needed
x86-64, gdt: Store/load GDT for ACPI S3 or hibernate/resume path is not needed.
x86: Make Linux guest support optional
x86, Kconfig: Move PARAVIRT_DEBUG into the paravirt menu
Pull perparatory x86 kasrl changes from Ingo Molnar:
"This contains changes from the ongoing KASLR work, by Kees Cook.
The main changes are the use of a read-only IDT on x86 (which
decouples the userspace visible virtual IDT address from the physical
address), and a rework of ELF relocation support, in preparation of
random, boot-time kernel image relocation."
* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, relocs: Refactor the relocs tool to merge 32- and 64-bit ELF
x86, relocs: Build separate 32/64-bit tools
x86, relocs: Add 64-bit ELF support to relocs tool
x86, relocs: Consolidate processing logic
x86, relocs: Generalize ELF structure names
x86: Use a read-only IDT alias on all CPUs
Pull SMP/hotplug changes from Ingo Molnar:
"This is a pretty large, multi-arch series unifying and generalizing
the various disjunct pieces of idle routines that architectures have
historically copied from each other and have grown in random, wildly
inconsistent and sometimes buggy directions:
101 files changed, 455 insertions(+), 1328 deletions(-)
this went through a number of review and test iterations before it was
committed, it was tested on various architectures, was exposed to
linux-next for quite some time - nevertheless it might cause problems
on architectures that don't read the mailing lists and don't regularly
test linux-next.
This cat herding excercise was motivated by the -rt kernel, and was
brought to you by Thomas "the Whip" Gleixner."
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
idle: Remove GENERIC_IDLE_LOOP config switch
um: Use generic idle loop
ia64: Make sure interrupts enabled when we "safe_halt()"
sparc: Use generic idle loop
idle: Remove unused ARCH_HAS_DEFAULT_IDLE
bfin: Fix typo in arch_cpu_idle()
xtensa: Use generic idle loop
x86: Use generic idle loop
unicore: Use generic idle loop
tile: Use generic idle loop
tile: Enter idle with preemption disabled
sh: Use generic idle loop
score: Use generic idle loop
s390: Use generic idle loop
powerpc: Use generic idle loop
parisc: Use generic idle loop
openrisc: Use generic idle loop
mn10300: Use generic idle loop
mips: Use generic idle loop
microblaze: Use generic idle loop
...
- Populate the boot_params with EDD data.
- Cleanups in the IRQ code.
Bug-fixes:
- CPU hotplug offline/online in PVHVM mode.
- Re-upload processor PM data after ACPI S3 suspend/resume cycle.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQEcBAABAgAGBQJRcWFDAAoJEFjIrFwIi8fJQScIAJmnsPzoqd/USahqgbOvc0Nb
fJgzvno0Lfuvi5NipMpMDNCpCxRuikoJe6ocoF6E0+poucckRGDFSC3R2KVgLR2O
pXkO7bzggkUYkLehW7gXmM4IlvhLxnsfEofDWxG2VJy6RfWxFk+84v/uimQSIB0D
jI9FB3oUhfA+IT8/3Iofyv2OiPH39zNLlzidAnVXmJ8SoJFDLt3l1IymYqVdq6Lt
AKL0IXa31f+yfj9Vli1mkBsxuEIvIo5tVQNn250B4cMlR8mcWsQBnxIeA2M+8Jhl
d2ereTBwhjohCrFR/TWlXYrMlL+XVucI7J7agf3tF4kmDUqjg+c9hUkszSOPjEc=
=yJWr
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen updates from Konrad Rzeszutek Wilk:
"Features:
- Populate the boot_params with EDD data.
- Cleanups in the IRQ code.
Bug-fixes:
- CPU hotplug offline/online in PVHVM mode.
- Re-upload processor PM data after ACPI S3 suspend/resume cycle."
And Konrad gets a gold star for sending the pull request early when he
thought he'd be away for the first week of the merge window (but because
of 3.9 dragging out to -rc8 he then re-sent the reminder on the first
day of the merge window anyway)
* tag 'stable/for-linus-3.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: resolve section mismatch warnings in xen-acpi-processor
xen: Re-upload processor PM data to hypervisor after S3 resume (v2)
xen/smp: Unifiy some of the PVs and PVHVM offline CPU path
xen/smp/pvhvm: Don't initialize IRQ_WORKER as we are using the native one.
xen/spinlock: Disable IRQ spinlock (PV) allocation on PVHVM
xen/spinlock: Check against default value of -1 for IRQ line.
xen/time: Add default value of -1 for IRQ and check for that.
xen/events: Check that IRQ value passed in is valid.
xen/time: Fix kasprintf splat when allocating timer%d IRQ line.
xen/smp/spinlock: Fix leakage of the spinlock interrupt line for every CPU online/offline
xen/smp: Fix leakage of timer interrupt line for every CPU online/offline.
xen kconfig: fix select INPUT_XEN_KBDDEV_FRONTEND
xen: drop tracking of IRQ vector
x86/xen: populate boot_params with EDD data
There is no need to use the PV version of the IRQ_WORKER mechanism
as under PVHVM we are using the native version. The native
version is using the SMP API.
They just sit around unused:
69: 0 0 xen-percpu-ipi irqwork0
83: 0 0 xen-percpu-ipi irqwork1
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
See git commit f10cd522c5
(xen: disable PV spinlocks on HVM) for details.
But we did not disable it everywhere - which means that when
we boot as PVHVM we end up allocating per-CPU irq line for
spinlock. This fixes that.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The default (uninitialized) value of the IRQ line is -1.
Check if we already have allocated an spinlock interrupt line
and if somebody is trying to do it again. Also set it to -1
when we offline the CPU.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If the timer interrupt has been de-init or is just now being
initialized, the default value of -1 should be preset as
interrupt line. Check for that and if something is odd
WARN us.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When we online the CPU, we get this splat:
smpboot: Booting Node 0 Processor 1 APIC 0x2
installing Xen timer for CPU 1
BUG: sleeping function called from invalid context at /home/konrad/ssd/konrad/linux/mm/slab.c:3179
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/1
Pid: 0, comm: swapper/1 Not tainted 3.9.0-rc6upstream-00001-g3884fad #1
Call Trace:
[<ffffffff810c1fea>] __might_sleep+0xda/0x100
[<ffffffff81194617>] __kmalloc_track_caller+0x1e7/0x2c0
[<ffffffff81303758>] ? kasprintf+0x38/0x40
[<ffffffff813036eb>] kvasprintf+0x5b/0x90
[<ffffffff81303758>] kasprintf+0x38/0x40
[<ffffffff81044510>] xen_setup_timer+0x30/0xb0
[<ffffffff810445af>] xen_hvm_setup_cpu_clockevents+0x1f/0x30
[<ffffffff81666d0a>] start_secondary+0x19c/0x1a8
The solution to that is use kasprintf in the CPU hotplug path
that 'online's the CPU. That is, do it in in xen_hvm_cpu_notify,
and remove the call to in xen_hvm_setup_cpu_clockevents.
Unfortunatly the later is not a good idea as the bootup path
does not use xen_hvm_cpu_notify so we would end up never allocating
timer%d interrupt lines when booting. As such add the check for
atomic() to continue.
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
While we don't use the spinlock interrupt line (see for details
commit f10cd522c5 -
xen: disable PV spinlocks on HVM) - we should still do the proper
init / deinit sequence. We did not do that correctly and for the
CPU init for PVHVM guest we would allocate an interrupt line - but
failed to deallocate the old interrupt line.
This resulted in leakage of an irq_desc but more importantly this splat
as we online an offlined CPU:
genirq: Flags mismatch irq 71. 0002cc20 (spinlock1) vs. 0002cc20 (spinlock1)
Pid: 2542, comm: init.late Not tainted 3.9.0-rc6upstream #1
Call Trace:
[<ffffffff811156de>] __setup_irq+0x23e/0x4a0
[<ffffffff81194191>] ? kmem_cache_alloc_trace+0x221/0x250
[<ffffffff811161bb>] request_threaded_irq+0xfb/0x160
[<ffffffff8104c6f0>] ? xen_spin_trylock+0x20/0x20
[<ffffffff813a8423>] bind_ipi_to_irqhandler+0xa3/0x160
[<ffffffff81303758>] ? kasprintf+0x38/0x40
[<ffffffff8104c6f0>] ? xen_spin_trylock+0x20/0x20
[<ffffffff810cad35>] ? update_max_interval+0x15/0x40
[<ffffffff816605db>] xen_init_lock_cpu+0x3c/0x78
[<ffffffff81660029>] xen_hvm_cpu_notify+0x29/0x33
[<ffffffff81676bdd>] notifier_call_chain+0x4d/0x70
[<ffffffff810bb2a9>] __raw_notifier_call_chain+0x9/0x10
[<ffffffff8109402b>] __cpu_notify+0x1b/0x30
[<ffffffff8166834a>] _cpu_up+0xa0/0x14b
[<ffffffff816684ce>] cpu_up+0xd9/0xec
[<ffffffff8165f754>] store_online+0x94/0xd0
[<ffffffff8141d15b>] dev_attr_store+0x1b/0x20
[<ffffffff81218f44>] sysfs_write_file+0xf4/0x170
[<ffffffff811a2864>] vfs_write+0xb4/0x130
[<ffffffff811a302a>] sys_write+0x5a/0xa0
[<ffffffff8167ada9>] system_call_fastpath+0x16/0x1b
cpu 1 spinlock event irq -16
smpboot: Booting Node 0 Processor 1 APIC 0x2
And if one looks at the /proc/interrupts right after
offlining (CPU1):
70: 0 0 xen-percpu-ipi spinlock0
71: 0 0 xen-percpu-ipi spinlock1
77: 0 0 xen-percpu-ipi spinlock2
There is the oddity of the 'spinlock1' still being present.
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In the PVHVM path when we do CPU online/offline path we would
leak the timer%d IRQ line everytime we do a offline event. The
online path (xen_hvm_setup_cpu_clockevents via
x86_cpuinit.setup_percpu_clockev) would allocate a new interrupt
line for the timer%d.
But we would still use the old interrupt line leading to:
kernel BUG at /home/konrad/ssd/konrad/linux/kernel/hrtimer.c:1261!
invalid opcode: 0000 [#1] SMP
RIP: 0010:[<ffffffff810b9e21>] [<ffffffff810b9e21>] hrtimer_interrupt+0x261/0x270
.. snip..
<IRQ>
[<ffffffff810445ef>] xen_timer_interrupt+0x2f/0x1b0
[<ffffffff81104825>] ? stop_machine_cpu_stop+0xb5/0xf0
[<ffffffff8111434c>] handle_irq_event_percpu+0x7c/0x240
[<ffffffff811175b9>] handle_percpu_irq+0x49/0x70
[<ffffffff813a74a3>] __xen_evtchn_do_upcall+0x1c3/0x2f0
[<ffffffff813a760a>] xen_evtchn_do_upcall+0x2a/0x40
[<ffffffff8167c26d>] xen_hvm_callback_vector+0x6d/0x80
<EOI>
[<ffffffff81666d01>] ? start_secondary+0x193/0x1a8
[<ffffffff81666cfd>] ? start_secondary+0x18f/0x1a8
There is also the oddity (timer1) in the /proc/interrupts after
offlining CPU1:
64: 1121 0 xen-percpu-virq timer0
78: 0 0 xen-percpu-virq timer1
84: 0 2483 xen-percpu-virq timer2
This patch fixes it.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: stable@vger.kernel.org
During early setup of a dom0 kernel, populate boot_params with the
Enhanced Disk Drive (EDD) and MBR signature data. This makes
information on the BIOS boot device available in /sys/firmware/edd/.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull x86 fixes from Ingo Molnar:
"Misc fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Flush lazy MMU when DEBUG_PAGEALLOC is set
x86/mm/cpa/selftest: Fix false positive in CPA self test
x86/mm/cpa: Convert noop to functional fix
x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates
The two use-cases where we needed to store the GDT were during ACPI S3 suspend
and resume. As the patches:
x86/gdt/i386: store/load GDT for ACPI S3 or hibernation/resume path is not needed
x86/gdt/64-bit: store/load GDT for ACPI S3 or hibernate/resume path is not needed.
have demonstrated - there are other mechanism by which the GDT is
saved and reloaded during early resume path.
Hence we do not need to worry about the pvops call-chain for saving the
GDT and can and can eliminate it. The other areas where the store_gdt is
used are never going to be hit when running under the pvops platforms.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1365194544-14648-4-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Make a copy of the IDT (as seen via the "sidt" instruction) read-only.
This primarily removes the IDT from being a target for arbitrary memory
write attacks, and has the added benefit of also not leaking the kernel
base offset, if it has been relocated.
We already did this on vendor == Intel and family == 5 because of the
F0 0F bug -- regardless of if a particular CPU had the F0 0F bug or
not. Since the workaround was so cheap, there simply was no reason to
be very specific. This patch extends the readonly alias to all CPUs,
but does not activate the #PF to #UD conversion code needed to deliver
the proper exception in the F0 0F case except on Intel family 5
processors.
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20130410192422.GA17344@www.outflux.net
Cc: Eric Northup <digitaleric@google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Invoking arch_flush_lazy_mmu_mode() results in calls to
preempt_enable()/disable() which may have performance impact.
Since lazy MMU is not used on bare metal we can patch away
arch_flush_lazy_mmu_mode() so that it is never called in such
environment.
[ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU
updates" may cause a minor performance regression on
bare metal. This patch resolves that performance regression. It is
somewhat unclear to me if this is a good -stable candidate. ]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com
Tested-by: Josh Boyer <jwboyer@redhat.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> SEE NOTE ABOVE
Occassionaly on a DL380 G4 the guest would crash quite early with this:
(XEN) d244:v0: unhandled page fault (ec=0003)
(XEN) Pagetable walk from ffffffff84dc7000:
(XEN) L4[0x1ff] = 00000000c3f18067 0000000000001789
(XEN) L3[0x1fe] = 00000000c3f14067 000000000000178d
(XEN) L2[0x026] = 00000000dc8b2067 0000000000004def
(XEN) L1[0x1c7] = 00100000dc8da067 0000000000004dc7
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 244 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-4.1.3OVM x86_64 debug=n Not tainted ]----
(XEN) CPU: 3
(XEN) RIP: e033:[<ffffffff81263f22>]
(XEN) RFLAGS: 0000000000000216 EM: 1 CONTEXT: pv guest
(XEN) rax: 0000000000000000 rbx: ffffffff81785f88 rcx: 000000000000003f
(XEN) rdx: 0000000000000000 rsi: 00000000dc8da063 rdi: ffffffff84dc7000
The offending code shows it to be a loop writting the value zero
(%rax) in the %rdi (the L4 provided by Xen) register:
0: 44 00 00 add %r8b,(%rax)
3: 31 c0 xor %eax,%eax
5: b9 40 00 00 00 mov $0x40,%ecx
a: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
11: 00 00
13: ff c9 dec %ecx
15:* 48 89 07 mov %rax,(%rdi) <-- trapping instruction
18: 48 89 47 08 mov %rax,0x8(%rdi)
1c: 48 89 47 10 mov %rax,0x10(%rdi)
which fails. xen_setup_kernel_pagetable recycles some of the Xen's
page-table entries when it has switched over to its Linux page-tables.
Right before try to clear the page, we make a hypercall to change
it from _RO to _RW and that works (otherwise we would hit an BUG()).
And the _RW flag is set for that page:
(XEN) L1[0x1c7] = 001000004885f067 0000000000004dc7
The error code is 3, so PFEC_page_present and PFEC_write_access, so page is
present (correct), and we tried to write to the page, but a violation
occurred. The one theory is that the the page entries in hardware
(which are cached) are not up to date with what we just set. Especially
as we have just done an CR3 write and flushed the multicalls.
This patch does solve the problem by flusing out the TLB page
entry after changing it from _RO to _RW and we don't hit this
issue anymore.
Fixed-Oracle-Bug: 16243091 [ON OCCASIONS VM START GOES INTO
'CRASH' STATE: CLEAR_PAGE+0X12 ON HP DL380 G4]
Reported-and-Tested-by: Saar Maoz <Saar.Maoz@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We move the setting of write_cr3 from the early bootup variant
(see git commit 0cc9129d75
"x86-64, xen, mmu: Provide an early version of write_cr3.")
to a more appropiate location.
This new location sets all of the other non-early variants
of pvops calls - and most importantly is before the
alternative_asm mechanism kicks in.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Put all config options needed to run Linux as a guest behind a
CONFIG_HYPERVISOR_GUEST menu so that they don't get built-in by default
but be selectable by the user. Also, make all units which depend on
x86_hyper, depend on this new symbol so that compilation doesn't fail
when CONFIG_HYPERVISOR_GUEST is disabled but those units assume its
presence.
Sort options in the new HYPERVISOR_GUEST menu, adapt config text and
drop redundant select.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1362428421-9244-3-git-send-email-bp@alien8.de
Cc: Dmitry Torokhov <dtor@vmware.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The git commit 8eaffa67b4
(xen/pat: Disable PAT support for now) explains in details why
we want to disable PAT for right now. However that
change was not enough and we should have also disabled
the pat_enabled value. Otherwise we end up with:
mmap-example:3481 map pfn expected mapping type write-back for
[mem 0x00010000-0x00010fff], got uncached-minus
------------[ cut here ]------------
WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0()
mem 0x00010000-0x00010fff], got uncached-minus
------------[ cut here ]------------
WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774
untrack_pfn+0xb8/0xd0()
...
Pid: 3481, comm: mmap-example Tainted: GF 3.8.0-6-generic #13-Ubuntu
Call Trace:
[<ffffffff8105879f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff810587fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8104bcc8>] untrack_pfn+0xb8/0xd0
[<ffffffff81156c1c>] unmap_single_vma+0xac/0x100
[<ffffffff81157459>] unmap_vmas+0x49/0x90
[<ffffffff8115f808>] exit_mmap+0x98/0x170
[<ffffffff810559a4>] mmput+0x64/0x100
[<ffffffff810560f5>] dup_mm+0x445/0x660
[<ffffffff81056d9f>] copy_process.part.22+0xa5f/0x1510
[<ffffffff81057931>] do_fork+0x91/0x350
[<ffffffff81057c76>] sys_clone+0x16/0x20
[<ffffffff816ccbf9>] stub_clone+0x69/0x90
[<ffffffff816cc89d>] ? system_call_fastpath+0x1a/0x1f
---[ end trace 4918cdd0a4c9fea4 ]---
(a similar message shows up if you end up launching 'mcelog')
The call chain is (as analyzed by Liu, Jinsong):
do_fork
--> copy_process
--> dup_mm
--> dup_mmap
--> copy_page_range
--> track_pfn_copy
--> reserve_pfn_range
--> line 624: flags != want_flags
It comes from different memory types of page table (_PAGE_CACHE_WB) and MTRR
(_PAGE_CACHE_UC_MINUS).
Stefan Bader dug in this deep and found out that:
"That makes it clearer as this will do
reserve_memtype(...)
--> pat_x_mtrr_type
--> mtrr_type_lookup
--> __mtrr_type_lookup
And that can return -1/0xff in case of MTRR not being enabled/initialized. Which
is not the case (given there are no messages for it in dmesg). This is not equal
to MTRR_TYPE_WRBACK and thus becomes _PAGE_CACHE_UC_MINUS.
It looks like the problem starts early in reserve_memtype:
if (!pat_enabled) {
/* This is identical to page table setting without PAT */
if (new_type) {
if (req_type == _PAGE_CACHE_WC)
*new_type = _PAGE_CACHE_UC_MINUS;
else
*new_type = req_type & _PAGE_CACHE_MASK;
}
return 0;
}
This would be what we want, that is clearing the PWT and PCD flags from the
supported flags - if pat_enabled is disabled."
This patch does that - disabling PAT.
CC: stable@vger.kernel.org # 3.3 and further
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-and-Tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
- Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor
to be aware of new CPU and new DIMMs
- Cleanups
Bug-fixes:
- Fixes a long-standing bug in the PV spinlock wherein we did not
kick VCPUs that were in a tight loop.
- Fixes in the error paths for the event channel machinery.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQEcBAABAgAGBQJRJS1kAAoJEFjIrFwIi8fJj2YIAMO3/LVUZyojX/d8U9pqrCly
lFfEF2UVjcxHJSj0ZFNXt1o3fnYP1SLRlT9u7ZLDjXf6Lmxmw6/C3Haw2wp3DfGq
yUR0G/X9CPTBEgMYDdX7bjeTjyURvZcUaFwr+qodaaeL3uXx2pW6621Sc6jRKuia
yAFVZMAKeaRrvUUIXjKHtlpRp9LKFdSztShMtYqmFvxEwrJPq2b37caKruoUCa6o
X/YO0fvE9QtYD/pG0jsghFmLh/mcr+n9IFMCUXo1Yc9FdQBExtKzABDS5jdpuFND
4aMDE3dqUmHmpbaQhRE7SdblvpyrGdQXL6FSTjvwBgISfLo847CrnRKRgPp0YeA=
=LQeU
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen update from Konrad Rzeszutek Wilk:
"This has two new ACPI drivers for Xen - a physical CPU offline/online
and a memory hotplug. The way this works is that ACPI kicks the
drivers and they make the appropiate hypercall to the hypervisor to
tell it that there is a new CPU or memory. There also some changes to
the Xen ARM ABIs and couple of fixes. One particularly nasty bug in
the Xen PV spinlock code was fixed by Stefan Bader - and has been
there since the 2.6.32!
Features:
- Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor
to be aware of new CPU and new DIMMs
- Cleanups
Bug-fixes:
- Fixes a long-standing bug in the PV spinlock wherein we did not
kick VCPUs that were in a tight loop.
- Fixes in the error paths for the event channel machinery"
Fix up a few semantic conflicts with the ACPI interface changes in
drivers/xen/xen-acpi-{cpu,mem}hotplug.c.
* tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: event channel arrays are xen_ulong_t and not unsigned long
xen: Send spinlock IPI to all waiters
xen: introduce xen_remap, use it instead of ioremap
xen: close evtchn port if binding to irq fails
xen-evtchn: correct comment and error output
xen/tmem: Add missing %s in the printk statement.
xen/acpi: move xen_acpi_get_pxm under CONFIG_XEN_DOM0
xen/acpi: ACPI cpu hotplug
xen/acpi: Move xen_acpi_get_pxm to Xen's acpi.h
xen/stub: driver for CPU hotplug
xen/acpi: ACPI memory hotplug
xen/stub: driver for memory hotplug
xen: implement updated XENMEM_add_to_physmap_range ABI
xen/smp: Move the common CPU init code a bit to prep for PVH patch.
With commit 8170e6bed4 ("x86, 64bit: Use a #PF handler to materialize
early mappings on demand") we started hitting an early bootup crash
where the Xen hypervisor would inform us that:
(XEN) d7:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffea000005b2d0:
(XEN) L4[0x1d4] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 7 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-4.2.0 x86_64 debug=n Not tainted ]----
.. that Xen was unable to context switch back to dom0.
Looking at the calling stack we find:
[<ffffffff8103feba>] xen_get_user_pgd+0x5a <--
[<ffffffff8103feba>] xen_get_user_pgd+0x5a
[<ffffffff81042d27>] xen_write_cr3+0x77
[<ffffffff81ad2d21>] init_mem_mapping+0x1f9
[<ffffffff81ac293f>] setup_arch+0x742
[<ffffffff81666d71>] printk+0x48
We are trying to figure out whether we need to up-date the user PGD as
well. Please keep in mind that under 64-bit PV guests we have a limited
amount of rings: 0 for the Hypervisor, and 1 for both the Linux kernel
and user-space. As such the Linux pvops'fied version of write_cr3
checks if it has to update the user-space cr3 as well.
That clearly is not needed during early bootup. The recent changes (see
above git commit) streamline the x86 page table allocation to be much
simpler (And also incidentally the #PF handler ends up in spirit being
similar to how the Xen toolstack sets up the initial page-tables).
The fix is to have an early-bootup version of cr3 that just loads the
kernel %cr3. The later version - which also handles user-page
modifications will be used after the initial page tables have been
setup.
[ hpa: removed a redundant #ifdef and made the new function __init.
Also note that x86-32 already has such an early xen_write_cr3. ]
Tested-by: "H. Peter Anvin" <hpa@zytor.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1361579812-23709-1-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 mm changes from Peter Anvin:
"This is a huge set of several partly interrelated (and concurrently
developed) changes, which is why the branch history is messier than
one would like.
The *really* big items are two humonguous patchsets mostly developed
by Yinghai Lu at my request, which completely revamps the way we
create initial page tables. In particular, rather than estimating how
much memory we will need for page tables and then build them into that
memory -- a calculation that has shown to be incredibly fragile -- we
now build them (on 64 bits) with the aid of a "pseudo-linear mode" --
a #PF handler which creates temporary page tables on demand.
This has several advantages:
1. It makes it much easier to support things that need access to data
very early (a followon patchset uses this to load microcode way
early in the kernel startup).
2. It allows the kernel and all the kernel data objects to be invoked
from above the 4 GB limit. This allows kdump to work on very large
systems.
3. It greatly reduces the difference between Xen and native (Xen's
equivalent of the #PF handler are the temporary page tables created
by the domain builder), eliminating a bunch of fragile hooks.
The patch series also gets us a bit closer to W^X.
Additional work in this pull is the 64-bit get_user() work which you
were also involved with, and a bunch of cleanups/speedups to
__phys_addr()/__pa()."
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits)
x86, mm: Move reserving low memory later in initialization
x86, doc: Clarify the use of asm("%edx") in uaccess.h
x86, mm: Redesign get_user with a __builtin_choose_expr hack
x86: Be consistent with data size in getuser.S
x86, mm: Use a bitfield to mask nuisance get_user() warnings
x86/kvm: Fix compile warning in kvm_register_steal_time()
x86-32: Add support for 64bit get_user()
x86-32, mm: Remove reference to alloc_remap()
x86-32, mm: Remove reference to resume_map_numa_kva()
x86-32, mm: Rip out x86_32 NUMA remapping code
x86/numa: Use __pa_nodebug() instead
x86: Don't panic if can not alloc buffer for swiotlb
mm: Add alloc_bootmem_low_pages_nopanic()
x86, 64bit, mm: hibernate use generic mapping_init
x86, 64bit, mm: Mark data/bss/brk to nx
x86: Merge early kernel reserve for 32bit and 64bit
x86: Add Crash kernel low reservation
x86, kdump: Remove crashkernel range find limit for 64bit
memblock: Add memblock_mem_size()
x86, boot: Not need to check setup_header version for setup_data
...
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
Toshi Kani, and Yinghai Lu.
- ACPI power resources handling and ACPI device PM update from
Rafael J. Wysocki.
- ACPICA update to version 20130117 from Bob Moore and Lv Zheng
with contributions from Aaron Lu, Chao Guan, Jesper Juhl, and
Tim Gardner.
- Support for Intel Lynxpoint LPSS from Mika Westerberg.
- cpuidle update from Len Brown including Intel Haswell support, C1
state for intel_idle, removal of global pm_idle.
- cpuidle fixes and cleanups from Daniel Lezcano.
- cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri
with contributions from Stratos Karafotis and Rickard Andersson.
- Intel P-states driver for Sandy Bridge processors from
Dirk Brandewie.
- cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
- cpufreq fixes related to ordering issues between acpi-cpufreq and
powernow-k8 from Borislav Petkov and Matthew Garrett.
- cpufreq support for Calxeda Highbank processors from Mark Langsdorf
and Rob Herring.
- cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
from Shawn Guo.
- cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
and Inderpal Singh.
- Support for "lightweight suspend" from Zhang Rui.
- Removal of the deprecated power trace API from Paul Gortmaker.
- Assorted updates from Andreas Fleig, Colin Ian King,
Davidlohr Bueso, Joseph Salisbury, Kees Cook, Li Fei,
Nishanth Menon, ShuoX Liu, Srinivas Pandruvada, Tejun Heo,
Thomas Renninger, and Yasuaki Ishimatsu.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJRIsArAAoJEKhOf7ml8uNsD6MP/j7C4NA+GTq6RdwoJt+Yki0K
9Ep8I4pEuRFoN/oskv24EyQhpGJIk6UxWcJ/DWFBc+1VhmKORta7k2Idv/wlJA77
s7AcDveA9xcDh+TVfbh87TeuiMSXiSdDZbiaQO+wMizWJAF3F84AnjiAqqqyQcSK
bA5/Siz/vWlt9PyYDaQtHTVE4lpvPuVcQdYewsdaH2PsmUjvIg/TUzg28CTrdyvv
eHOdBK9R0/OLQLhzRbL0VOGJ//wEl+HJRO0QEhTKPgdQ1e/VH/4Zu5WSzF8P/x4C
s2f8U4IKQqulDuDHXtpMpelFm7hRWgsOqZLkcyXLs+0dvSM9CTPO6P0ZaImxUctk
5daHWEsXUnCErDQawt1mcZP8l6qnxofMQIfLXyPVzvlSnHyToTmrtXa1v2u4AuL/
hOo4MYWsFNUmRdtGFFGlExGgEDZ4G5NwiYjRBl/6XJ3v4nhnnMbuzxP8scpoe5m1
8tjroJHZFUUs/mFU/H+oRbHzSzXPmp1sddNaTg4OpVmTn3DDh6ljnFhiItd1Ndw0
5ldVbSe6ETq5RoK0TbzvQOeVpa9F3JfqbrXLQPqfd2iz/No41LQYG1uShRYuXKuA
wfEcc+c9VMd3FILu05pGwBnU8VS9VbxTYMz7xDxg6b29Ywnb7u+Q1ycCk2gFYtkS
E2oZDuyewTJxaskzYsNr
=wijn
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
Toshi Kani, and Yinghai Lu.
- ACPI power resources handling and ACPI device PM update from Rafael
J Wysocki.
- ACPICA update to version 20130117 from Bob Moore and Lv Zheng with
contributions from Aaron Lu, Chao Guan, Jesper Juhl, and Tim Gardner.
- Support for Intel Lynxpoint LPSS from Mika Westerberg.
- cpuidle update from Len Brown including Intel Haswell support, C1
state for intel_idle, removal of global pm_idle.
- cpuidle fixes and cleanups from Daniel Lezcano.
- cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri with
contributions from Stratos Karafotis and Rickard Andersson.
- Intel P-states driver for Sandy Bridge processors from Dirk
Brandewie.
- cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
- cpufreq fixes related to ordering issues between acpi-cpufreq and
powernow-k8 from Borislav Petkov and Matthew Garrett.
- cpufreq support for Calxeda Highbank processors from Mark Langsdorf
and Rob Herring.
- cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
from Shawn Guo.
- cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
and Inderpal Singh.
- Support for "lightweight suspend" from Zhang Rui.
- Removal of the deprecated power trace API from Paul Gortmaker.
- Assorted updates from Andreas Fleig, Colin Ian King, Davidlohr Bueso,
Joseph Salisbury, Kees Cook, Li Fei, Nishanth Menon, ShuoX Liu,
Srinivas Pandruvada, Tejun Heo, Thomas Renninger, and Yasuaki
Ishimatsu.
* tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (267 commits)
PM idle: remove global declaration of pm_idle
unicore32 idle: delete stray pm_idle comment
openrisc idle: delete pm_idle
mn10300 idle: delete pm_idle
microblaze idle: delete pm_idle
m32r idle: delete pm_idle, and other dead idle code
ia64 idle: delete pm_idle
cris idle: delete idle and pm_idle
ARM64 idle: delete pm_idle
ARM idle: delete pm_idle
blackfin idle: delete pm_idle
sparc idle: rename pm_idle to sparc_idle
sh idle: rename global pm_idle to static sh_idle
x86 idle: rename global pm_idle to static x86_idle
APM idle: register apm_cpu_idle via cpuidle
cpufreq / intel_pstate: Add kernel command line option disable intel_pstate.
cpufreq / intel_pstate: Change to disallow module build
tools/power turbostat: display SMI count by default
intel_idle: export both C1 and C1E
ACPI / hotplug: Fix concurrency issues and memory leaks
...
Pull x86/apic changes from Ingo Molnar:
"Main changes:
- Multiple MSI support added to the APIC, PCI and AHCI code - acked
by all relevant maintainers, by Alexander Gordeev.
The advantage is that multiple AHCI ports can have multiple MSI
irqs assigned, and can thus spread to multiple CPUs.
[ Drivers can make use of this new facility via the
pci_enable_msi_block_auto() method ]
- x86 IOAPIC code from interrupt remapping cleanups from Joerg
Roedel:
These patches move all interrupt remapping specific checks out of
the x86 core code and replaces the respective call-sites with
function pointers. As a result the interrupt remapping code is
better abstraced from x86 core interrupt handling code.
- Various smaller improvements, fixes and cleanups."
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
x86/intel/irq_remapping: Clean up x2apic opt-out security warning mess
x86, kvm: Fix intialization warnings in kvm.c
x86, irq: Move irq_remapped out of x86 core code
x86, io_apic: Introduce eoi_ioapic_pin call-back
x86, msi: Introduce x86_msi.compose_msi_msg call-back
x86, irq: Introduce setup_remapped_irq()
x86, irq: Move irq_remapped() check into free_remapped_irq
x86, io-apic: Remove !irq_remapped() check from __target_IO_APIC_irq()
x86, io-apic: Move CONFIG_IRQ_REMAP code out of x86 core
x86, irq: Add data structure to keep AMD specific irq remapping information
x86, irq: Move irq_remapping_enabled declaration to iommu code
x86, io_apic: Remove irq_remapping_enabled check in setup_timer_IRQ0_pin
x86, io_apic: Move irq_remapping_enabled checks out of check_timer()
x86, io_apic: Convert setup_ioapic_entry to function pointer
x86, io_apic: Introduce set_affinity function pointer
x86, msi: Use IRQ remapping specific setup_msi_irqs routine
x86, hpet: Introduce x86_msi_ops.setup_hpet_msi
x86, io_apic: Introduce x86_io_apic_ops.print_entries for debugging
x86, io_apic: Introduce x86_io_apic_ops.disable()
x86, apic: Mask IO-APIC and PIC unconditionally on LAPIC resume
...
There is a loophole between Xen's current implementation of
pv-spinlocks and the scheduler. This was triggerable through
a testcase until v3.6 changed the TLB flushing code. The
problem potentially is still there just not observable in the
same way.
What could happen was (is):
1. CPU n tries to schedule task x away and goes into a slow
wait for the runq lock of CPU n-# (must be one with a lower
number).
2. CPU n-#, while processing softirqs, tries to balance domains
and goes into a slow wait for its own runq lock (for updating
some records). Since this is a spin_lock_irqsave in softirq
context, interrupts will be re-enabled for the duration of
the poll_irq hypercall used by Xen.
3. Before the runq lock of CPU n-# is unlocked, CPU n-1 receives
an interrupt (e.g. endio) and when processing the interrupt,
tries to wake up task x. But that is in schedule and still
on_cpu, so try_to_wake_up goes into a tight loop.
4. The runq lock of CPU n-# gets unlocked, but the message only
gets sent to the first waiter, which is CPU n-# and that is
busily stuck.
5. CPU n-# never returns from the nested interruption to take and
release the lock because the scheduler uses a busy wait.
And CPU n never finishes the task migration because the unlock
notification only went to CPU n-#.
To avoid this and since the unlocking code has no real sense of
which waiter is best suited to grab the lock, just send the IPI
to all of them. This causes the waiters to return from the hyper-
call (those not interrupted at least) and do active spinlocking.
BugLink: http://bugs.launchpad.net/bugs/1011792
Acked-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The PV and PVH code CPU init code share some functionality. The
PVH code ("xen/pvh: Extend vcpu_guest_context, p2m, event, and XenBus")
sets some of these up, but not all. To make it easier to read, this
patch removes the PV specific out of the generic way.
No functional change - just code movement.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v2: Fixed compile errors noticed by Fengguang Wu build system]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
with older hypervisor stacks, such as Xen 4.1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQEcBAABAgAGBQJRHZ7eAAoJEFjIrFwIi8fJZ+sH/ieMkzdBB6aqbFMcNr7mkfBo
i3swjO2JQI7REYIHfKEVoR3IgHfqKEuABdeEQrceE0XqDepFh84YiKGI2QpPRWEA
903vUV4DXVdcBrypbL45tSFZ1Jxsrzx+F7WfV/f9WHyeiwOyaZTGVQH0VuOzpcum
RvPTT7MmC7g8MJDi66SDYBaX/pBQzifQ81nMWWjXNw0w4CwWX7le1cScZEP42MR6
jTEHzYMLDojdO+2aQM5pt/0CGI5tzBHtX5nNRl6tovlPI3ckknYYx6a7RfxkfZzF
IkMIuGS32yLfsswPPIiMs47/Qgiq3BN6eSTJXMZKUwQokL9yEs8LodcnRDYfgyQ=
=fqcJ
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.8-rc7-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull xen fixes from Konrad Rzeszutek Wilk:
"Two fixes:
- A simple bug-fix for redundant NULL check.
- CVE-2013-0228/XSA-42: x86/xen: don't assume %ds is usable in
xen_iret for 32-bit PVOPS
and two reverts:
- Revert the PVonHVM kexec. The patch introduces a regression with
older hypervisor stacks, such as Xen 4.1."
* tag 'stable/for-linus-3.8-rc7-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
Revert "xen PVonHVM: use E820_Reserved area for shared_info"
Revert "xen/PVonHVM: fix compile warning in init_hvm_pv_info"
xen: remove redundant NULL check before unregister_and_remove_pcpu().
x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.
This reverts commit 9d02b43dee.
We are doing this b/c on 32-bit PVonHVM with older hypervisors
(Xen 4.1) it ends up bothing up the start_info. This is bad b/c
we use it for the time keeping, and the timekeeping code loops
forever - as the version field never changes. Olaf says to
revert it, so lets do that.
Acked-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>