With this patch, we diligently set regions that will be used by the
balloon driver to be INVALID_P2M_ENTRY and under the ownership
of the balloon driver. We are OK using the __set_phys_to_machine
as we do not expect to be allocating any P2M middle or entries pages.
The set_phys_to_machine has the side-effect of potentially allocating
new pages and we do not want that at this stage.
We can do this because xen_build_mfn_list_list will have already
allocated all such pages up to xen_max_p2m_pfn.
We also move the check for auto translated physmap down the
stack so it is present in __set_phys_to_machine.
[v2: Rebased with mmu->p2m code split]
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
PAGE_SIZE_asm, PAGE_SHIFT_asm, THREAD_SIZE_asm can be safely removed from
asm-offsets.c, and be replaced by their non-'_asm' counterparts in the code
that uses them, since the _AC macro defined in include/linux/const.h makes
PAGE_SIZE/PAGE_SHIFT/THREAD_SIZE work with as.
Signed-off-by: Stratos Psomadakis <psomas@cslab.ece.ntua.gr>
LKML-Reference: <1298666774-17646-2-git-send-email-psomas@cslab.ece.ntua.gr>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
xen_pre_device_suspend is unused on ia64.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Initialize PV spinlocks on boot CPU right after native_smp_prepare_cpus
(that switch to APIC mode and initialize APIC routing); on secondary
CPUs on CPU_UP_PREPARE.
Enable the usage of event channels to send and receive IPIs when
running as a PV on HVM guest.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
e820_table_{start|end|top}, which are used to buffer page table
allocation during early boot, are now derived from memblock and don't
have much to do with e820. Change the names so that they reflect what
they're used for.
This patch doesn't introduce any behavior change.
-v2: Ingo found that earlier patch "x86: Use early pre-allocated page
table buffer top-down" caused crash on 32bit and needed to be
dropped. This patch was updated to reflect the change.
-tj: Updated commit description.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the hypervisor argument of dom0_mem=X we iterate over the physical
(only for the initial domain) E820 and subtract the the size from each
E820_RAM region the delta so that the cumulative size of all E820_RAM regions
is equal to 'X'. This sometimes ends up with E820_RAM regions with zero size
(which are removed by e820_sanitize) and E820_RAM that are smaller
than physically.
Later on the PCI API looks at the E820 and attempts to set up an
resource region for the "PCI mem". The E820 (assume dom0_mem=1GB is
set) compared to the physical looks as so:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 0000000000097c00 (usable)
[ 0.000000] Xen: 0000000000097c00 - 0000000000100000 (reserved)
-[ 0.000000] Xen: 0000000000100000 - 00000000defafe00 (usable)
+[ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable)
[ 0.000000] Xen: 00000000defafe00 - 00000000defb1ea0 (ACPI NVS)
[ 0.000000] Xen: 00000000defb1ea0 - 00000000e0000000 (reserved)
[ 0.000000] Xen: 00000000f4000000 - 00000000f8000000 (reserved)
..
And we get
[ 0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:9efafe00)
while it should have started at e0000000 (a nice big gap up to
f4000000 exists). The "Allocating PCI" is part of the resource API.
The users that end up using those PCI I/O regions usually supply their
own BARs when calling the resource API (request_resource, or allocate_resource),
but there are exceptions which provide an empty 'struct resource' and
expect the API to provide the 'struct resource' to be populated with valid values.
The one that triggered this bug was the intel AGP driver that requested
a region for the flush page (intel_i9xx_setup_flush).
Before this patch, when running under Xen hypervisor, the 'struct resource'
returned could have (depending on the dom0_mem size) physical ranges of a 'System RAM'
instead of 'I/O' regions. This ended up with the Hypervisor failing a request
to populate PTE's with those PFNs as the domain did not have access to those
'System RAM' regions (rightly so).
After this patch, the left-over E820_RAM region from the truncation, will be
labeled as E820_UNUSABLE. The E820 will look as so:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 0000000000097c00 (usable)
[ 0.000000] Xen: 0000000000097c00 - 0000000000100000 (reserved)
-[ 0.000000] Xen: 0000000000100000 - 00000000defafe00 (usable)
+[ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable)
+[ 0.000000] Xen: 0000000040000000 - 00000000defafe00 (unusable)
[ 0.000000] Xen: 00000000defafe00 - 00000000defb1ea0 (ACPI NVS)
[ 0.000000] Xen: 00000000defb1ea0 - 00000000e0000000 (reserved)
[ 0.000000] Xen: 00000000f4000000 - 00000000f8000000 (reserved)
For more information:
http://mid.gmane.org/1A42CE6F5F474C41B63392A5F80372B2335E978C@shsmsx501.ccr.corp.intel.com
BugLink: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1726
Signed-off-by: Fengzhe Zhang <fengzhe.zhang@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Both xen_hvm_init_shared_info and xen_build_mfn_list_list can be
called at resume time as well as at start of day but only reference
__init functions (extend_brk) at start of day. Hence annotate with
__ref.
WARNING: arch/x86/built-in.o(.text+0x4f1): Section mismatch in reference
from the function xen_hvm_init_shared_info() to the function
.init.text:extend_brk()
The function xen_hvm_init_shared_info() references
the function __init extend_brk().
This is often because xen_hvm_init_shared_info lacks a __init
annotation or the annotation of extend_brk is wrong.
xen_hvm_init_shared_info calls extend_brk() iff !shared_info_page and
initialises shared_info_page with the result. This happens at start of
day only.
WARNING: arch/x86/built-in.o(.text+0x599b): Section mismatch in reference
from the function xen_build_mfn_list_list() to the function
.init.text:extend_brk()
The function xen_build_mfn_list_list() references
the function __init extend_brk().
This is often because xen_build_mfn_list_list lacks a __init
annotation or the annotation of extend_brk is wrong.
(this warning occurs multiple times)
xen_build_mfn_list_list only calls extend_brk() at boot time, while
building the initial mfn list list
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC arch/x86/xen/p2m.o
arch/x86/xen/p2m.c: In function 'm2p_remove_override':
arch/x86/xen/p2m.c:460: warning: 'address' may be used uninitialized in this function
arch/x86/xen/p2m.c: In function 'm2p_add_override':
arch/x86/xen/p2m.c:426: warning: 'address' may be used uninitialized in this function
In actual fact address is inialised in one "if (!PageHighMem(page))"
statement and used in a second and so is always initialised before
use.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
With this patch, the cpuidle driver does not load and
does not issue the mwait operations. Instead the hypervisor
is doing them (b/c we call the safe_halt pvops call).
This fixes quite a lot of bootup issues wherein the user had
to force interrupts for the continuation of the bootup.
Details are discussed in:
http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00535.html
[v2: Wrote the commit description]
Reported-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Tested-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Under Dell Inspiron 1525, and Intel SandyBridge SDP's the
BIOS e820 RAM is not page-aligned:
[ 0.000000] Xen: 0000000000100000 - 00000000df66d800 (usable)
We were not handling that and ended up setting up a pagetable
that included up to df66e000 with the disastrous effect that when
memset(NODE_DATA(nodeid), 0, sizeof(pg_data_t));
tried to clear the page it would crash at the 2K mark.
Initially reported by Michael Young @
http://lists.xensource.com/archives/html/xen-devel/2011-01/msg00108.html
The fix is to page-align the size and also take into consideration
the start of the E820 (in case that is not page-aligned either). This
fixes the bootup failure on those affected machines.
This patch is a rework of the Micheal A Young initial patch and
considers the case if the start is not page-aligned.
Reported-by: Michael A Young <m.a.young@durham.ac.uk>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michael A Young <m.a.young@durham.ac.uk>
In case the mfn_list does not have enough entries to fill
a p2m page we do not want the entries from max_pfn up to
the boundary to be filled with unknown values. Hence
set them to INVALID_P2M_ENTRY.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
After changing the p2m mapping to a tree by
commit 58e05027b5
xen: convert p2m to a 3 level tree
and trying to boot a DomU with 615MB of memory, the following crash was
observed in the dump:
kernel direct mapping tables up to 26f00000 @ 1ec4000-1fff000
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<c0107397>] xen_set_pte+0x27/0x60
*pdpt = 0000000000000000 *pde = 0000000000000000
Adding further debug statements showed that when trying to set up
pfn=0x26700 the returned mapping was invalid.
pfn=0x266ff calling set_pte(0xc1fe77f8, 0x6b3003)
pfn=0x26700 calling set_pte(0xc1fe7800, 0x3)
Although the last_pfn obtained from the startup info is 0x26700, which
should in turn not be hit, the additional 8MB which are added as extra
memory normally seem to be ok. This lead to looking into the initial
p2m tree construction, which uses the smaller value and assuming that
there is other code handling the extra memory.
When the p2m tree is set up, the leaves are directly pointed to the
array which the domain builder set up. But if the mapping is not on a
boundary that fits into one p2m page, this will result in the last leaf
being only partially valid. And as the invalid entries are not
initialized in that case, things go badly wrong.
I am trying to fix that by checking whether the current leaf is a
complete map and if not, allocate a completely new page and copy only
the valid pointers there. This may not be the most efficient or elegant
solution, but at least it seems to allow me booting DomUs with memory
assignments all over the range.
BugLink: http://bugs.launchpad.net/bugs/686692
[v2: Redid a bit of commit wording and fixed a compile warning]
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Fix sparse warning for non-ANSI function declaration:
arch/x86/xen/irq.c:129:30: warning: non-ANSI function declaration of function 'xen_init_irq_ops'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
During early boot, local IRQ is disabled until IRQ subsystem is
properly initialized. During this time, no one should enable
local IRQ and some operations which usually are not allowed with
IRQ disabled, e.g. operations which might sleep or require
communications with other processors, are allowed.
lockdep tracked this with early_boot_irqs_off/on() callbacks.
As other subsystems need this information too, move it to
init/main.c and make it generally available. While at it,
toggle the boolean to early_boot_irqs_disabled instead of
enabled so that it can be initialized with %false and %true
indicates the exceptional condition.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20110120110635.GB6036@htj.dyndns.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/p2m: Fix module linking error.
xen p2m: clear the old pte when adding a page to m2p_override
xen gntdev: use gnttab_map_refs and gnttab_unmap_refs
xen: introduce gnttab_map_refs and gnttab_unmap_refs
xen p2m: transparently change the p2m mappings in the m2p override
xen/gntdev: Fix circular locking dependency
xen/gntdev: stop using "token" argument
xen: gntdev: move use of GNTMAP_contains_pte next to the map_op
xen: add m2p override mechanism
xen: move p2m handling to separate file
xen/gntdev: add VM_PFNMAP to vma
xen/gntdev: allow usermode to map granted pages
xen: define gnttab_set_map_op/unmap_op
Fix up trivial conflict in drivers/xen/Kconfig
When adding a page to m2p_override we change the p2m of the page so we
need to also clear the old pte of the kernel linear mapping because it
doesn't correspond anymore.
When we remove the page from m2p_override we restore the original p2m of
the page and we also restore the old pte of the kernel linear mapping.
Before changing the p2m mappings in m2p_add_override and
m2p_remove_override, check that the page passed as argument is valid and
return an error if it is not.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In m2p_add_override store the original mfn into page->index and then
change the p2m mapping, setting mfns as FOREIGN_FRAME.
In m2p_remove_override restore the original mapping.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Add a simple hashtable based mechanism to override some portions of the
m2p, so that we can find out the pfn corresponding to an mfn of a
granted page. In fact entries corresponding to granted pages in the m2p
hold the original pfn value of the page in the source domain that
granted it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* 'stable/bug-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/event: validate irq before get evtchn by irq
xen/fb: fix potential memory leak
xen/fb: fix xenfb suspend/resume race.
xen: disable ACPI NUMA for PV guests
xen/irq: Cleanup the find_unbound_irq
* 'stable/generic' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: HVM X2APIC support
apic: Move hypervisor detection of x2apic to hypervisor.h
Xen does not currently expose PV-NUMA information to PV
guests. Therefore disable NUMA for the time being to prevent the
kernel picking up on an host-level NUMA information which it might
come across in the firmware.
[ Added comment - Jeremy ]
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch is similiar to Gleb Natapov's patch for KVM, which enable the
hypervisor to emulate x2apic feature for the guest. By this way, the emulation
of lapic would be simpler with x2apic interface(MSR), and faster.
[v2: Re-organized 'xen_hvm_need_lapic' per Ian Campbell suggestion]
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Use this_cpu_ops to reduce code size and simplify things in various places.
V3->V4:
Move instance of this_cpu_inc_return to a later patchset so that
this patch can be applied without infrastructure changes.
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/pvclock: Zero last_value on resume
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf record: Fix eternal wait for stillborn child
perf header: Don't assume there's no attr info if no sample ids is provided
perf symbols: Figure out start address of kernel map from kallsyms
perf symbols: Fix kallsyms kernel/module map splitting
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
nohz: Fix printk_needs_cpu() return value on offline cpus
printk: Fix wake_up_klogd() vs cpu hotplug
* '2.6.37-rc4-pvhvm-fixes' of git://xenbits.xen.org/people/sstabellini/linux-pvhvm:
xen: unplug the emulated devices at resume time
xen: fix save/restore for PV on HVM guests with pirq remapping
xen: resume the pv console for hvm guests too
xen: fix MSI setup and teardown for PV on HVM guests
xen: use PHYSDEVOP_get_free_pirq to implement find_unbound_pirq
* 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xen: allocate irq descs on any NUMA node
xen: prevent crashes with non-HIGHMEM 32-bit kernels with largeish memory
xen: use default_idle
xen: clean up "extra" memory handling some more
* 'upstream/bugfix' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xen: x86/32: perform initial startup on initial_page_table
xen: don't bother to stop other cpus on shutdown/reboot
On stock 2.6.37-rc4, running:
# mount lilith:/export /mnt/lilith
# find /mnt/lilith/ -type f -print0 | xargs -0 file
crashes the machine fairly quickly under Xen. Often it results in oops
messages, but the couple of times I tried just now, it just hung quietly
and made Xen print some rude messages:
(XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
3000000000000000) for mfn 1d7058 (pfn 18fa7)
(XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
(XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
(XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04
Which means the domain tried to map a pagetable page RW, which would
allow it to map arbitrary memory, so Xen stopped it. This is because
vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
finished with them, and those pages got recycled as pagetable pages
while still having these RW aliases.
Removing those mappings immediately removes the Xen-visible aliases, and
so it has no problem with those pages being reused as pagetable pages.
Deferring the TLB flush doesn't upset Xen because it can flush the TLB
itself as needed to maintain its invariants.
When unmapping a region in the vmalloc space, clear the ptes
immediately. There's no point in deferring this because there's no
amortization benefit.
The TLBs are left dirty, and they are flushed lazily to amortize the
cost of the IPIs.
This specific motivation for this patch is an oops-causing regression
since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
of vm_map_ram() introduced in 56e4ebf877 ("NFS: readdir with vmapped
pages") . XFS also uses vm_map_ram() and could cause similar problems.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only make swapper_pg_dir readonly and pinned when generic x86 architecture code
(which also starts on initial_page_table) switches to it. This helps ensure
that the generic setup paths work on Xen unmodified. In particular
clone_pgd_range writes directly to the destination pgd entries and is used to
initialise swapper_pg_dir so we need to ensure that it remains writeable until
the last possible moment during bring up.
This is complicated slightly by the need to avoid sharing kernel PMD entries
when running under Xen, therefore the Xen implementation must make a copy of
the kernel PMD (which is otherwise referred to by both intial_page_table and
swapper_pg_dir) before switching to swapper_pg_dir.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Xen will shoot all the VCPUs when we do a shutdown hypercall, so there's
no need to do it manually.
In any case it will fail because all the IPI irqs have been pulled
down by this point, so the cross-CPU calls will simply hang forever.
Until change 76fac077db the function calls
were not synchronously waited for, so this wasn't apparent. However after
that change the calls became synchronous leading to a hang on shutdown
on multi-VCPU guests.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: Alok Kataria <akataria@vmware.com>
If the guest domain has been suspend/resumed or migrated, then the
system clock backing the pvclock clocksource may revert to a smaller
value (ie, can be non-monotonic across the migration/save-restore).
Make sure we zero last_value in that case so that the domain
continues to see clock updates.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Only make swapper_pg_dir readonly and pinned when generic x86 architecture code
(which also starts on initial_page_table) switches to it. This helps ensure
that the generic setup paths work on Xen unmodified. In particular
clone_pgd_range writes directly to the destination pgd entries and is used to
initialise swapper_pg_dir so we need to ensure that it remains writeable until
the last possible moment during bring up.
This is complicated slightly by the need to avoid sharing kernel PMD entries
when running under Xen, therefore the Xen implementation must make a copy of
the kernel PMD (which is otherwise referred to by both intial_page_table and
swapper_pg_dir) before switching to swapper_pg_dir.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We just need the idle loop to drop into safe_halt, which default_idle()
is perfectly capable of doing. There's no need to duplicate it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Make sure that extra_pages is added for all E820_RAM regions beyond
mem_end - completely excluded regions as well as the remains of partially
included regions.
Also makes sure the extra region is not unnecessarily high, and simplifies
the logic to decide which regions should be added.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* upstream/core:
xen/events: Use PIRQ instead of GSI value when unmapping MSI/MSI-X irqs.
xen: set IO permission early (before early_cpu_init())
xen: re-enable boot-time ballooning
xen/balloon: make sure we only include remaining extra ram
xen/balloon: the balloon_lock is useless
xen: add extra pages to balloon
xen/events: use locked set|clear_bit() for cpu_evtchn_mask
xen/evtchn: clear secondary CPUs' cpu_evtchn_mask[] after restore
xen: implement XENMEM_machphys_mapping
* upstream/xenfs:
Revert "xen/privcmd: create address space to allow writable mmaps"
xen/xenfs: update xenfs_mount for new prototype
xen: fix header export to userspace
xen: set vma flag VM_PFNMAP in the privcmd mmap file_op
xen: xenfs: privcmd: check put_user() return code
* upstream/evtchn:
xen: make evtchn's name less generic
xen/evtchn: the evtchn device is non-seekable
xen/evtchn: add missing static
xen/evtchn: Fix name of Xen event-channel device
xen/evtchn: don't do unbind_from_irqhandler under spinlock
xen/evtchn: remove spurious barrier
xen/evtchn: ports start enabled
xen/evtchn: dynamically allocate port_user array
xen/evtchn: track enabled state for each port
This patch is based off "xen dom0: Set up basic IO permissions for dom0."
by Juan Quintela <quintela@redhat.com>.
On AMD machines when we boot the kernel as Domain 0 we get this nasty:
mapping kernel into physical memory
Xen: setup ISA identity maps
about to get started...
(XEN) traps.c:475:d0 Unhandled general protection fault fault/trap [#13] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1-101116 x86_64 debug=y Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e033:[<ffffffff8130271b>]
(XEN) RFLAGS: 0000000000000282 EM: 1 CONTEXT: pv guest
(XEN) rax: 000000008000c068 rbx: ffffffff8186c680 rcx: 0000000000000068
(XEN) rdx: 0000000000000cf8 rsi: 000000000000c000 rdi: 0000000000000000
(XEN) rbp: ffffffff81801e98 rsp: ffffffff81801e50 r8: ffffffff81801eac
(XEN) r9: ffffffff81801ea8 r10: ffffffff81801eb4 r11: 00000000ffffffff
(XEN) r12: ffffffff8186c694 r13: ffffffff81801f90 r14: ffffffffffffffff
(XEN) r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000000006f0
(XEN) cr3: 0000000221803000 cr2: 0000000000000000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
(XEN) Guest stack trace from rsp=ffffffff81801e50:
RIP points to read_pci_config() function.
The issue is that we don't set IO permissions for the Linux kernel early enough.
The call sequence used to be:
xen_start_kernel()
x86_init.oem.arch_setup = xen_setup_arch;
setup_arch:
- early_cpu_init
- early_init_amd
- read_pci_config
- x86_init.oem.arch_setup [ xen_arch_setup ]
- set IO permissions.
We need to set the IO permissions earlier on, which this patch does.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Now that the balloon driver doesn't stumble over non-RAM pages, we
can enable the extra space for ballooning.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* commit 'v2.6.37-rc2': (10093 commits)
Linux 2.6.37-rc2
capabilities/syslog: open code cap_syslog logic to fix build failure
i2c: Sanity checks on adapter registration
i2c: Mark i2c_adapter.id as deprecated
i2c: Drivers shouldn't include <linux/i2c-id.h>
i2c: Delete unused adapter IDs
i2c: Remove obsolete cleanup for clientdata
include/linux/kernel.h: Move logging bits to include/linux/printk.h
Fix gcc 4.5.1 miscompiling drivers/char/i8k.c (again)
hwmon: (w83795) Check for BEEP pin availability
hwmon: (w83795) Clear intrusion alarm immediately
hwmon: (w83795) Read the intrusion state properly
hwmon: (w83795) Print the actual temperature channels as sources
hwmon: (w83795) List all usable temperature sources
hwmon: (w83795) Expose fan control method
hwmon: (w83795) Fix fan control mode attributes
hwmon: (lm95241) Check validity of input values
hwmon: Change mail address of Hans J. Koch
PCI: sysfs: fix printk warnings
GFS2: Fix inode deallocation race
...
This hypercall allows Xen to specify a non-default location for the
machine to physical mapping. This capability is used when running a 32
bit domain 0 on a 64 bit hypervisor to shrink the hypervisor hole to
exactly the size required.
[ Impact: add Xen hypercall definitions ]
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Set VM_PFNMAP in the privcmd mmap file_op, rather than later in
xen_remap_domain_mfn_range when it is too late because
vma_wants_writenotify has already been called and vm_page_prot has
already been modified.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
We already deliberately setup a 1-1 P2M for the region up to 1M in
order to allow code which assumes this region is already mapped to
work without having to convert everything to ioremap.
Domain 0 should not return any apparently unused memory regions
(reserved or otherwise) in this region to Xen since the e820 may not
accurately reflect what the BIOS has stashed in this region.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
sizeof(pmd_t *) is 4 bytes on 32-bit PAE leading to an allocation of
only 2048 bytes. The correct size is sizeof(pmd_t) giving us a full
page allocation.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
and branch 'for-linus' of git://xenbits.xen.org/people/sstabellini/linux-pvhvm
* 'for-linus' of git://xenbits.xen.org/people/sstabellini/linux-pvhvm:
xen: register xen pci notifier
xen: initialize cpu masks for pv guests in xen_smp_init
xen: add a missing #include to arch/x86/pci/xen.c
xen: mask the MTRR feature from the cpuid
xen: make hvc_xen console work for dom0.
xen: add the direct mapping area for ISA bus access
xen: Initialize xenbus for dom0.
xen: use vcpu_ops to setup cpu masks
xen: map a dummy page for local apic and ioapic in xen_set_fixmap
xen: remap MSIs into pirqs when running as initial domain
xen: remap GSIs as pirqs when running as initial domain
xen: introduce XEN_DOM0 as a silent option
xen: map MSIs into pirqs
xen: support GSI -> pirq remapping in PV on HVM guests
xen: add xen hvm acpi_register_gsi variant
acpi: use indirect call to register gsi in different modes
xen: implement xen_hvm_register_pirq
xen: get the maximum number of pirqs from xen
xen: support pirq != irq
* 'stable/xen-pcifront-0.8.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: (27 commits)
X86/PCI: Remove the dependency on isapnp_disable.
xen: Update Makefile with CONFIG_BLOCK dependency for biomerge.c
MAINTAINERS: Add myself to the Xen Hypervisor Interface and remove Chris Wright.
x86: xen: Sanitse irq handling (part two)
swiotlb-xen: On x86-32 builts, select SWIOTLB instead of depending on it.
MAINTAINERS: Add myself for Xen PCI and Xen SWIOTLB maintainer.
xen/pci: Request ACS when Xen-SWIOTLB is activated.
xen-pcifront: Xen PCI frontend driver.
xenbus: prevent warnings on unhandled enumeration values
xenbus: Xen paravirtualised PCI hotplug support.
xen/x86/PCI: Add support for the Xen PCI subsystem
x86: Introduce x86_msi_ops
msi: Introduce default_[teardown|setup]_msi_irqs with fallback.
x86/PCI: Export pci_walk_bus function.
x86/PCI: make sure _PAGE_IOMAP it set on pci mappings
x86/PCI: Clean up pci_cache_line_size
xen: fix shared irq device passthrough
xen: Provide a variant of xen_poll_irq with timeout.
xen: Find an unbound irq number in reverse order (high to low).
xen: statically initialize cpu_evtchn_mask_p
...
Fix up trivial conflicts in drivers/pci/Makefile
* akpm-incoming-2: (139 commits)
epoll: make epoll_wait() use the hrtimer range feature
select: rename estimate_accuracy() to select_estimate_accuracy()
Remove duplicate includes from many files
ramoops: use the platform data structure instead of module params
kernel/resource.c: handle reinsertion of an already-inserted resource
kfifo: fix kfifo_alloc() to return a signed int value
w1: don't allow arbitrary users to remove w1 devices
alpha: remove dma64_addr_t usage
mips: remove dma64_addr_t usage
sparc: remove dma64_addr_t usage
fuse: use release_pages()
taskstats: use real microsecond granularity for CPU times
taskstats: split fill_pid function
taskstats: separate taskstats commands
delayacct: align to 8 byte boundary on 64-bit systems
delay-accounting: reimplement -c for getdelays.c to report information on a target command
namespaces Kconfig: move namespace menu location after the cgroup
namespaces Kconfig: remove the cgroup device whitelist experimental tag
namespaces Kconfig: remove pointless cgroup dependency
namespaces Kconfig: make namespace a submenu
...
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
percpu: Remove the multi-page alignment facility
x86-32: Allocate irq stacks seperate from percpu area
x86-32, mm: Remove duplicated #include
x86, printk: Get rid of <0> from stack output
x86, kexec: Make sure to stop all CPUs before exiting the kernel
x86/vsmp: Eliminate kconfig dependency warning
* 'upstream/xenfs' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xen/privcmd: make privcmd visible in domU
xen/privcmd: move remap_domain_mfn_range() to core xen code and export.
privcmd: MMAPBATCH: Fix error handling/reporting
xenbus: export xen_store_interface for xenfs
xen/privcmd: make sure vma is ours before doing anything to it
xen/privcmd: print SIGBUS faults
xen/xenfs: set_page_dirty is supposed to return true if it dirties
xen/privcmd: create address space to allow writable mmaps
xen: add privcmd driver
xen: add variable hypercall caller
xen: add xen_set_domain_pte()
xen: add /proc/xen/xsd_{kva,port} to xenfs
* 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen: (29 commits)
xen: include xen/xen.h for definition of xen_initial_domain()
xen: use host E820 map for dom0
xen: correctly rebuild mfn list list after migration.
xen: improvements to VIRQ_DEBUG output
xen: set up IRQ before binding virq to evtchn
xen: ensure that all event channels start off bound to VCPU 0
xen/hvc: only notify if we actually sent something
xen: don't add extra_pages for RAM after mem_end
xen: add support for PAT
xen: make sure xen_max_p2m_pfn is up to date
xen: limit extra memory to a certain ratio of base
xen: add extra pages for E820 RAM regions, even if beyond mem_end
xen: make sure xen_extra_mem_start is beyond all non-RAM e820
xen: implement "extra" memory to reserve space for pages not present at boot
xen: Use host-provided E820 map
xen: don't map missing memory
xen: defer building p2m mfn structures until kernel is mapped
xen: add return value to set_phys_to_machine()
xen: convert p2m to a 3 level tree
xen: make install_p2mtop_page() static
...
Fix up trivial conflict in arch/x86/xen/mmu.c, and fix the use of
'reserve_early()' - in the new memblock world order it is now
'memblock_x86_reserve_range()' instead. Pointed out by Jeremy.
Pv guests don't have ACPI and need the cpu masks to be set
correctly as early as possible so we call xen_fill_possible_map from
xen_smp_init.
On the other hand the initial domain supports ACPI so in this case we skip
xen_fill_possible_map and rely on it. However Xen might limit the number
of cpus usable by the domain, so we filter those masks during smp
initialization using the VCPUOP_is_up hypercall.
It is important that the filtering is done before
xen_setup_vcpu_info_placement.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
CC arch/x86/xen/setup.o
arch/x86/xen/setup.c: In function 'xen_memory_setup':
arch/x86/xen/setup.c:161: error: implicit declaration of function 'xen_initial_domain'
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
We don't want Linux to think that the cpu supports MTRRs when running
under Xen because MTRR operations could only be performed through
hypercalls.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
add the direct mapping area for ISA bus access when running as initial
domain
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Add XEN_DOM0 to arch/x86/xen/Kconfig as a silent compile time option
that gets enabled when xen and basic x86, acpi and pci support are
selected.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When running as initial domain, get the real physical memory map from
xen using the XENMEM_machine_memory_map hypercall and use it to setup
the e820 regions.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Otherwise the second migration attempt fails because the mfn_list_list
still refers to all the old mfns.
We need to update the entires in both p2m_top_mfn and the mid_mfn
pages which p2m_top_mfn refers to.
In order to do this we need to keep track of the virtual addresses
mapping the p2m_mid_mfn pages since we cannot rely on
mfn_to_virt(p2m_top_mfn[idx]) since p2m_top_mfn[idx] will still
contain the old MFN after a migration, which may now belong to another
domain and hence have a different mapping in the m2p.
Therefore add and maintain a third top level page, p2m_top_mfn_p[],
which tracks the virtual addresses of the mfns contained in
p2m_top_mfn[].
We also need to update the content of the p2m_mid_missing_mfn page on
resume to refer to the page's new mfn.
p2m_missing does not need updating since the migration process takes
care of the leaf p2m pages for us.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
If an E820 region is entirely beyond mem_end, don't attempt to truncate
it and add the truncated pages to extra_pages, as they will be negative.
Also, make sure the extra memory region starts after all BIOS provided
E820 regions (and in the case of RAM regions, post-clipping).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Convert Linux PAT entries into Xen ones when constructing ptes. Linux
doesn't use _PAGE_PAT for ptes, so the only difference in the first 4
entries is that Linux uses _PAGE_PWT for WC, whereas Xen (and default)
use it for WT.
xen_pte_val does the inverse conversion.
We hard-code assumptions about Linux's current PAT layout, but a
warning on the wrmsr to MSR_IA32_CR_PAT should point out any problems.
If necessary we could go to a more general table-based conversion between
Linux and Xen PAT entries.
hugetlbfs poses a problem at the moment, the x86 architecture uses the
same flag for _PAGE_PAT and _PAGE_PSE, which changes meaning depending
on which pagetable level we're using. At the moment this should be OK
so long as nobody tries to do a pte_val on a hugetlbfs pte.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Keep xen_max_p2m_pfn up to date with the end of the extra memory
we're adding. It is possible that it will be too high since memory
may be truncated by a "mem=" option on the kernel command line, but
that won't matter.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
If extra memory is very much larger than the base memory size
then all of the base memory can be filled with structures reserved to
describe the extra memory, leaving no space for anything else.
Even at the maximum ratio there will be little space for anything else,
but this change is intended to at least allow the system to boot rather
than crash mysteriously.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
If an entire E820 RAM region is beyond mem_end, still add its
pages to the extra area so that space can be used by the kernel.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
If Xen gives us non-RAM E820 entries (dom0 only, typically), then
make sure the extra RAM region is beyond them. It's OK for
the extra space to grow into E820 regions, however.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
When using the e820 map to get the initial pseudo-physical address space,
look for either Xen-provided memory which doesn't lie within an E820
region, or an E820 RAM region which extends beyond the Xen-provided
memory range.
Count these pages, and add them to a new "extra memory" range. This range
has an E820 RAM range to describe it - so the kernel will allocate page
structures for it - but it is also marked reserved so that the kernel
will not attempt to use it.
The balloon driver can then add this range as a set of currently
ballooned-out pages, which can be used to extend the domain beyond its
original size.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Rather than simply using a flat memory map from Xen, use its provided
E820 map. This allows the domain builder to tell the domain to reserve
space for more pages than those initially provided at domain-build time.
It also allows the host to specify holes in the address space (for
PCI-passthrough, for example).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
When setting up a pte for a missing pfn (no matching mfn), just create
an empty pte rather than a junk mapping.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
When building mfn parts of p2m structure, we rely on being able to
use mfn_to_virt, which in turn requires kernel to be mapped into
the linear area (which is distinct from the kernel image mapping
on 64-bit). Defer calling xen_build_mfn_list_list() until after
xen_setup_kernel_pagetable();
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
set_phys_to_machine() can return false on failure, which means a memory
allocation failure for the p2m structure. It can only fail if setting
the mfn for a pfn in previously unused address space. It is guaranteed
to succeed if you're setting a mapping to INVALID_P2M_ENTRY or updating
the mfn for an existing pfn.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Make the p2m structure a 3 level tree which covers the full possible
physical space.
The p2m structure contains mappings from the domain's pfns to system-wide
mfns. The structure has 3 levels and two roots. The first root is for
the domain's own use, and is linked with virtual addresses. The second
is all mfn references, and is used by Xen on save/restore to allow it to
update the p2m mapping for the domain.
At boot, the domain builder provides a simple flat p2m array for all the
initially present pages. We construct the two levels above that using
the early_brk allocator. After early boot time, set_phys_to_machine()
will allocate any missing levels using the normal kernel allocator
(at GFP_KERNEL, so it must be called in a normal blocking context).
Because the early_brk() API requires us to pre-reserve the maximum amount
of memory we could allocate, there is still a CONFIG_XEN_MAX_DOMAIN_MEMORY
config option, but its only negative side-effect is to increase the
kernel's apparent bss size. However, since all unused brk memory is
returned to the heap, there's no real downside to making it large.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Allocate p2m tables based on the actual runtime maximum pfn rather than
the static config-time limit.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Use early brk mechanism to allocate p2m tables, to save memory when
booting non-Xen.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
* 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (74 commits)
x86-64: Only set max_pfn_mapped to 512 MiB if we enter via head_64.S
xen: Cope with unmapped pages when initializing kernel pagetable
memblock, bootmem: Round pfn properly for memory and reserved regions
memblock: Annotate memblock functions with __init_memblock
memblock: Allow memblock_init to be called early
memblock/arm: Fix memblock_region_is_memory() typo
x86, memblock: Remove __memblock_x86_find_in_range_size()
memblock: Fix wraparound in find_region()
x86-32, memblock: Make add_highpages honor early reserved ranges
x86, memblock: Fix crashkernel allocation
arm, memblock: Fix the sparsemem build
memblock: Fix section mismatch warnings
powerpc, memblock: Fix memblock API change fallout
memblock, microblaze: Fix memblock API change fallout
x86: Remove old bootmem code
x86, memblock: Use memblock_memory_size()/memblock_free_memory_size() to get correct dma_reserve
x86: Remove not used early_res code
x86, memblock: Replace e820_/_early string with memblock_
x86: Use memblock to replace early_res
x86, memblock: Use memblock_debug to control debug message print out
...
Fix up trivial conflicts in arch/x86/kernel/setup.c and kernel/Makefile
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-irqflags:
Fix IRQ flag handling naming
MIPS: Add missing #inclusions of <linux/irq.h>
smc91x: Add missing #inclusion of <linux/irq.h>
Drop a couple of unnecessary asm/system.h inclusions
SH: Add missing consts to sys_execve() declaration
Blackfin: Rename IRQ flags handling functions
Blackfin: Add missing dep to asm/irqflags.h
Blackfin: Rename DES PC2() symbol to avoid collision
Blackfin: Split the BF532 BFIN_*_FIO_FLAG() functions to their own header
Blackfin: Split PLL code from mach-specific cdef headers
* 'x86-vmware-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, paravirt: Remove alloc_pmd_clone hook, only used by VMI
x86, vmware: Remove deprecated VMI kernel support
Fix up trivial #include conflict in arch/x86/kernel/smpboot.c
x86 smp_ops now has a new op, stop_other_cpus which takes a parameter
"wait" this allows the caller to specify if it wants to stop until all
the cpus have processed the stop IPI. This is required specifically
for the kexec case where we should wait for all the cpus to be stopped
before starting the new kernel. We now wait for the cpus to stop in
all cases except for panic/kdump where we expect things to be broken
and we are doing our best to make things work anyway.
This patch fixes a legitimate regression, which was introduced during
2.6.30, by commit id 4ef702c10b.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
LKML-Reference: <1286833028.1372.20.camel@ank32.eng.vmware.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: <stable@kernel.org> v2.6.30-36
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This allows xenfs to be built as a module, previously it required flush_tlb_all
and arbitrary_virt_to_machine to be exported.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Add xen_set_domain_pte() to allow setting a pte mapping a page from
another domain. The common case is to map from DOMID_IO, the pseudo
domain which owns all IO pages, but will also be used in the privcmd
interface to map other domain pages.
[ Impact: new Xen-internal API for cross-domain mappings ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
It used to done in the Xen startup code but that is not really
appropiate.
[v2: Update Kconfig with PCI requirement]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The frontend stub lives in arch/x86/pci/xen.c, alongside other
sub-arch PCI init code (e.g. olpc.c).
It provides a mechanism for Xen PCI frontend to setup/destroy
legacy interrupts, MSI/MSI-X, and PCI configuration operations.
[ Impact: add core of Xen PCI support ]
[ v2: Removed the IOMMU code and only focusing on PCI.]
[ v3: removed usage of pci_scan_all_fns as that does not exist]
[ v4: introduced pci_xen value to fix compile warnings]
[ v5: squished fixes+features in one patch, changed Reviewed-by to Ccs]
[ v7: added Acked-by]
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Qing He <qing.he@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
If a guest domain wants to access PCI devices through the frontend
driver (coming later in the patch series), it will need access to the
I/O space.
[ Impact: Allow for domU IO access, preparing for pci passthrough ]
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Xen requires that all pages containing pagetable entries to be mapped
read-only. If pages used for the initial pagetable are already mapped
then we can change the mapping to RO. However, if they are initially
unmapped, we need to make sure that when they are later mapped, they
are also mapped RO.
We do this by knowing that the kernel pagetable memory is pre-allocated
in the range e820_table_start - e820_table_end, so any pfn within this
range should be mapped read-only. However, the pagetable setup code
early_ioremaps the pages to write their entries, so we must make sure
that mappings created in the early_ioremap fixmap area are mapped RW.
(Those mappings are removed before the pages are presented to Xen
as pagetable pages.)
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
LKML-Reference: <4CB63A80.8060702@goop.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>