Commit Graph

1246 Commits

Author SHA1 Message Date
Linus Torvalds
613d4cefbb xen: bug fixes for 3.19-rc4
- Several critical linear p2m fixes that prevented some hosts from
   booting.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJUtQHNAAoJEFxbo/MsZsTR/qgH/iiW4k2T8dBGZ7TPyzt88iyT
 4caWjuujp2OUaRqhBQdY7z05uai6XxgJLwDyqiO+qHaRUj+ZWCrjh/ZFPU1+09hK
 GdwPMWU7xMRs/7F2ANO03jJ/ktvsYXtazcVrV89Q3t+ZZJIQ/THovDkaoa+dF2lh
 W8d5H7N2UNCJLe9w2fm5iOq4SKoTsJOq6pVQ6gUBqJcgkSDWavd6bowXnTlcepZN
 tNaSMZsOt4CAvYQIa0nKPJo6Q4QN3buRQMWEOAOmGVT/RkVi68wirwk59uNzcS7E
 HjhqxFjhXYamNTuwHYZlchBrZutdbymSlucVucb1wAoxRAX+Wd1jk5EPl6zLv4w=
 =kFSE
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen bug fixes from David Vrabel:
 "Several critical linear p2m fixes that prevented some hosts from
  booting"

* tag 'stable/for-linus-3.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: properly retrieve NMI reason
  xen: check for zero sized area when invalidating memory
  xen: use correct type for physical addresses
  xen: correct race in alloc_p2m_pmd()
  xen: correct error for building p2m list on 32 bits
  x86/xen: avoid freeing static 'name' when kasprintf() fails
  x86/xen: add extra memory for remapped frames during setup
  x86/xen: don't count how many PFNs are identity mapped
  x86/xen: Free bootmem in free_p2m_page() during early boot
  x86/xen: Remove unnecessary BUG_ON(preemptible()) in xen_setup_timer()
2015-01-14 08:07:42 +13:00
Jan Beulich
f221b04fe0 x86/xen: properly retrieve NMI reason
Using the native code here can't work properly, as the hypervisor would
normally have cleared the two reason bits by the time Dom0 gets to see
the NMI (if passed to it at all). There's a shared info field for this,
and there's an existing hook to use - just fit the two together. This
is particularly relevant so that NMIs intended to be handled by APEI /
GHES actually make it to the respective handler.

Note that the hook can (and should) be used irrespective of whether
being in Dom0, as accessing port 0x61 in a DomU would be even worse,
while the shared info field would just hold zero all the time. Note
further that hardware NMI handling for PVH doesn't currently work
anyway due to missing code in the hypervisor (but it is expected to
work the native rather than the PV way).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-13 09:39:50 +00:00
Juergen Gross
9a17ad7f3d xen: check for zero sized area when invalidating memory
With the introduction of the linear mapped p2m list setting memory
areas to "invalid" had to be delayed. When doing the invalidation
make sure no zero sized areas are processed.

Signed-off-by: Juegren Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-12 10:09:55 +00:00
Juergen Gross
e86f949667 xen: use correct type for physical addresses
When converting a pfn to a physical address be sure to use 64 bit
wide types or convert the physical address to a pfn if possible.

Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-12 10:09:48 +00:00
Juergen Gross
f241b0b891 xen: correct race in alloc_p2m_pmd()
When allocating a new pmd for the linear mapped p2m list a check is
done for not introducing another pmd when this just happened on
another cpu. In this case the old pte pointer was returned which
points to the p2m_missing or p2m_identity page. The correct value
would be the pointer to the found new page.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-12 10:09:40 +00:00
Juergen Gross
82c92ed135 xen: correct error for building p2m list on 32 bits
In xen_rebuild_p2m_list() for large areas of invalid or identity
mapped memory the pmd entries on 32 bit systems are initialized
wrong. Correct this error.

Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-12 10:09:34 +00:00
Vitaly Kuznetsov
7be0772d19 x86/xen: avoid freeing static 'name' when kasprintf() fails
In case kasprintf() fails in xen_setup_timer() we assign name to the
static string "<timer kasprintf failed>". We, however, don't check
that fact before issuing kfree() in xen_teardown_timer(), kernel is
supposed to crash with 'kernel BUG at mm/slub.c:3341!'

Solve the issue by making name a fixed length string inside struct
xen_clock_event_device. 16 bytes should be enough.

Suggested-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-08 13:55:25 +00:00
David Vrabel
a97dae1a2e x86/xen: add extra memory for remapped frames during setup
If the non-RAM regions in the e820 memory map are larger than the size
of the initial balloon, a BUG was triggered as the frames are remaped
beyond the limit of the linear p2m.  The frames are remapped into the
initial balloon area (xen_extra_mem) but not enough of this is
available.

Ensure enough extra memory regions are added for these remapped
frames.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
2015-01-08 13:52:37 +00:00
David Vrabel
bc7142cf79 x86/xen: don't count how many PFNs are identity mapped
This accounting is just used to print a diagnostic message that isn't
very useful.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
2015-01-08 13:52:36 +00:00
Boris Ostrovsky
701a261ad6 x86/xen: Free bootmem in free_p2m_page() during early boot
With recent changes in p2m we now have legitimate cases when
p2m memory needs to be freed during early boot (i.e. before
slab is initialized).

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-08 13:52:36 +00:00
Boris Ostrovsky
8b8cd8a367 x86/xen: Remove unnecessary BUG_ON(preemptible()) in xen_setup_timer()
There is no reason for having it and, with commit 250a1ac685 ("x86,
smpboot: Remove pointless preempt_disable() in
native_smp_prepare_cpus()"), it prevents HVM guests from booting.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-23 10:31:55 +00:00
Linus Torvalds
eb64c3c6cd xen: additional features for 3.19-rc0
- Linear p2m for x86 PV guests which simplifies the p2m code, improves
   performance and will allow for > 512 GB PV guests in the future.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJUjx7OAAoJEFxbo/MsZsTRXLIH/ishF/xDCL6F5r0I0SKDuaz5
 C/BediDcFzbzh4/t3x2PrPooHk4gPmeyIg688ZGgBAxHRXC5OJ2U5tdtZ/qUCnwf
 0J1pdp/yoAOVRJT+Sax10lN4+G8YV7+6Ptikz0C7glXBAg8SgFL3Y6tfBS0jNwYR
 wQph09S9n7gMZTodSBLbb0ymtJMhl16DrETJsYV73sU7bAL5sFDVkMQvY3SxkusX
 GNFeALfqM0cSK9mDI6O9avGJKoIdKlzt7VWHdlc+yKTlQsoyg/cSH3AaihhG6af9
 IElRxwH9Z40VFLKip0gNMOIrUwAjFGSw6N+Uhik27tlmvfI3Dll/+gsMz/5sHc8=
 =OyoK
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.19-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull additional xen update from David Vrabel:
 "Xen: additional features for 3.19-rc0

   - Linear p2m for x86 PV guests which simplifies the p2m code,
     improves performance and will allow for > 512 GB PV guests in the
     future.

  A last-minute, configuration specific issue was discovered with this
  change which is why it was not included in my previous pull request.
  This is now been fixed and tested"

* tag 'stable/for-linus-3.19-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: switch to post-init routines in xen mmu.c earlier
  Revert "swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single"
  xen: annotate xen_set_identity_and_remap_chunk() with __init
  xen: introduce helper functions to do safe read and write accesses
  xen: Speed up set_phys_to_machine() by using read-only mappings
  xen: switch to linear virtual mapped sparse p2m list
  xen: Hide get_phys_to_machine() to be able to tune common path
  x86: Introduce function to get pmd entry pointer
  xen: Delay invalidating extra memory
  xen: Delay m2p_override initialization
  xen: Delay remapping memory of pv-domain
  xen: use common page allocation function in p2m.c
  xen: Make functions static
  xen: fix some style issues in p2m.c
2014-12-16 13:23:03 -08:00
Juergen Gross
cdfa0badfc xen: switch to post-init routines in xen mmu.c earlier
With the virtual mapped linear p2m list the post-init mmu operations
must be used for setting up the p2m mappings, as in case of
CONFIG_FLATMEM the init routines may trigger BUGs.

paging_init() sets up all infrastructure needed to switch to the
post-init mmu ops done by xen_post_allocator_init(). With the virtual
mapped linear p2m list we need some mmu ops during setup of this list,
so we have to switch to the correct mmu ops as soon as possible.

The p2m list is usable from the beginning, just expansion requires to
have established the new linear mapping. So the call of
xen_remap_memory() had to be introduced, but this is not due to the
mmu ops requiring this.

Summing it up: calling xen_post_allocator_init() not directly after
paging_init() was conceptually wrong in the beginning, it just didn't
matter up to now as no functions used between the two calls needed
some critical mmu ops (e.g. alloc_pte). This has changed now, so I
corrected it.

Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-11 12:05:53 +00:00
Linus Torvalds
3100e448e7 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Ingo Molnar:
 "Various vDSO updates from Andy Lutomirski, mostly cleanups and
  reorganization to improve maintainability, but also some
  micro-optimizations and robustization changes"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86_64/vsyscall: Restore orig_ax after vsyscall seccomp
  x86_64: Add a comment explaining the TASK_SIZE_MAX guard page
  x86_64,vsyscall: Make vsyscall emulation configurable
  x86_64, vsyscall: Rewrite comment and clean up headers in vsyscall code
  x86_64, vsyscall: Turn vsyscalls all the way off when vsyscall==none
  x86,vdso: Use LSL unconditionally for vgetcpu
  x86: vdso: Fix build with older gcc
  x86_64/vdso: Clean up vgetcpu init and merge the vdso initcalls
  x86_64/vdso: Remove jiffies from the vvar page
  x86/vdso: Make the PER_CPU segment 32 bits
  x86/vdso: Make the PER_CPU segment start out accessed
  x86/vdso: Change the PER_CPU segment to use struct desc_struct
  x86_64/vdso: Move getcpu code from vsyscall_64.c to vdso/vma.c
  x86_64/vsyscall: Move all of the gate_area code to vsyscall_64.c
2014-12-10 14:24:20 -08:00
Linus Torvalds
a023748d53 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm tree changes from Ingo Molnar:
 "The biggest change is full PAT support from Jürgen Gross:

     The x86 architecture offers via the PAT (Page Attribute Table) a
     way to specify different caching modes in page table entries.  The
     PAT MSR contains 8 entries each specifying one of 6 possible cache
     modes.  A pte references one of those entries via 3 bits:
     _PAGE_PAT, _PAGE_PWT and _PAGE_PCD.

     The Linux kernel currently supports only 4 different cache modes.
     The PAT MSR is set up in a way that the setting of _PAGE_PAT in a
     pte doesn't matter: the top 4 entries in the PAT MSR are the same
     as the 4 lower entries.

     This results in the kernel not supporting e.g. write-through mode.
     Especially this cache mode would speed up drivers of video cards
     which now have to use uncached accesses.

     OTOH some old processors (Pentium) don't support PAT correctly and
     the Xen hypervisor has been using a different PAT MSR configuration
     for some time now and can't change that as this setting is part of
     the ABI.

     This patch set abstracts the cache mode from the pte and introduces
     tables to translate between cache mode and pte bits (the default
     cache mode "write back" is hard-wired to PAT entry 0).  The tables
     are statically initialized with values being compatible to old
     processors and current usage.  As soon as the PAT MSR is changed
     (or - in case of Xen - is read at boot time) the tables are changed
     accordingly.  Requests of mappings with special cache modes are
     always possible now, in case they are not supported there will be a
     fallback to a compatible but slower mode.

     Summing it up, this patch set adds the following features:

      - capability to support WT and WP cache modes on processors with
        full PAT support

      - processors with no or uncorrect PAT support are still working as
        today, even if WT or WP cache mode are selected by drivers for
        some pages

      - reduction of Xen special handling regarding cache mode

  Another change is a boot speedup on ridiculously large RAM systems,
  plus other smaller fixes"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86: mm: Move PAT only functions to mm/pat.c
  xen: Support Xen pv-domains using PAT
  x86: Enable PAT to use cache mode translation tables
  x86: Respect PAT bit when copying pte values between large and normal pages
  x86: Support PAT bit in pagetable dump for lower levels
  x86: Clean up pgtable_types.h
  x86: Use new cache mode type in memtype related functions
  x86: Use new cache mode type in mm/ioremap.c
  x86: Use new cache mode type in setting page attributes
  x86: Remove looking for setting of _PAGE_PAT_LARGE in pageattr.c
  x86: Use new cache mode type in track_pfn_remap() and track_pfn_insert()
  x86: Use new cache mode type in mm/iomap_32.c
  x86: Use new cache mode type in asm/pgtable.h
  x86: Use new cache mode type in arch/x86/mm/init_64.c
  x86: Use new cache mode type in arch/x86/pci
  x86: Use new cache mode type in drivers/video/fbdev/vermilion
  x86: Use new cache mode type in drivers/video/fbdev/gbefb.c
  x86: Use new cache mode type in include/asm/fb.h
  x86: Make page cache mode a real type
  x86: mm: Use 2GB memory block size on large-memory x86-64 systems
  ...
2014-12-10 13:59:34 -08:00
Juergen Gross
76f0a486fa xen: annotate xen_set_identity_and_remap_chunk() with __init
Commit 5b8e7d8054 removed the __init
annotation from xen_set_identity_and_remap_chunk(). Add it again.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-08 10:55:30 +00:00
Juergen Gross
90fff3ea15 xen: introduce helper functions to do safe read and write accesses
Introduce two helper functions to safely read and write unsigned long
values from or to memory when the access may fault because the mapping
is non-present or read-only.

These helpers can be used instead of open coded uses of __get_user()
and __put_user() avoiding the need to do casts to fix sparse warnings.

Use the helpers in page.h and p2m.c. This will fix the sparse
warnings when doing "make C=1".

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-08 10:53:59 +00:00
Juergen Gross
2e917175e1 xen: Speed up set_phys_to_machine() by using read-only mappings
Instead of checking at each call of set_phys_to_machine() whether a
new p2m page has to be allocated due to writing an entry in a large
invalid or identity area, just map those areas read only and react
to a page fault on write by allocating the new page.

This change will make the common path with no allocation much
faster as it only requires a single write of the new mfn instead
of walking the address translation tables and checking for the
special cases.

Suggested-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:09:20 +00:00
Juergen Gross
054954eb05 xen: switch to linear virtual mapped sparse p2m list
At start of the day the Xen hypervisor presents a contiguous mfn list
to a pv-domain. In order to support sparse memory this mfn list is
accessed via a three level p2m tree built early in the boot process.
Whenever the system needs the mfn associated with a pfn this tree is
used to find the mfn.

Instead of using a software walked tree for accessing a specific mfn
list entry this patch is creating a virtual address area for the
entire possible mfn list including memory holes. The holes are
covered by mapping a pre-defined  page consisting only of "invalid
mfn" entries. Access to a mfn entry is possible by just using the
virtual base address of the mfn list and the pfn as index into that
list. This speeds up the (hot) path of determining the mfn of a
pfn.

Kernel build on a Dell Latitude E6440 (2 cores, HT) in 64 bit Dom0
showed following improvements:

Elapsed time: 32:50 ->  32:35
System:       18:07 ->  17:47
User:        104:00 -> 103:30

Tested with following configurations:
- 64 bit dom0, 8GB RAM
- 64 bit dom0, 128 GB RAM, PCI-area above 4 GB
- 32 bit domU, 512 MB, 8 GB, 43 GB (more wouldn't work even without
                                    the patch)
- 32 bit domU, ballooning up and down
- 32 bit domU, save and restore
- 32 bit domU with PCI passthrough
- 64 bit domU, 8 GB, 2049 MB, 5000 MB
- 64 bit domU, ballooning up and down
- 64 bit domU, save and restore
- 64 bit domU with PCI passthrough

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:09:15 +00:00
Juergen Gross
0aad568983 xen: Hide get_phys_to_machine() to be able to tune common path
Today get_phys_to_machine() is always called when the mfn for a pfn
is to be obtained. Add a wrapper __pfn_to_mfn() as inline function
to be able to avoid calling get_phys_to_machine() when possible as
soon as the switch to a linear mapped p2m list has been done.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:09:09 +00:00
Juergen Gross
5b8e7d8054 xen: Delay invalidating extra memory
When the physical memory configuration is initialized the p2m entries
for not pouplated memory pages are set to "invalid". As those pages
are beyond the hypervisor built p2m list the p2m tree has to be
extended.

This patch delays processing the extra memory related p2m entries
during the boot process until some more basic memory management
functions are callable. This removes the need to create new p2m
entries until virtual memory management is available.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:08:59 +00:00
Juergen Gross
97f4533a60 xen: Delay m2p_override initialization
The m2p overrides are used to be able to find the local pfn for a
foreign mfn mapped into the domain. They are used by driver backends
having to access frontend data.

As this functionality isn't used in early boot it makes no sense to
initialize the m2p override functions very early. It can be done
later without doing any harm, removing the need for allocating memory
via extend_brk().

While at it make some m2p override functions static as they are only
used internally.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:08:53 +00:00
Juergen Gross
1f3ac86b4c xen: Delay remapping memory of pv-domain
Early in the boot process the memory layout of a pv-domain is changed
to match the E820 map (either the host one for Dom0 or the Xen one)
regarding placement of RAM and PCI holes. This requires removing memory
pages initially located at positions not suitable for RAM and adding
them later at higher addresses where no restrictions apply.

To be able to operate on the hypervisor supported p2m list until a
virtual mapped linear p2m list can be constructed, remapping must
be delayed until virtual memory management is initialized, as the
initial p2m list can't be extended unlimited at physical memory
initialization time due to it's fixed structure.

A further advantage is the reduction in complexity and code volume as
we don't have to be careful regarding memory restrictions during p2m
updates.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:08:48 +00:00
Juergen Gross
7108c9ce8f xen: use common page allocation function in p2m.c
In arch/x86/xen/p2m.c three different allocation functions for
obtaining a memory page are used: extend_brk(), alloc_bootmem_align()
or __get_free_page().  Which of those functions is used depends on the
progress of the boot process of the system.

Introduce a common allocation routine selecting the to be called
allocation routine dynamically based on the boot progress. This allows
moving initialization steps without having to care about changing
allocation calls.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:08:42 +00:00
Juergen Gross
820c4db2be xen: Make functions static
Some functions in arch/x86/xen/p2m.c are used locally only. Make them
static. Rearrange the functions in p2m.c to avoid forward declarations.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:08:37 +00:00
Juergen Gross
6f58d89e6c xen: fix some style issues in p2m.c
The source arch/x86/xen/p2m.c has some coding style issues. Fix them.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-04 14:08:29 +00:00
Juergen Gross
47591df505 xen: Support Xen pv-domains using PAT
With the dynamical mapping between cache modes and pgprot values it is
now possible to use all cache modes via the Xen hypervisor PAT settings
in a pv domain.

All to be done is to read the PAT configuration MSR and set up the
translation tables accordingly.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stefan.bader@canonical.com
Cc: xen-devel@lists.xensource.com
Cc: ville.syrjala@linux.intel.com
Cc: jbeulich@suse.com
Cc: toshi.kani@hp.com
Cc: plagnioj@jcrosoft.com
Cc: tomi.valkeinen@ti.com
Cc: bhelgaas@google.com
Link: http://lkml.kernel.org/r/1415019724-4317-19-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-16 11:04:26 +01:00
Boris Ostrovsky
54279552bd x86/core, x86/xen/smp: Use 'die_complete' completion when taking CPU down
Commit 2ed53c0d6c ("x86/smpboot: Speed up suspend/resume by
avoiding 100ms sleep for CPU offline during S3") introduced
completions to CPU offlining process. These completions are not
initialized on Xen kernels causing a panic in
play_dead_common().

Move handling of die_complete into common routines to make them
available to Xen guests.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: tianyu.lan@intel.com
Cc: konrad.wilk@oracle.com
Cc: xen-devel@lists.xenproject.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414770572-7950-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-10 11:16:40 +01:00
Andy Lutomirski
1ad83c858c x86_64,vsyscall: Make vsyscall emulation configurable
This adds CONFIG_X86_VSYSCALL_EMULATION, guarded by CONFIG_EXPERT.
Turning it off completely disables vsyscall emulation, saving ~3.5k
for vsyscall_64.c, 4k for vsyscall_emu_64.S (the fake vsyscall
page), some tiny amount of core mm code that supports a gate area,
and possibly 4k for a wasted pagetable.  The latter is because the
vsyscall addresses are misaligned and fit poorly in the fixmap.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/406db88b8dd5f0cbbf38216d11be34bbb43c7eae.1414618407.git.luto@amacapital.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-03 21:44:57 +01:00
Martin Kelly
1ea644c8f9 x86/xen: panic on bad Xen-provided memory map
Panic if Xen provides a memory map with 0 entries. Although this is
unlikely, it is better to catch the error at the point of seeing the map
than later on as a symptom of some other crash.

Signed-off-by: Martin Kelly <martkell@amazon.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-23 16:24:02 +01:00
Boris Ostrovsky
3251f20b89 x86/xen: Fix incorrect per_cpu accessor in xen_clocksource_read()
Commit 89cbc76768 ("x86: Replace __get_cpu_var uses") replaced
__get_cpu_var() with this_cpu_ptr() in xen_clocksource_read() in such a
way that instead of accessing a structure pointed to by a per-cpu pointer
we are trying to get to a per-cpu structure.

__this_cpu_read() of the pointer is the more appropriate accessor.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-23 16:24:02 +01:00
Juergen Gross
3a0e94f8ea x86/xen: avoid race in p2m handling
When a new p2m leaf is allocated this leaf is linked into the p2m tree
via cmpxchg. Unfortunately the compare value for checking the success
of the update is read after checking for the need of a new leaf. It is
possible that a new leaf has been linked into the tree concurrently
in between. This could lead to a leaked memory page and to the loss of
some p2m entries.

Avoid the race by using the read compare value for checking the need
of a new p2m leaf and use ACCESS_ONCE() to get it.

There are other places which seem to need ACCESS_ONCE() to ensure
proper operation. Change them accordingly.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-23 16:24:02 +01:00
Juergen Gross
2c185687ab x86/xen: delay construction of mfn_list_list
The 3 level p2m tree for the Xen tools is constructed very early at
boot by calling xen_build_mfn_list_list(). Memory needed for this tree
is allocated via extend_brk().

As this tree (other than the kernel internal p2m tree) is only needed
for domain save/restore, live migration and crash dump analysis it
doesn't matter whether it is constructed very early or just some
milliseconds later when memory allocation is possible by other means.

This patch moves the call of xen_build_mfn_list_list() just after
calling xen_pagetable_p2m_copy() simplifying this function, too, as it
doesn't have to bother with two parallel trees now. The same applies
for some other internal functions.

While simplifying code, make early_can_reuse_p2m_middle() static and
drop the unused second parameter. p2m_mid_identity_mfn can be removed
as well, it isn't used either.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-23 16:24:02 +01:00
Juergen Gross
239af7c713 x86/xen: avoid writing to freed memory after race in p2m handling
In case a race was detected during allocation of a new p2m tree
element in alloc_p2m() the new allocated mid_mfn page is freed without
updating the pointer to the found value in the tree. This will result
in overwriting the just freed page with the mfn of the p2m leaf.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-23 16:24:01 +01:00
Linus Torvalds
0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Linus Torvalds
19e00d593e Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 bootup updates from Ingo Molnar:
 "The changes in this cycle were:

   - Fix rare SMP-boot hang (mostly in virtual environments)

   - Fix build warning with certain (rare) toolchains"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/relocs: Make per_cpu_load_addr static
  x86/smpboot: Initialize secondary CPU only if master CPU will wait for it
2014-10-13 18:16:32 +02:00
Mukesh Rathor
a2ef5dc2c7 x86/xen: Set EFER.NX and EFER.SCE in PVH guests
This fixes two bugs in PVH guests:

  - Not setting EFER.NX means the NX bit in page table entries is
    ignored on Intel processors and causes reserved bit page faults on
    AMD processors.

  - After the Xen commit 7645640d6ff1 ("x86/PVH: don't set EFER_SCE for
    pvh guest") PVH guests are required to set EFER.SCE to enable the
    SYSCALL instruction.

Secondary VCPUs are started with pagetables with the NX bit set so
EFER.NX must be set before using any stack or data segment.
xen_pvh_cpu_early_init() is the new secondary VCPU entry point that
sets EFER before jumping to cpu_bringup_and_idle().

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-06 10:27:47 +01:00
Juergen Gross
d1e9abd630 xen: eliminate scalability issues from initrd handling
Size restrictions native kernels wouldn't have resulted from the initrd
getting mapped into the initial mapping. The kernel doesn't really need
the initrd to be mapped, so use infrastructure available in Xen to avoid
the mapping and hence the restriction.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-10-03 12:34:52 +01:00
David Vrabel
f955371ca9 x86: remove the Xen-specific _PAGE_IOMAP PTE flag
The _PAGE_IO_MAP PTE flag was only used by Xen PV guests to mark PTEs
that were used to map I/O regions that are 1:1 in the p2m.  This
allowed Xen to obtain the correct PFN when converting the MFNs read
from a PTE back to their PFN.

Xen guests no longer use _PAGE_IOMAP for this. Instead mfn_to_pfn()
returns the correct PFN by using a combination of the m2p and p2m to
determine if an MFN corresponds to a 1:1 mapping in the the p2m.

Remove _PAGE_IOMAP, replacing it with _PAGE_UNUSED2 to allow for
future uses of the PTE flag.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
2014-09-23 13:36:20 +00:00
David Vrabel
7f2f882245 x86/xen: do not use _PAGE_IOMAP PTE flag for I/O mappings
Since mfn_to_pfn() returns the correct PFN for identity mappings (as
used for MMIO regions), the use of _PAGE_IOMAP is not required in
pte_mfn_to_pfn().

Do not set the _PAGE_IOMAP flag in pte_pfn_to_mfn() and do not use it
in pte_mfn_to_pfn().

This will allow _PAGE_IOMAP to be removed, making it available for
future use.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-09-23 13:36:20 +00:00
Daniel Kiper
342cd340f6 xen/efi: Directly include needed headers
I discovered that some needed stuff is defined/declared in headers
which are not included directly. Currently it works but if somebody
remove required headers from currently included headers then build
will break. So, just in case directly include all needed headers.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-09-23 13:36:20 +00:00
Matt Rushton
4fbb67e3c8 xen/setup: Remap Xen Identity Mapped RAM
Instead of ballooning up and down dom0 memory this remaps the existing mfns
that were replaced by the identity map. The reason for this is that the
existing implementation ballooned memory up and and down which caused dom0
to have discontiguous pages. In some cases this resulted in the use of bounce
buffers which reduced network I/O performance significantly. This change will
honor the existing order of the pages with the exception of some boundary
conditions.

To do this we need to update both the Linux p2m table and the Xen m2p table.
Particular care must be taken when updating the p2m table since it's important
to limit table memory consumption and reuse the existing leaf pages which get
freed when an entire leaf page is set to the identity map. To implement this,
mapping updates are grouped into blocks with table entries getting cached
temporarily and then released.

On my test system before:
Total pages: 2105014
Total contiguous: 1640635

After:
Total pages: 2105014
Total contiguous: 2098904

Signed-off-by: Matthew Rushton <mrushton@amazon.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-09-23 13:36:18 +00:00
Igor Mammedov
ce4b1b1650 x86/smpboot: Initialize secondary CPU only if master CPU will wait for it
Hang is observed on virtual machines during CPU hotplug,
especially in big guests with many CPUs. (It reproducible
more often if host is over-committed).

It happens because master CPU gives up waiting on
secondary CPU and allows it to run wild. As result
AP causes locking or crashing system. For example
as described here:

  https://lkml.org/lkml/2014/3/6/257

If master CPU have sent STARTUP IPI successfully,
and AP signalled to master CPU that it's ready
to start initialization, make master CPU wait
indefinitely till AP is onlined.

To ensure that AP won't ever run wild, make it
wait at early startup till master CPU confirms its
intention to wait for AP. If AP doesn't respond in 10
seconds, the master CPU will timeout and cancel
AP onlining.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1403266991-12233-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-16 11:11:32 +02:00
Stefan Bader
0b5a50635f x86/xen: don't copy bogus duplicate entries into kernel page tables
When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
modules exceeds 512 MiB, then loading modules fails with a warning
(and hence a vmalloc allocation failure) because the PTEs for the
newly-allocated vmalloc address space are not zero.

  WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
           vmap_page_range_noflush+0x2a1/0x360()

This is caused by xen_setup_kernel_pagetables() copying
level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
entries.

Without KASLR, the normal kernel image size only covers the first half
of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

This allows 512 MiB of of module vmalloc space to be used before
having to use the corrupted level2_fixmap_pgt entries.

With KASLR enabled, the kernel image uses the full PUD range of 1G and
module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now no module vmalloc space can be used without using the corrupt
level2_fixmap_pgt entries.

Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
and setting level1_fixmap_pgt as read-only.

A number of comments were also using the the wrong L3 offset for
level2_kernel_pgt.  These have been corrected.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
2014-09-10 15:23:42 +01:00
Christoph Lameter
89cbc76768 x86: Replace __get_cpu_var uses
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less registers
are used when code is generated.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

	DEFINE_PER_CPU(int, y);
	int x = __get_cpu_var(y)

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y)
	__get_cpu_var(y) = x;

   Converts to

	__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++

   Converts to

	__this_cpu_inc(y)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:49 -04:00
David Vrabel
7d951f3ccb x86/xen: use vmap() to map grant table pages in PVH guests
Commit b7dd0e350e (x86/xen: safely map and unmap grant frames when
in atomic context) causes PVH guests to crash in
arch_gnttab_map_shared() when they attempted to map the pages for the
grant table.

This use of a PV-specific function during the PVH grant table setup is
non-obvious and not needed.  The standard vmap() function does the
right thing.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Tested-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Cc: stable@vger.kernel.org
2014-08-11 11:59:35 +01:00
David Vrabel
8d5999df35 x86/xen: resume timer irqs early
If the timer irqs are resumed during device resume it is possible in
certain circumstances for the resume to hang early on, before device
interrupts are resumed.  For an Ubuntu 14.04 PVHVM guest this would
occur in ~0.5% of resume attempts.

It is not entirely clear what is occuring the point of the hang but I
think a task necessary for the resume calls schedule_timeout(),
waiting for a timer interrupt (which never arrives).  This failure may
require specific tasks to be running on the other VCPUs to trigger
(processes are not frozen during a suspend/resume if PREEMPT is
disabled).

Add IRQF_EARLY_RESUME to the timer interrupts so they are resumed in
syscore_resume().

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
2014-08-11 11:59:34 +01:00
Linus Torvalds
e306e3be1c - Remove unused V2 grant table support.
- Note that Konrad is xen-blkkback/front maintainer.
 - Add 'xen_nopv' option to disable PV extentions for x86 HVM guests.
 - Misc. minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJT4N57AAoJEFxbo/MsZsTRtfsH/2GxmloKqMZqusnz5PR/x2hd
 M3aXtDw36rxv3hEciIs/NX6obMenRdofDKXVMafnU/gw+EOBQQQ2n/nDqcLOSN+0
 hVyrKHgByYQKaeAhAbrGiGIkuoe5JAURsaggx/YlYSx3hkE0za1XmcUjkPFEVP3l
 UeXXJ40H9hHgESsDwd1UQ08YNtvwdaWVHJAjio3jSxCBAHnAPhCqPhKVy/6LOr+U
 T6HgYsX9HLQRYBy34OOYfKBFnGOJpstnZJd3hMTYtrF4xaTl/Cnf+YxKxv/XJtGD
 YHukhQaEyws7RaDAXK1Uty1hlqgzDoVcFz1TixJIrF6YaO2QhhjMa/oYkbBW09s=
 =Ojrz
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.17-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen updates from David Vrabel:
 - remove unused V2 grant table support
 - note that Konrad is xen-blkkback/front maintainer
 - add 'xen_nopv' option to disable PV extentions for x86 HVM guests
 - misc minor cleanups

* tag 'stable/for-linus-3.17-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen-pciback: Document the 'quirks' sysfs file
  xen/pciback: Fix error return code in xen_pcibk_attach()
  xen/events: drop negativity check of unsigned parameter
  xen/setup: Remove Identity Map Debug Message
  xen/events/fifo: remove a unecessary use of BM()
  xen/events/fifo: ensure all bitops are properly aligned even on x86
  xen/events/fifo: reset control block and local HEADs on resume
  xen/arm: use BUG_ON
  xen/grant-table: remove support for V2 tables
  x86/xen: safely map and unmap grant frames when in atomic context
  MAINTAINERS: Make me the Xen block subsystem (front and back) maintainer
  xen: Introduce 'xen_nopv' to disable PV extensions for HVM guests.
2014-08-07 11:33:15 -07:00
Linus Torvalds
76f09aa464 Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI changes from Ingo Molnar:
 "Main changes in this cycle are:

   - arm64 efi stub fixes, preservation of FP/SIMD registers across
     firmware calls, and conversion of the EFI stub code into a static
     library - Ard Biesheuvel

   - Xen EFI support - Daniel Kiper

   - Support for autoloading the efivars driver - Lee, Chun-Yi

   - Use the PE/COFF headers in the x86 EFI boot stub to request that
     the stub be loaded with CONFIG_PHYSICAL_ALIGN alignment - Michael
     Brown

   - Consolidate all the x86 EFI quirks into one file - Saurabh Tangri

   - Additional error logging in x86 EFI boot stub - Ulf Winkelvos

   - Support loading initrd above 4G in EFI boot stub - Yinghai Lu

   - EFI reboot patches for ACPI hardware reduced platforms"

* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  efi/arm64: Handle missing virtual mapping for UEFI System Table
  arch/x86/xen: Silence compiler warnings
  xen: Silence compiler warnings
  x86/efi: Request desired alignment via the PE/COFF headers
  x86/efi: Add better error logging to EFI boot stub
  efi: Autoload efivars
  efi: Update stale locking comment for struct efivars
  arch/x86: Remove efi_set_rtc_mmss()
  arch/x86: Replace plain strings with constants
  xen: Put EFI machinery in place
  xen: Define EFI related stuff
  arch/x86: Remove redundant set_bit(EFI_MEMMAP) call
  arch/x86: Remove redundant set_bit(EFI_SYSTEM_TABLES) call
  efi: Introduce EFI_PARAVIRT flag
  arch/x86: Do not access EFI memory map if it is not available
  efi: Use early_mem*() instead of early_io*()
  arch/ia64: Define early_memunmap()
  x86/reboot: Add EFI reboot quirk for ACPI Hardware Reduced flag
  efi/reboot: Allow powering off machines using EFI
  efi/reboot: Add generic wrapper around EfiResetSystem()
  ...
2014-08-04 17:13:50 -07:00
Matt Rushton
3fbdf631b7 xen/setup: Remove Identity Map Debug Message
Removing a debug message for setting the identity map since it becomes
rather noisy after rework of the identity map code.

Signed-off-by: Matthew Rushton <mrushton@amazon.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-07-31 18:40:13 +01:00
Linus Torvalds
acba648dca Fix BUG when trying to expand the grant table. This seems to occur
often during boot with Ubuntu 14.04 PV guests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJT2PhgAAoJEFxbo/MsZsTRlzIH/1HjbkGZmRlOj5wcrYlWCUJ/
 DGLBHc76so52xd9oP8COT5tuSVP6/usPPLFaOmVZ7fMiOpoyz9d3lc0g56otw3gJ
 tTUFTyW0EoFtvmIl50OMC726p9azETjA3P2XJkV/D3GhBGGqgrP5uR+mRvisvq3y
 eGZEx1UIHv1jov47TBFR1NcckXBWw+6J9m34y9h6an9VNDCuuGwYZ8dfGAFsLrVb
 lGLTmgQQmyk4SexVINfOwL40KkVDVEq+X74HcPviyNHEIy66xLzMtKpL+Sf4xeuv
 VG3JhqAUGuRGGK48rrbpxhBbpxGp35O9RV68YrGssxfuTejSYduw5zTzzt30QIA=
 =cr8X
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen fix from David Vrabel:
 "Fix BUG when trying to expand the grant table.  This seems to occur
  often during boot with Ubuntu 14.04 PV guests"

* tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: safely map and unmap grant frames when in atomic context
2014-07-30 09:00:20 -07:00
David Vrabel
b7dd0e350e x86/xen: safely map and unmap grant frames when in atomic context
arch_gnttab_map_frames() and arch_gnttab_unmap_frames() are called in
atomic context but were calling alloc_vm_area() which might sleep.

Also, if a driver attempts to allocate a grant ref from an interrupt
and the table needs expanding, then the CPU may already by in lazy MMU
mode and apply_to_page_range() will BUG when it tries to re-enable
lazy MMU mode.

These two functions are only used in PV guests.

Introduce arch_gnttab_init() to allocates the virtual address space in
advance.

Avoid the use of apply_to_page_range() by using saving and using the
array of PTE addresses from the alloc_vm_area() call (which ensures
that the required page tables are pre-allocated).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-07-30 14:22:47 +01:00
Daniel Kiper
c7341d6a61 arch/x86/xen: Silence compiler warnings
Compiler complains in the following way when x86 32-bit kernel
with Xen support is build:

  CC      arch/x86/xen/enlighten.o
arch/x86/xen/enlighten.c: In function ‘xen_start_kernel’:
arch/x86/xen/enlighten.c:1726:3: warning: right shift count >= width of type [enabled by default]

Such line contains following EFI initialization code:

boot_params.efi_info.efi_systab_hi = (__u32)(__pa(efi_systab_xen) >> 32);

There is no issue if x86 64-bit kernel is build. However, 32-bit case
generate warning (even if that code will not be executed because Xen
does not work on 32-bit EFI platforms) due to __pa() returning unsigned long
type which has 32-bits width. So move whole EFI initialization stuff
to separate function and build it conditionally to avoid above mentioned
warning on x86 32-bit architecture.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-07-18 21:24:03 +01:00
Daniel Kiper
be81c8a1da xen: Put EFI machinery in place
This patch enables EFI usage under Xen dom0. Standard EFI Linux
Kernel infrastructure cannot be used because it requires direct
access to EFI data and code. However, in dom0 case it is not possible
because above mentioned EFI stuff is fully owned and controlled
by Xen hypervisor. In this case all calls from dom0 to EFI must
be requested via special hypercall which in turn executes relevant
EFI code in behalf of dom0.

When dom0 kernel boots it checks for EFI availability on a machine.
If it is detected then artificial EFI system table is filled.
Native EFI callas are replaced by functions which mimics them
by calling relevant hypercall. Later pointer to EFI system table
is passed to standard EFI machinery and it continues EFI subsystem
initialization taking into account that there is no direct access
to EFI boot services, runtime, tables, structures, etc. After that
system runs as usual.

This patch is based on Jan Beulich and Tang Liang work.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-07-18 21:23:58 +01:00
David Vrabel
438b33c714 xen/grant-table: remove support for V2 tables
Since 11c7ff17c9 (xen/grant-table: Force
to use v1 of grants.) the code for V2 grant tables is not used.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-07-14 15:28:30 -04:00
David Vrabel
162e371712 x86/xen: safely map and unmap grant frames when in atomic context
arch_gnttab_map_frames() and arch_gnttab_unmap_frames() are called in
atomic context but were calling alloc_vm_area() which might sleep.

Also, if a driver attempts to allocate a grant ref from an interrupt
and the table needs expanding, then the CPU may already by in lazy MMU
mode and apply_to_page_range() will BUG when it tries to re-enable
lazy MMU mode.

These two functions are only used in PV guests.

Introduce arch_gnttab_init() to allocates the virtual address space in
advance.

Avoid the use of apply_to_page_range() by using saving and using the
array of PTE addresses from the alloc_vm_area() call.

N.B. 'alloc_vm_area' pre-allocates the pagetable so there is no need
to worry about having to do a PGD/PUD/PMD walk (like apply_to_page_range
does) and we can instead do set_pte.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
----
[v2: Add comment about alloc_vm_area]
[v3: Fix compile error found by 0-day bot]
2014-07-14 15:28:02 -04:00
Konrad Rzeszutek Wilk
8d693b911b xen: Introduce 'xen_nopv' to disable PV extensions for HVM guests.
By default when CONFIG_XEN and CONFIG_XEN_PVHVM kernels are
run, they will enable the PV extensions (drivers, interrupts, timers,
etc) - which is the best option for the majority of use cases.

However, in some cases (kexec not fully working, benchmarking)
we want to disable Xen PV extensions. As such introduce the
'xen_nopv' parameter that will do it.

This parameter is intended only for HVM guests as the Xen PV
guests MUST boot with PV extensions. However, even if you use
'xen_nopv' on Xen PV guests it will be ignored.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
---
[v2: s/off/xen_nopv/ per Boris Ostrovsky recommendation.]
[v3: Add Reviewed-by]
[v4: Clarify that this is only for HVM guests]
2014-07-14 12:21:59 -04:00
Linus Torvalds
3d09c62394 xen: regression and PVH fixes for 3.16-rc1
- Fix dom0 PVH memory setup on latest unstable Xen releases.
 - Fix 64-bit x86 PV guest boot failure on Xen 3.1 and earlier.
 - Fix resume regression on non-PV (auto-translated physmap) guests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJToWpQAAoJEFxbo/MsZsTR3aIIAJksTufa6SBD/fgWUMP0jR25
 ePNSgvLt8OkDTYN4IGtQ64lGEnSyHeaSWfpuNs4tv5mynVSvT/AdK5euXPT/jju/
 Ct64okBoFiadyI0ZQ9elHh4+eXHZdoyzeVHY/Uzi4H+0viYZLrIJrDN50tcy00JN
 JQdBDenQODLG6lyAuF/DrVlN6ZXGGlGiF2qAnkV8m5dmWwuXQ9WYoU3RuzB63a+y
 VDxDCBJXwZMhBQbNhnxQ2pF6KC7yuexm1rHIpphkqJ1KWvo7p6GYkkvZv4H1Zik7
 XI0GeuXsdUWjDxO1sYTI3aJ2tgz7OIVr9gw+thbMfa8m+k7hkiUWbuk8a5miOgo=
 =W+9z
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen fixes from David Vrabel:
 "Xen regression and PVH fixes for 3.16-rc1

   - fix dom0 PVH memory setup on latest unstable Xen releases
   - fix 64-bit x86 PV guest boot failure on Xen 3.1 and earlier
   - fix resume regression on non-PV (auto-translated physmap) guests"

* tag 'stable/for-linus-3.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/grant-table: fix suspend for non-PV guests
  x86/xen: no need to explicitly register an NMI callback
  Revert "xen/pvh: Update E820 to work with PVH (v2)"
  x86/xen: fix memory setup for PVH dom0
2014-06-19 07:53:27 -10:00
David Vrabel
ea9f9274bf x86/xen: no need to explicitly register an NMI callback
Remove xen_enable_nmi() to fix a 64-bit guest crash when registering
the NMI callback on Xen 3.1 and earlier.

It's not needed since the NMI callback is set by a set_trap_table
hypercall (in xen_load_idt() or xen_write_idt_entry()).

It's also broken since it only set the current VCPU's callback.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
2014-06-18 10:57:41 +01:00
Linus Torvalds
a0abcf2e8f Merge branch 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull x86 cdso updates from Peter Anvin:
 "Vdso cleanups and improvements largely from Andy Lutomirski.  This
  makes the vdso a lot less ''special''"

* 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso, build: Make LE access macros clearer, host-safe
  x86/vdso, build: Fix cross-compilation from big-endian architectures
  x86/vdso, build: When vdso2c fails, unlink the output
  x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
  x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
  x86, mm: Improve _install_special_mapping and fix x86 vdso naming
  mm, fs: Add vm_ops->name as an alternative to arch_vma_name
  x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
  x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
  x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
  x86, vdso: Move the 32-bit vdso special pages after the text
  x86, vdso: Reimplement vdso.so preparation in build-time C
  x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
  x86, vdso: Clean up 32-bit vs 64-bit vdso params
  x86, mm: Ensure correct alignment of the fixmap
2014-06-05 08:05:29 -07:00
David Vrabel
562658f3dc Revert "xen/pvh: Update E820 to work with PVH (v2)"
This reverts commit 9103bb0f82.

Now than xen_memory_setup() is not called for auto-translated guests,
we can remove this commit.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Roger Pau Monné <roger.pau@citrix.com>
2014-06-05 14:23:16 +01:00
David Vrabel
abacaadc41 x86/xen: fix memory setup for PVH dom0
Since af06d66ee32b (x86: fix setup of PVH Dom0 memory map) in Xen, PVH
dom0 need only use the memory memory provided by Xen which has already
setup all the correct holes.

xen_memory_setup() then ends up being trivial for a PVH guest so
introduce a new function (xen_auto_xlated_memory_setup()).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Roger Pau Monné <roger.pau@citrix.com>
2014-06-05 14:22:27 +01:00
Linus Torvalds
9f888b3a10 xen: features and fixes for 3.16-rc0
- Support foreign mappings in PVH domains (needed when dom0 is PVH)
 
 - Fix mapping high MMIO regions in x86 PV guests (this is also the
   first half of removing the PAGE_IOMAP PTE flag).
 
 - ARM suspend/resume support.
 
 - ARM multicall support.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJTjE5MAAoJEFxbo/MsZsTRtl8H/2lfS9w05e60vRxjolPV0vRc
 5k9DcYFeJ+k2cz/2T3mNlIvKdfBTesSfgVquH+28GhQz+uKFQ1OrJpYNDTougSw5
 Wv0Ae8e+7eLABvJ9XMiZdDsPzsICw2wqWOvqrnQi2qR3SIimBc5tBigR4+Rccv+e
 btuBLlYT4WPQ8qgNyCBPgxzuyxteu5wK/0XryX6NcbrxeEbAzQAeDKkmvCD4fSvx
 KxrwTO3mwV4Lefmf/WS4Z9fDcPujQOUqKEtUWanw/2JalO1BzDPo+1wvYs0LduLC
 QI/YJN4SL3UeGOmbX2tyIaRgMsAcQVVrYkTm1cp8eD7vcRuvXaqy6dxuX05+V4g=
 =cxfG
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.16-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into next

Pull Xen updates from David Vrabel:
 "xen: features and fixes for 3.16-rc0
   - support foreign mappings in PVH domains (needed when dom0 is PVH)

   - fix mapping high MMIO regions in x86 PV guests (this is also the
     first half of removing the PAGE_IOMAP PTE flag).

   - ARM suspend/resume support.

   - ARM multicall support"

* tag 'stable/for-linus-3.16-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: map foreign pfns for autotranslated guests
  xen-acpi-processor: Don't display errors when we get -ENOSYS
  xen/pciback: Document the entry points for 'pcistub_put_pci_dev'
  xen/pciback: Document when the 'unbind' and 'bind' functions are called.
  xen-pciback: Document when we FLR an PCI device.
  xen-pciback: First reset, then free.
  xen-pciback: Cleanup up pcistub_put_pci_dev
  x86/xen: do not use _PAGE_IOMAP in xen_remap_domain_mfn_range()
  x86/xen: set regions above the end of RAM as 1:1
  x86/xen: only warn once if bad MFNs are found during setup
  x86/xen: compactly store large identity ranges in the p2m
  x86/xen: fix set_phys_range_identity() if pfn_e > MAX_P2M_PFN
  x86/xen: rename early_p2m_alloc() and early_p2m_alloc_middle()
  xen/x86: set panic notifier priority to minimum
  arm,arm64/xen: introduce HYPERVISOR_suspend()
  xen: refactor suspend pre/post hooks
  arm: xen: export HYPERVISOR_multicall to modules.
  arm64: introduce virt_to_pfn
  arm/xen: Remove definiition of virt_to_pfn in asm/xen/page.h
  arm: xen: implement multicall hypercall support.
2014-06-02 08:24:12 -07:00
Mukesh Rathor
77945ca73e x86/xen: map foreign pfns for autotranslated guests
When running as a dom0 in PVH mode, foreign pfns that are accessed
must be added to our p2m which is managed by xen. This is done via
XENMEM_add_to_physmap_range hypercall. This is needed for toolstack
building guests and mapping guest memory, xentrace mapping xen pages,
etc.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-05-27 11:47:04 +01:00
H. Peter Anvin
03c1b4e8e5 Merge remote-tracking branch 'origin/x86/espfix' into x86/vdso
Merge x86/espfix into x86/vdso, due to changes in the vdso setup code
that otherwise cause conflicts.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-21 17:36:33 -07:00
David Vrabel
f59c5145dc x86/xen: do not use _PAGE_IOMAP in xen_remap_domain_mfn_range()
_PAGE_IOMAP is used in xen_remap_domain_mfn_range() to prevent the
pfn_pte() call in remap_area_mfn_pte_fn() from using the p2m to translate
the MFN.  If mfn_pte() is used instead, the p2m look up is avoided and
the use of _PAGE_IOMAP is no longer needed.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-15 16:16:40 +01:00
David Vrabel
25b884a83d x86/xen: set regions above the end of RAM as 1:1
PCI devices may have BARs located above the end of RAM so mark such
frames as identity frames in the p2m (instead of the default of
missing).

PFNs outside the p2m (above MAX_P2M_PFN) are also considered to be
identity frames for the same reason.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-15 16:15:18 +01:00
David Vrabel
2dcc9a3de1 x86/xen: only warn once if bad MFNs are found during setup
In xen_add_extra_mem(), if the WARN() checks for bad MFNs trigger it is
likely that they will trigger at lot, spamming the log.

Use WARN_ONCE() instead.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-15 16:15:01 +01:00
David Vrabel
3cb83e46d0 x86/xen: compactly store large identity ranges in the p2m
Large (multi-GB) identity ranges currently require a unique middle page
(filled with p2m_identity entries) per 1 GB region.

Similar to the common p2m_mid_missing middle page for large missing
regions, introduce a p2m_mid_identity page (filled with p2m_identity
entries) which can be used instead.

set_phys_range_identity() thus only needs to allocate new middle pages
at the beginning and end of the range.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-15 16:14:44 +01:00
David Vrabel
a9b5bff66b x86/xen: fix set_phys_range_identity() if pfn_e > MAX_P2M_PFN
Allow set_phys_range_identity() to work with a range that overlaps
MAX_P2M_PFN by clamping pfn_e to MAX_P2M_PFN.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-15 16:13:23 +01:00
David Vrabel
fcca2e3119 x86/xen: rename early_p2m_alloc() and early_p2m_alloc_middle()
early_p2m_alloc_middle() allocates a new leaf page and
early_p2m_alloc() allocates a new middle page.  This is confusing.

Swap the names so they match what the functions actually do.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-05-15 16:12:25 +01:00
Radim Krčmář
bc5eb20161 xen/x86: set panic notifier priority to minimum
Execution is not going to continue after telling Xen about the crash.
Let other panic notifiers run by postponing the final hypercall as much
as possible.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-05-15 15:54:57 +01:00
David Vrabel
aa8532c322 xen: refactor suspend pre/post hooks
New architectures currently have to provide implementations of 5 different
functions: xen_arch_pre_suspend(), xen_arch_post_suspend(),
xen_arch_hvm_post_suspend(), xen_mm_pin_all(), and xen_mm_unpin_all().

Refactor the suspend code to only require xen_arch_pre_suspend() and
xen_arch_post_suspend().

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2014-05-12 17:19:56 +01:00
Andi Kleen
2605fc216f asmlinkage, x86: Add explicit __visible to arch/x86/*
As requested by Linus add explicit __visible to the asmlinkage users.
This marks all functions visible to assembler.

Tree sweep for arch/x86/*

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1398984278-29319-3-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 16:07:44 -07:00
Andy Lutomirski
f40c330091 x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
This makes the 64-bit and x32 vdsos use the same mechanism as the
32-bit vdso.  Most of the churn is deleting all the old fixmap code.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/8af87023f57f6bb96ec8d17fce3f88018195b49b.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 13:19:01 -07:00
Andy Lutomirski
6f121e548f x86, vdso: Reimplement vdso.so preparation in build-time C
Currently, vdso.so files are prepared and analyzed by a combination
of objcopy, nm, some linker script tricks, and some simple ELF
parsers in the kernel.  Replace all of that with plain C code that
runs at build time.

All five vdso images now generate .c files that are compiled and
linked in to the kernel image.

This should cause only one userspace-visible change: the loaded vDSO
images are stripped more heavily than they used to be.  Everything
outside the loadable segment is dropped.  In particular, this causes
the section table and section name strings to be missing.  This
should be fine: real dynamic loaders don't load or inspect these
tables anyway.  The result is roughly equivalent to eu-strip's
--strip-sections option.

The purpose of this change is to enable the vvar and hpet mappings
to be moved to the page following the vDSO load segment.  Currently,
it is possible for the section table to extend into the page after
the load segment, so, if we map it, it risks overlapping the vvar or
hpet page.  This happens whenever the load segment is just under a
multiple of PAGE_SIZE.

The only real subtlety here is that the old code had a C file with
inline assembler that did 'call VDSO32_vsyscall' and a linker script
that defined 'VDSO32_vsyscall = __kernel_vsyscall'.  This most
likely worked by accident: the linker script entry defines a symbol
associated with an address as opposed to an alias for the real
dynamic symbol __kernel_vsyscall.  That caused ld to relocate the
reference at link time instead of leaving an interposable dynamic
relocation.  Since the VDSO32_vsyscall hack is no longer needed, I
now use 'call __kernel_vsyscall', and I added -Bsymbolic to make it
work.  vdso2c will generate an error and abort the build if the
resulting image contains any dynamic relocations, so we won't
silently generate bad vdso images.

(Dynamic relocations are a problem because nothing will even attempt
to relocate the vdso.)

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2c4fcf45524162a34d87fdda1eb046b2a5cecee7.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 13:18:51 -07:00
Linus Torvalds
88764e0a3e Xen regression and bug fixes for 3.15-rc1.
- Fix completely broken 32-bit PV guests caused by x86 refactoring
   32-bit thread_info.
 - Only enable ticketlock slow path on Xen (not bare metal).
 - Fix two bugs with PV guests not shutting down when requested.
 - Fix a minor memory leak in xen-pciback error path.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQEcBAABAgAGBQJTT/hQAAoJEFxbo/MsZsTR6sMIAJs7mJXSqDQn3Z8O+TemRa53
 p92ZomTNYALjUMglXcxJ2Zua6IsZMWdu7jcV1GoXC70V4YLmUs8KaBgZmI5ayUQy
 bBpK+6WIAJyBkJdNH5fK3wggJ2UZjw0/twPNgd9gACwjUiYhx8iHN/hTGvu4qPBJ
 MGAIlg6wdnGwRydi72uk9Am/xpebEdQy4DRD20vjwA/qUkT4uHVv/AA4hc4AK29w
 ToK8qFSisgAlahcmq8/T4+OBFEKz78b9dQcdsGWyAk0ofWILfwD1l53xhzUin25s
 JUVevWhhLCKRZBOq4Ykc5qyqnLff4m56rm/THQ6f0oRdJn/OR+SWOImda2Qqmvs=
 =Gxpq
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen fixes from David Vrabel:
 "Xen regression and bug fixes for 3.15-rc1:

   - fix completely broken 32-bit PV guests caused by x86 refactoring
     32-bit thread_info.
   - only enable ticketlock slow path on Xen (not bare metal)
   - fix two bugs with PV guests not shutting down when requested
   - fix a minor memory leak in xen-pciback error path"

* tag 'stable/for-linus-3.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/manage: Poweroff forcefully if user-space is not yet up.
  xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart.
  xen/spinlock: Don't enable them unconditionally.
  xen-pciback: silence an unwanted debug printk
  xen: fix memory leak in __xen_pcibk_add_pci_dev()
  x86/xen: Fix 32-bit PV guests's usage of kernel_stack
2014-04-17 10:54:07 -07:00
Konrad Rzeszutek Wilk
e0fc17a936 xen/spinlock: Don't enable them unconditionally.
The git commit a945928ea2
('xen: Do not enable spinlocks before jump_label_init() has executed')
was added to deal with the jump machinery. Earlier the code
that turned on the jump label was only called by Xen specific
functions. But now that it had been moved to the initcall machinery
it gets called on Xen, KVM, and baremetal - ouch!. And the detection
machinery to only call it on Xen wasn't remembered in the heat
of merge window excitement.

This means that the slowpath is enabled on baremetal while it should
not be.

Reported-by: Waiman Long <waiman.long@hp.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: stable@vger.kernel.org
CC: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-04-15 17:41:28 +01:00
Boris Ostrovsky
4461bbc05b x86/xen: Fix 32-bit PV guests's usage of kernel_stack
Commit 198d208df4 ("x86: Keep
thread_info on thread stack in x86_32") made 32-bit kernels use
kernel_stack to point to thread_info. That change missed a couple of
updates needed by Xen's 32-bit PV guests:

1. kernel_stack needs to be initialized for secondary CPUs

2. GET_THREAD_INFO() now uses %fs register which may not be the
   kernel's version when executing xen_iret().

With respect to the second issue, we don't need GET_THREAD_INFO()
anymore: we used it as an intermediate step to get to per_cpu xen_vcpu
and avoid referencing %fs. Now that we are going to use %fs anyway we
may as well go directly to xen_vcpu.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-04-15 15:00:14 +01:00
David Vrabel
2c5cb27703 Merge commit '683b6c6f82a60fabf47012581c2cfbf1b037ab95' into stable/for-linus-3.15
This merge of the irq-core-for-linus branch broke the ARM build when
Xen is enabled.

Conflicts:
	drivers/xen/events/events_base.c
2014-04-07 13:52:12 +01:00
Linus Torvalds
a372c967a3 Support PCI devices with multiple MSIs, performance improvement for
kernel-based backends (by not populated m2p overrides when mapping),
 and assorted minor bug fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQEcBAABAgAGBQJTPTGyAAoJEFxbo/MsZsTRnjgH/10j5CbOK1RFvIyCSslGTf4G
 slhK8P8dhhplGAxwXXji322lWNYEx9Jd+V0Bhxnvr4drSlsP/qkWuBWf+u1LBvRq
 AVPM99tk0XHCVAuvMMNo/lc62dTIR9IpQvnY6WhHSHnSlfqyVcdnbaGk8/LRuxWJ
 u2F0MXzDNH00b/kt6hDBt3F7CkHfjwsEn43LCkkxyHPp5MJGD7bGDIe+bKtnjv9u
 D9VJtCWQkrjWQ6jNpjdP833JCNCGQrXtVO3DeTAGs3T1tGmiEsqp6kT6Gp5zCFnh
 oaQk9jfQL2S+IVnVhHVMW9nTwNPPrnIrD69FlgTrK301mcYW1mKoFotTogzHu+0=
 =2IG+
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen features and fixes from David Vrabel:
 "Support PCI devices with multiple MSIs, performance improvement for
  kernel-based backends (by not populated m2p overrides when mapping),
  and assorted minor bug fixes and cleanups"

* tag 'stable/for-linus-3.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/acpi-processor: fix enabling interrupts on syscore_resume
  xen/grant-table: Refactor gnttab_[un]map_refs to avoid m2p_override
  xen: remove XEN_PRIVILEGED_GUEST
  xen: add support for MSI message groups
  xen-pciback: Use pci_enable_msix_exact() instead of pci_enable_msix()
  xen/xenbus: remove unused xenbus_bind_evtchn()
  xen/events: remove unnecessary call to bind_evtchn_to_cpu()
  xen/events: remove the unused resend_irq_on_evtchn()
  drivers:xen-selfballoon:reset 'frontswap_inertia_counter' after frontswap_shrink
  drivers: xen: Include appropriate header file in pcpu.c
  drivers: xen: Mark function as static in platform-pci.c
2014-04-03 14:01:37 -07:00
Linus Torvalds
467cbd207a Merge branch 'x86-nuke-platforms-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 old platform removal from Peter Anvin:
 "This patchset removes support for several completely obsolete
  platforms, where the maintainers either have completely vanished or
  acked the removal.  For some of them it is questionable if there even
  exists functional specimens of the hardware"

Geert Uytterhoeven apparently thought this was a April Fool's pull request ;)

* 'x86-nuke-platforms-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, platforms: Remove NUMAQ
  x86, platforms: Remove SGI Visual Workstation
  x86, apic: Remove support for IBM Summit/EXA chipset
  x86, apic: Remove support for ia32-based Unisys ES7000
2014-04-02 13:15:58 -07:00
Linus Torvalds
c6f21243ce Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso changes from Peter Anvin:
 "This is the revamp of the 32-bit vdso and the associated cleanups.

  This adds timekeeping support to the 32-bit vdso that we already have
  in the 64-bit vdso.  Although 32-bit x86 is legacy, it is likely to
  remain in the embedded space for a very long time to come.

  This removes the traditional COMPAT_VDSO support; the configuration
  variable is reused for simply removing the 32-bit vdso, which will
  produce correct results but obviously suffer a performance penalty.
  Only one beta version of glibc was affected, but that version was
  unfortunately included in one OpenSUSE release.

  This is not the end of the vdso cleanups.  Stefani and Andy have
  agreed to continue work for the next kernel cycle; in fact Andy has
  already produced another set of cleanups that came too late for this
  cycle.

  An incidental, but arguably important, change is that this ensures
  that unused space in the VVAR page is properly zeroed.  It wasn't
  before, and would contain whatever garbage was left in memory by BIOS
  or the bootloader.  Since the VVAR page is accessible to user space
  this had the potential of information leaks"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86, vdso: Fix the symbol versions on the 32-bit vDSO
  x86, vdso, build: Don't rebuild 32-bit vdsos on every make
  x86, vdso: Actually discard the .discard sections
  x86, vdso: Fix size of get_unmapped_area()
  x86, vdso: Finish removing VDSO32_PRELINK
  x86, vdso: Move more vdso definitions into vdso.h
  x86: Load the 32-bit vdso in place, just like the 64-bit vdsos
  x86, vdso32: handle 32 bit vDSO larger one page
  x86, vdso32: Disable stack protector, adjust optimizations
  x86, vdso: Zero-pad the VVAR page
  x86, vdso: Add 32 bit VDSO time support for 64 bit kernel
  x86, vdso: Add 32 bit VDSO time support for 32 bit kernel
  x86, vdso: Patch alternatives in the 32-bit VDSO
  x86, vdso: Introduce VVAR marco for vdso32
  x86, vdso: Cleanup __vdso_gettimeofday()
  x86, vdso: Replace VVAR(vsyscall_gtod_data) by gtod macro
  x86, vdso: __vdso_clock_gettime() cleanup
  x86, vdso: Revamp vclock_gettime.c
  mm: Add new func _install_special_mapping() to mmap.c
  x86, vdso: Make vsyscall_gtod_data handling x86 generic
  ...
2014-04-02 12:26:43 -07:00
Linus Torvalds
683b6c6f82 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq code updates from Thomas Gleixner:
 "The irq department proudly presents:

   - Another tree wide sweep of irq infrastructure abuse.  Clear winner
     of the trainwreck engineering contest was:
         #include "../../../kernel/irq/settings.h"

   - Tree wide update of irq_set_affinity() callbacks which miss a cpu
     online check when picking a single cpu out of the affinity mask.

   - Tree wide consolidation of interrupt statistics.

   - Updates to the threaded interrupt infrastructure to allow explicit
     wakeup of the interrupt thread and a variant of synchronize_irq()
     which synchronizes only the hard interrupt handler.  Both are
     needed to replace the homebrewn thread handling in the mmc/sdhci
     code.

   - New irq chip callbacks to allow proper support for GPIO based irqs.
     The GPIO based interrupts need to request/release GPIO resources
     from request/free_irq.

   - A few new ARM interrupt chips.  No revolutionary new hardware, just
     differently wreckaged variations of the scheme.

   - Small improvments, cleanups and updates all over the place"

I was hoping that that trainwreck engineering contest was a April Fools'
joke.  But no.

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
  irqchip: sun7i/sun6i: Disable NMI before registering the handler
  ARM: sun7i/sun6i: dts: Fix IRQ number for sun6i NMI controller
  ARM: sun7i/sun6i: irqchip: Update the documentation
  ARM: sun7i/sun6i: dts: Add NMI irqchip support
  ARM: sun7i/sun6i: irqchip: Add irqchip driver for NMI controller
  genirq: Export symbol no_action()
  arm: omap: Fix typo in ams-delta-fiq.c
  m68k: atari: Fix the last kernel_stat.h fallout
  irqchip: sun4i: Simplify sun4i_irq_ack
  irqchip: sun4i: Use handle_fasteoi_irq for all interrupts
  genirq: procfs: Make smp_affinity values go+r
  softirq: Add linux/irq.h to make it compile again
  m68k: amiga: Add linux/irq.h to make it compile again
  irqchip: sun4i: Don't ack IRQs > 0, fix acking of IRQ 0
  irqchip: sun4i: Fix a comment about mask register initialization
  irqchip: sun4i: Fix irq 0 not working
  genirq: Add a new IRQCHIP_EOI_THREADED flag
  genirq: Document IRQCHIP_ONESHOT_SAFE flag
  ARM: sunxi: dt: Convert to the new irq controller compatibles
  irqchip: sunxi: Change compatibles
  ...
2014-04-01 11:22:57 -07:00
David Vrabel
5926f87fda Revert "xen: properly account for _PAGE_NUMA during xen pte translations"
This reverts commit a9c8e4beee.

PTEs in Xen PV guests must contain machine addresses if _PAGE_PRESENT
is set and pseudo-physical addresses is _PAGE_PRESENT is clear.

This is because during a domain save/restore (migration) the page
table entries are "canonicalised" and uncanonicalised". i.e., MFNs are
converted to PFNs during domain save so that on a restore the page
table entries may be rewritten with the new MFNs on the destination.
This canonicalisation is only done for PTEs that are present.

This change resulted in writing PTEs with MFNs if _PAGE_PROTNONE (or
_PAGE_NUMA) was set but _PAGE_PRESENT was clear.  These PTEs would be
migrated as-is which would result in unexpected behaviour in the
destination domain.  Either a) the MFN would be translated to the
wrong PFN/page; b) setting the _PAGE_PRESENT bit would clear the PTE
because the MFN is no longer owned by the domain; or c) the present
bit would not get set.

Symptoms include "Bad page" reports when munmapping after migrating a
domain.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>        [3.12+]
2014-03-25 11:11:42 +00:00
Zoltan Kiss
1429d46df4 xen/grant-table: Refactor gnttab_[un]map_refs to avoid m2p_override
The grant mapping API does m2p_override unnecessarily: only gntdev needs it,
for blkback and future netback patches it just cause a lock contention, as
those pages never go to userspace. Therefore this series does the following:
- the bulk of the original function (everything after the mapping hypercall)
  is moved to arch-dependent set/clear_foreign_p2m_mapping
- the "if (xen_feature(XENFEAT_auto_translated_physmap))" branch goes to ARM
- therefore the ARM function could be much smaller, the m2p_override stubs
  could be also removed
- on x86 the set_phys_to_machine calls were moved up to this new funcion
  from m2p_override functions
- and m2p_override functions are only called when there is a kmap_ops param

It also removes a stray space from arch/x86/include/asm/xen/page.h.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Suggested-by: Anthony Liguori <aliguori@amazon.com>
Suggested-by: David Vrabel <david.vrabel@citrix.com>
Suggested-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-03-18 14:40:19 +00:00
Michael Opdenacker
395edbb80b xen: remove XEN_PRIVILEGED_GUEST
This patch removes the Kconfig symbol XEN_PRIVILEGED_GUEST which is
used nowhere in the tree.

We do know grub2 has a script that greps kernel configuration files for
its macro. It shouldn't do that. As Linus summarized:
    This is a grub bug. It really is that simple. Treat it as one.

Besides, grub2's grepping for that macro is actually superfluous. See,
that script currently contains this test (simplified):
    grep -x CONFIG_XEN_DOM0=y $config || grep -x CONFIG_XEN_PRIVILEGED_GUEST=y $config

But since XEN_DOM0 and XEN_PRIVILEGED_GUEST are by definition equal,
removing XEN_PRIVILEGED_GUEST cannot influence this test.

So there's no reason to not remove this symbol, like we do with all
unused Kconfig symbols.

[pebolle@tiscali.nl: rewrote commit explanation.]
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-03-18 14:40:19 +00:00
H. Peter Anvin
1f2cbcf648 x86, vdso, xen: Remove stray reference to FIX_VDSO
Checkin

    b0b49f2673 x86, vdso: Remove compat vdso support

... removed the VDSO from the fixmap, and thus FIX_VDSO; remove a
stray reference in Xen.

Found by Fengguang Wu's test robot.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Link: http://lkml.kernel.org/r/4bb4690899106eb11430b1186d5cc66ca9d1660c.1394751608.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-03-13 19:44:47 -07:00
Thomas Gleixner
770144ea7b x86: Xen: Use the core irq stats function
Let the core do the irq_desc resolution.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xen <xen-devel@lists.xenproject.org>
Cc: x86 <x86@kernel.org>
Link: http://lkml.kernel.org/r/20140223212737.869264085@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-03-04 17:37:53 +01:00
H. Peter Anvin
c5f9ee3d66 x86, platforms: Remove SGI Visual Workstation
The SGI Visual Workstation seems to be dead; remove support so we
don't have to continue maintaining it.

Cc: Andrey Panin <pazke@donpac.ru>
Cc: Michael Reed <mdr@sgi.com>
Link: http://lkml.kernel.org/r/530CFD6C.7040705@zytor.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-27 08:07:39 -08:00
Mel Gorman
a9c8e4beee xen: properly account for _PAGE_NUMA during xen pte translations
Steven Noonan forwarded a users report where they had a problem starting
vsftpd on a Xen paravirtualized guest, with this in dmesg:

  BUG: Bad page map in process vsftpd  pte:8000000493b88165 pmd:e9cc01067
  page:ffffea00124ee200 count:0 mapcount:-1 mapping:     (null) index:0x0
  page flags: 0x2ffc0000000014(referenced|dirty)
  addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping:          (null) index:7f97eea74
  CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
  Call Trace:
    dump_stack+0x45/0x56
    print_bad_pte+0x22e/0x250
    unmap_single_vma+0x583/0x890
    unmap_vmas+0x65/0x90
    exit_mmap+0xc5/0x170
    mmput+0x65/0x100
    do_exit+0x393/0x9e0
    do_group_exit+0xcc/0x140
    SyS_exit_group+0x14/0x20
    system_call_fastpath+0x1a/0x1f
  Disabling lock debugging due to kernel taint
  BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
  BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1

The issue could not be reproduced under an HVM instance with the same
kernel, so it appears to be exclusive to paravirtual Xen guests.  He
bisected the problem to commit 1667918b64 ("mm: numa: clear numa
hinting information on mprotect") that was also included in 3.12-stable.

The problem was related to how xen translates ptes because it was not
accounting for the _PAGE_NUMA bit.  This patch splits pte_present to add
a pteval_present helper for use by xen so both bare metal and xen use
the same code when checking if a PTE is present.

[mgorman@suse.de: wrote changelog, proposed minor modifications]
[akpm@linux-foundation.org: fix typo in comment]
Reported-by: Steven Noonan <steven@uplinklabs.net>
Tested-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10 16:01:41 -08:00
Linus Torvalds
1cd731df09 Bug-fixes:
- Revert "xen/grant-table: Avoid m2p_override during mapping" as it broke Xen ARM build.
  - Fix CR4 not being set on AP processors in Xen PVH mode.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJS8AyQAAoJEFjIrFwIi8fJbD4IAJssMuaLI5CRsSWBgDFHHDFt
 srVJpDOYQiDr/TxkwFCVcL4sFy9Htb3KMArU4eIBl6uMqQbGa+3rHyXcHYI219YY
 XH3D8RG+9JChwsxtaeUEzwx1C8ehcygD34vtdcoQXa7eBuEi4TL3HeLifR+HrXKO
 UdFrTA34FmvpVFbSuRXkZh5sd6ca9et9xHuQHM8SIY6pVokY6xaEYOp17tfPZpwM
 7A6LFjUjXeugHC2L3+/H8UOHA9nSZQvnMiZOWq2Cusc2Dt2V7emzgk2wcc2CHttf
 EA6GbtiJzHqMPmt5EjubI9hHdSMB31HpY4hnQE38+ucl+BwiSdRE9z2Rm4TYClg=
 =IX4M
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen fixes from Konrad Rzeszutek Wilk:
 "Bug-fixes:
   - Revert "xen/grant-table: Avoid m2p_override during mapping" as it
     broke Xen ARM build.
   - Fix CR4 not being set on AP processors in Xen PVH mode"

* tag 'stable/for-linus-3.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pvh: set CR4 flags for APs
  Revert "xen/grant-table: Avoid m2p_override during mapping"
2014-02-05 16:01:11 -08:00
Mukesh Rathor
afca50132c xen/pvh: set CR4 flags for APs
During bootup in the 'probe_page_size_mask' these CR4 flags are
set in there. But for AP processors they are not set as we do not
use 'secondary_startup_64' which the baremetal kernels uses.
Instead do it in this function which we use in Xen PVH during our
startup for AP processors.

As such fix it up to make sure we have that flag set.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-03 15:44:18 -05:00
Konrad Rzeszutek Wilk
e85fc98055 Revert "xen/grant-table: Avoid m2p_override during mapping"
This reverts commit 08ece5bb23.

As it breaks ARM builds and needs more attention
on the ARM side.

Acked-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-02-03 06:44:49 -05:00
Linus Torvalds
14164b46fc Bug-fixes:
- Xen ARM couldn't use the new FIFO events
  - Xen ARM couldn't use the SWIOTLB if compiled as 32-bit with 64-bit PCIe devices.
  - Grant table were doing needless M2P operations.
  - Ratchet down the self-balloon code so it won't OOM.
  - Fix misplaced kfree in Xen PVH error code paths.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJS68IQAAoJEFjIrFwIi8fJAWgH/j4HStEey3rgGcqwIWSHkkap
 +t55wsrT8Ylq6CzZjaUtCo3pB7HotW526x/0rA2pxVqHn/8oCN/1EtdrNtYm/umX
 qOoda+db5NIjAEGVLWSLqGyokJQDrX/brXIWfYR300e9fnJi7yT/rFC4QHoZVUYl
 5LME8XH/jE012vvYelNu6DbbodlRmVCT8hctJS+eB5ER2WmtD9Pkw4GybEXPVYJz
 hE0Ts1DN91nKP2FGJb+mfB9UFT5X8i00akAK+Qc1R3sRnRh6eRoNV8dgyCnudKpO
 UPEdiAZvgij+mzlgIYSz6nKH0U/VbvRsG3lc3i5Si3o+vR3CYPCkvzOGX2d0rjw=
 =7cxW
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.14-rc0-late-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen bugfixes from Konrad Rzeszutek Wilk:
 "Bug-fixes for the new features that were added during this cycle.

  There are also two fixes for long-standing issues for which we have a
  solution: grant-table operations extra work that was not needed
  causing performance issues and the self balloon code was too
  aggressive causing OOMs.

  Details:
   - Xen ARM couldn't use the new FIFO events
   - Xen ARM couldn't use the SWIOTLB if compiled as 32-bit with 64-bit PCIe devices.
   - Grant table were doing needless M2P operations.
   - Ratchet down the self-balloon code so it won't OOM.
   - Fix misplaced kfree in Xen PVH error code paths"

* tag 'stable/for-linus-3.14-rc0-late-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pvh: Fix misplaced kfree from xlated_setup_gnttab_pages
  drivers: xen: deaggressive selfballoon driver
  xen/grant-table: Avoid m2p_override during mapping
  xen/gnttab: Use phys_addr_t to describe the grant frame base address
  xen: swiotlb: handle sizeof(dma_addr_t) != sizeof(phys_addr_t)
  arm/xen: Initialize event channels earlier
2014-01-31 08:38:18 -08:00
Dave Jones
f93576e1ac xen/pvh: Fix misplaced kfree from xlated_setup_gnttab_pages
Passing a freed 'pages' to free_xenballooned_pages will end badly
on kernels with slub debug enabled.

This looks out of place between the rc assign and the check, but
we do want to kfree pages regardless of which path we take.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-31 09:48:58 -05:00
Zoltan Kiss
08ece5bb23 xen/grant-table: Avoid m2p_override during mapping
The grant mapping API does m2p_override unnecessarily: only gntdev needs it,
for blkback and future netback patches it just cause a lock contention, as
those pages never go to userspace. Therefore this series does the following:
- the original functions were renamed to __gnttab_[un]map_refs, with a new
  parameter m2p_override
- based on m2p_override either they follow the original behaviour, or just set
  the private flag and call set_phys_to_machine
- gnttab_[un]map_refs are now a wrapper to call __gnttab_[un]map_refs with
  m2p_override false
- a new function gnttab_[un]map_refs_userspace provides the old behaviour

It also removes a stray space from page.h and change ret to 0 if
XENFEAT_auto_translated_physmap, as that is the only possible return value
there.

v2:
- move the storing of the old mfn in page->index to gnttab_map_refs
- move the function header update to a separate patch

v3:
- a new approach to retain old behaviour where it needed
- squash the patches into one

v4:
- move out the common bits from m2p* functions, and pass pfn/mfn as parameter
- clear page->private before doing anything with the page, so m2p_find_override
  won't race with this

v5:
- change return value handling in __gnttab_[un]map_refs
- remove a stray space in page.h
- add detail why ret = 0 now at some places

v6:
- don't pass pfn to m2p* functions, just get it locally

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Suggested-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-31 09:48:32 -05:00
Linus Torvalds
12f2bbd609 Merge branch 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asmlinkage (LTO) changes from Peter Anvin:
 "This patchset adds more infrastructure for link time optimization
  (LTO).

  This patchset was pulled into my tree late because of a
  miscommunication (part of the patchset was picked up by other
  maintainers).  However, the patchset is strictly build-related and
  seems to be okay in testing"

* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, asmlinkage, xen: Fix type of NMI
  x86, asmlinkage, xen, kvm: Make {xen,kvm}_lock_spinning global and visible
  x86: Use inline assembler instead of global register variable to get sp
  x86, asmlinkage, paravirt: Make paravirt thunks global
  x86, asmlinkage, paravirt: Don't rely on local assembler labels
  x86, asmlinkage, lguest: Fix C functions used by inline assembler
2014-01-30 18:15:32 -08:00
Andi Kleen
07ba06d9d2 x86, asmlinkage, xen: Fix type of NMI
LTO requires consistent types of symbols over all files.

So "nmi" cannot be declared as a char [] here, need to use the
correct function type.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1382458079-24450-8-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-29 22:17:18 -08:00
Andi Kleen
dd41f818e5 x86, asmlinkage, xen, kvm: Make {xen,kvm}_lock_spinning global and visible
These functions are called from inline assembler stubs, thus
need to be global and visible.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1382458079-24450-7-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-29 22:17:18 -08:00
Andi Kleen
a2e7f0e3a4 x86, asmlinkage, paravirt: Make paravirt thunks global
The paravirt thunks use a hack of using a static reference to a static
function to reference that function from the top level statement.

This assumes that gcc always generates static function names in a specific
format, which is not necessarily true.

Simply make these functions global and asmlinkage or __visible. This way the
static __used variables are not needed and everything works.

Functions with arguments are __visible to keep the register calling
convention on 32bit.

Changed in paravirt and in all users (Xen and vsmp)

v2: Use __visible for functions with arguments

Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Ido Yariv <ido@wizery.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1382458079-24450-5-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-01-29 22:17:17 -08:00
Roger Pau Monne
c9f6e9977e xen/pvh: Set X86_CR0_WP and others in CR0 (v2)
otherwise we will get for some user-space applications
that use 'clone' with CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
end up hitting an assert in glibc manifested by:

general protection ip:7f80720d364c sp:7fff98fd8a80 error:0 in
libc-2.13.so[7f807209e000+180000]

This is due to the nature of said operations which sets and clears
the PID.  "In the successful one I can see that the page table of
the parent process has been updated successfully to use a
different physical page, so the write of the tid on
that page only affects the child...

On the other hand, in the failed case, the write seems to happen before
the copy of the original page is done, so both the parent and the child
end up with the same value (because the parent copies the page after
the write of the child tid has already happened)."
(Roger's analysis). The nature of this is due to the Xen's commit
of 51e2cac257ec8b4080d89f0855c498cbbd76a5e5
"x86/pvh: set only minimal cr0 and cr4 flags in order to use paging"
the CR0_WP was removed so COW features of the Linux kernel were not
operating properly.

While doing that also update the rest of the CR0 flags to be inline
with what a baremetal Linux kernel would set them to.

In 'secondary_startup_64' (baremetal Linux) sets:

X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP |
X86_CR0_AM | X86_CR0_PG

The hypervisor for HVM type guests (which PVH is a bit) sets:
X86_CR0_PE | X86_CR0_ET | X86_CR0_TS
For PVH it specifically sets:
X86_CR0_PG

Which means we need to set the rest: X86_CR0_MP | X86_CR0_NE  |
X86_CR0_WP | X86_CR0_AM to have full parity.

Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Took out the cr4 writes to be a seperate patch]
[v2: 0-DAY kernel found xen_setup_gdt to be missing a static]
2014-01-21 13:26:05 -05:00
Konrad Rzeszutek Wilk
54d44eb3c7 xen/pvh: Use 'depend' instead of 'select'.
The usage of 'select' means it will enable the CONFIG
options without checking their dependencies. That meant
we would inadvertently turn on CONFIG_XEN_PVHM while its
core dependency (CONFIG_PCI) was turned off.

This patch fixes the warnings and compile failures:

warning: (XEN_PVH) selects XEN_PVHVM which has unmet direct
dependencies (HYPERVISOR_GUEST && XEN && PCI && X86_LOCAL_APIC)

Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-10 10:45:35 -05:00
Wei Yongjun
5602aba808 xen/pvh: remove duplicated include from enlighten.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-07 09:59:49 -05:00
Konrad Rzeszutek Wilk
0869a64232 xen/pvh: Fix compile issues with xen_pvh_domain()
Oddly enough it compiles for my ancient compiler but with
the supplied .config it does blow up. Fix is easy enough.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-07 09:59:28 -05:00
Mukesh Rathor
4e903a20da xen/pvh: Support ParaVirtualized Hardware extensions (v3).
PVH allows PV linux guest to utilize hardware extended capabilities,
such as running MMU updates in a HVM container.

The Xen side defines PVH as (from docs/misc/pvh-readme.txt,
with modifications):

"* the guest uses auto translate:
 - p2m is managed by Xen
 - pagetables are owned by the guest
 - mmu_update hypercall not available
* it uses event callback and not vlapic emulation,
* IDT is native, so set_trap_table hcall is also N/A for a PVH guest.

For a full list of hcalls supported for PVH, see pvh_hypercall64_table
in arch/x86/hvm/hvm.c in xen.  From the ABI prespective, it's mostly a
PV guest with auto translate, although it does use hvm_op for setting
callback vector."

Use .ascii and .asciz to define xen feature string. Note, the PVH
string must be in a single line (not multiple lines with \) to keep the
assembler from putting null char after each string before \.
This patch allows it to be configured and enabled.

We also use introduce the 'XEN_ELFNOTE_SUPPORTED_FEATURES' ELF note to
tell the hypervisor that 'hvm_callback_vector' is what the kernel
needs. We can not put it in 'XEN_ELFNOTE_FEATURES' as older hypervisor
parse fields they don't understand as errors and refuse to load
the kernel. This work-around fixes the problem.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:24 -05:00
Konrad Rzeszutek Wilk
6926f6d610 xen/pvh: Piggyback on PVHVM for grant driver (v4)
In PVH the shared grant frame is the PFN and not MFN,
hence its mapped via the same code path as HVM.

The allocation of the grant frame is done differently - we
do not use the early platform-pci driver and have an
ioremap area - instead we use balloon memory and stitch
all of the non-contingous pages in a virtualized area.

That means when we call the hypervisor to replace the GMFN
with a XENMAPSPACE_grant_table type, we need to lookup the
old PFN for every iteration instead of assuming a flat
contingous PFN allocation.

Lastly, we only use v1 for grants. This is because PVHVM
is not able to use v2 due to no XENMEM_add_to_physmap
calls on the error status page (see commit
69e8f430e2
 xen/granttable: Disable grant v2 for HVM domains.)

Until that is implemented this workaround has to
be in place.

Also per suggestions by Stefano utilize the PVHVM paths
as they share common functionality.

v2 of this patch moves most of the PVH code out in the
arch/x86/xen/grant-table driver and touches only minimally
the generic driver.

v3, v4: fixes us some of the code due to earlier patches.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:21 -05:00
Mukesh Rathor
2771374d47 xen/pvh: Piggyback on PVHVM for event channels (v2)
PVH is a PV guest with a twist - there are certain things
that work in it like HVM and some like PV. There is
a similar mode - PVHVM where we run in HVM mode with
PV code enabled - and this patch explores that.

The most notable PV interfaces are the XenBus and event channels.

We will piggyback on how the event channel mechanism is
used in PVHVM - that is we want the normal native IRQ mechanism
and we will install a vector (hvm callback) for which we
will call the event channel mechanism.

This means that from a pvops perspective, we can use
native_irq_ops instead of the Xen PV specific. Albeit in the
future we could support pirq_eoi_map. But that is
a feature request that can be shared with PVHVM.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:15 -05:00
Mukesh Rathor
9103bb0f82 xen/pvh: Update E820 to work with PVH (v2)
In xen_add_extra_mem() we can skip updating P2M as it's managed
by Xen. PVH maps the entire IO space, but only RAM pages need
to be repopulated.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:13 -05:00
Mukesh Rathor
5840c84b16 xen/pvh: Secondary VCPU bringup (non-bootup CPUs)
The VCPU bringup protocol follows the PV with certain twists.
From xen/include/public/arch-x86/xen.h:

Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise
for HVM and PVH guests, not all information in this structure is updated:

 - For HVM guests, the structures read include: fpu_ctxt (if
 VGCT_I387_VALID is set), flags, user_regs, debugreg[*]

 - PVH guests are the same as HVM guests, but additionally use ctrlreg[3] to
 set cr3. All other fields not used should be set to 0.

This is what we do. We piggyback on the 'xen_setup_gdt' - but modify
a bit - we need to call 'load_percpu_segment' so that 'switch_to_new_gdt'
can load per-cpu data-structures. It has no effect on the VCPU0.

We also piggyback on the %rdi register to pass in the CPU number - so
that when we bootup a new CPU, the cpu_bringup_and_idle will have
passed as the first parameter the CPU number (via %rdi for 64-bit).

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-06 10:44:12 -05:00
Mukesh Rathor
8d656bbe43 xen/pvh: Load GDT/GS in early PV bootup code for BSP.
During early bootup we start life using the Xen provided
GDT, which means that we are running with %cs segment set
to FLAT_KERNEL_CS (FLAT_RING3_CS64 0xe033, GDT index 261).

But for PVH we want to be use HVM type mechanism for
segment operations. As such we need to switch to the HVM
one and also reload ourselves with the __KERNEL_CS:eip
to run in the proper GDT and segment.

For HVM this is usually done in 'secondary_startup_64' in
(head_64.S) but since we are not taking that bootup
path (we start in PV - xen_start_kernel) we need to do
that in the early PV bootup paths.

For good measure we also zero out the %fs, %ds, and %es
(not strictly needed as Xen has already cleared them
for us). The %gs is loaded by 'switch_to_new_gdt'.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
2014-01-06 10:44:10 -05:00
Mukesh Rathor
4dd322bc3b xen/pvh: Setup up shared_info.
For PVHVM the shared_info structure is provided via the same way
as for normal PV guests (see include/xen/interface/xen.h).

That is during bootup we get 'xen_start_info' via the %esi register
in startup_xen. Then later we extract the 'shared_info' from said
structure (in xen_setup_shared_info) and start using it.

The 'xen_setup_shared_info' is all setup to work with auto-xlat
guests, but there are two functions which it calls that are not:
xen_setup_mfn_list_list and xen_setup_vcpu_info_placement.
This patch modifies the P2M code (xen_setup_mfn_list_list)
while the "Piggyback on PVHVM for event channels" modifies
the xen_setup_vcpu_info_placement.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-06 10:44:09 -05:00
Mukesh Rathor
76bcceff0b xen/pvh/mmu: Use PV TLB instead of native.
We also optimize one - the TLB flush. The native operation would
needlessly IPI offline VCPUs causing extra wakeups. Using the
Xen one avoids that and lets the hypervisor determine which
VCPU needs the TLB flush.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-06 10:44:07 -05:00
Mukesh Rathor
4e44e44b0b xen/pvh: MMU changes for PVH (v2)
.. which are surprisingly small compared to the amount for PV code.

PVH uses mostly native mmu ops, we leave the generic (native_*) for
the majority and just overwrite the baremetal with the ones we need.

At startup, we are running with pre-allocated page-tables
courtesy of the tool-stack. But we still need to graft them
in the Linux initial pagetables. However there is no need to
unpin/pin and change them to R/O or R/W.

Note that the xen_pagetable_init due to 7836fec9d0994cc9c9150c5a33f0eb0eb08a335a
"xen/mmu/p2m: Refactor the xen_pagetable_init code." does not
need any changes - we just need to make sure that xen_post_allocator_init
does not alter the pvops from the default native one.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:05 -05:00
Konrad Rzeszutek Wilk
b621e157ba xen/mmu: Cleanup xen_pagetable_p2m_copy a bit.
Stefano noticed that the code runs only under 64-bit so
the comments about 32-bit are pointless.

Also we change the condition for xen_revector_p2m_tree
returning the same value (because it could not allocate
a swath of space to put the new P2M in) or it had been
called once already. In such we return early from the
function.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:04 -05:00
Konrad Rzeszutek Wilk
32df75cd14 xen/mmu/p2m: Refactor the xen_pagetable_init code (v2).
The revectoring and copying of the P2M only happens when
!auto-xlat and on 64-bit builds. It is not obvious from
the code, so lets have seperate 32 and 64-bit functions.

We also invert the check for auto-xlat to make the code
flow simpler.

Suggested-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-06 10:44:02 -05:00
Konrad Rzeszutek Wilk
696fd7c5b2 xen/pvh: Don't setup P2M tree.
P2M is not available for PVH. Fortunatly for us the
P2M code already has mostly the support for auto-xlat guest thanks to
commit 3d24bbd7dd
"grant-table: call set_phys_to_machine after mapping grant refs"
which: "
introduces set_phys_to_machine calls for auto_translated guests
(even on x86) in gnttab_map_refs and gnttab_unmap_refs.
translated by swiotlb-xen... " so we don't need to muck much.

with above mentioned "commit you'll get set_phys_to_machine calls
from gnttab_map_refs and gnttab_unmap_refs but PVH guests won't do
anything with them " (Stefano Stabellini) which is OK - we want
them to be NOPs.

This is because we assume that an "IOMMU is always present on the
plaform and Xen is going to make the appropriate IOMMU pagetable
changes in the hypercall implementation of GNTTABOP_map_grant_ref
and GNTTABOP_unmap_grant_ref, then eveything should be transparent
from PVH priviligied point of view and DMA transfers involving
foreign pages keep working with no issues[sp]

Otherwise we would need a P2M (and an M2P) for PVH priviligied to
track these foreign pages .. (see arch/arm/xen/p2m.c)."
(Stefano Stabellini).

We still have to inhibit the building of the P2M tree.
That had been done in the past by not calling
xen_build_dynamic_phys_to_machine (which setups the P2M tree
and gives us virtual address to access them). But we are missing
a check for xen_build_mfn_list_list - which was continuing to setup
the P2M tree and would blow up at trying to get the virtual
address of p2m_missing (which would have been setup by
xen_build_dynamic_phys_to_machine).

Hence a check is needed to not call xen_build_mfn_list_list when
running in auto-xlat mode.

Instead of replicating the check for auto-xlat in enlighten.c
do it in the p2m.c code. The reason is that the xen_build_mfn_list_list
is called also in xen_arch_post_suspend without any checks for
auto-xlat. So for PVH or PV with auto-xlat - we would needlessly
allocate space for an P2M tree.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:44:01 -05:00
Mukesh Rathor
d285d68314 xen/pvh: Early bootup changes in PV code (v4).
We don't use the filtering that 'xen_cpuid' is doing
because the hypervisor treats 'XEN_EMULATE_PREFIX' as
an invalid instruction. This means that all of the filtering
will have to be done in the hypervisor/toolstack.

Without the filtering we expose to the guest the:

 - cpu topology (sockets, cores, etc);
 - the APERF (which the generic scheduler likes to
    use), see  5e62625420
    "xen/setup: filter APERFMPERF cpuid feature out"
 - and the inability to figure out whether MWAIT_LEAF
   should be exposed or not. See
   df88b2d96e
   "xen/enlighten: Disable MWAIT_LEAF so that acpi-pad won't be loaded."
 - x2apic, see  4ea9b9aca9
   "xen: mask x2APIC feature in PV"

We also check for vector callback early on, as it is a required
feature. PVH also runs at default kernel IOPL.

Finally, pure PV settings are moved to a separate function that are
only called for pure PV, ie, pv with pvmmu. They are also #ifdef
with CONFIG_XEN_PVMMU.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2014-01-06 10:43:59 -05:00
Mukesh Rathor
ddc416cbc4 xen/pvh/x86: Define what an PVH guest is (v3).
Which is a PV guest with auto page translation enabled
and with vector callback. It is a cross between PVHVM and PV.

The Xen side defines PVH as (from docs/misc/pvh-readme.txt,
with modifications):

"* the guest uses auto translate:
 - p2m is managed by Xen
 - pagetables are owned by the guest
 - mmu_update hypercall not available
* it uses event callback and not vlapic emulation,
* IDT is native, so set_trap_table hcall is also N/A for a PVH guest.

For a full list of hcalls supported for PVH, see pvh_hypercall64_table
in arch/x86/hvm/hvm.c in xen.  From the ABI prespective, it's mostly a
PV guest with auto translate, although it does use hvm_op for setting
callback vector."

Also we use the PV cpuid, albeit we can use the HVM (native) cpuid.
However, we do have a fair bit of filtering in the xen_cpuid and
we can piggyback on that until the hypervisor/toolstack filters
the appropiate cpuids. Once that is done we can swap over to
use the native one.

We setup a Kconfig entry that is disabled by default and
cannot be enabled.

Note that on ARM the concept of PVH is non-existent. As Ian
put it: "an ARM guest is neither PV nor HVM nor PVHVM.
It's a bit like PVH but is different also (it's further towards
the H end of the spectrum than even PVH).". As such these
options (PVHVM, PVH) are never enabled nor seen on ARM
compilations.

Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-06 10:43:58 -05:00
David Vrabel
8785c67663 xen/x86: set VIRQ_TIMER priority to maximum
Commit bee980d9e (xen/events: Handle VIRQ_TIMER before any other hardirq
in event loop) effectively made the VIRQ_TIMER the highest priority event
when using the 2-level ABI.

Set the VIRQ_TIMER priority to the highest so this behaviour is retained
when using the FIFO-based ABI.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2014-01-06 10:07:55 -05:00
Konrad Rzeszutek Wilk
6f6c15ef91 xen/pvhvm: Remove the xen_platform_pci int.
Since we have  xen_has_pv_devices,xen_has_pv_disk_devices,
xen_has_pv_nic_devices, and xen_has_pv_and_legacy_disk_devices
to figure out the different 'unplug' behaviors - lets
use those instead of this single int.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-01-03 14:54:53 -05:00
Konrad Rzeszutek Wilk
51c71a3bba xen/pvhvm: If xen_platform_pci=0 is set don't blow up (v4).
The user has the option of disabling the platform driver:
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

which is used to unplug the emulated drivers (IDE, Realtek 8169, etc)
and allow the PV drivers to take over. If the user wishes
to disable that they can set:

  xen_platform_pci=0
  (in the guest config file)

or
  xen_emul_unplug=never
  (on the Linux command line)

except it does not work properly. The PV drivers still try to
load and since the Xen platform driver is not run - and it
has not initialized the grant tables, most of the PV drivers
stumble upon:

input: Xen Virtual Keyboard as /devices/virtual/input/input5
input: Xen Virtual Pointer as /devices/virtual/input/input6M
------------[ cut here ]------------
kernel BUG at /home/konrad/ssd/konrad/linux/drivers/xen/grant-table.c:1206!
invalid opcode: 0000 [#1] SMP
Modules linked in: xen_kbdfront(+) xenfs xen_privcmd
CPU: 6 PID: 1389 Comm: modprobe Not tainted 3.13.0-rc1upstream-00021-ga6c892b-dirty #1
Hardware name: Xen HVM domU, BIOS 4.4-unstable 11/26/2013
RIP: 0010:[<ffffffff813ddc40>]  [<ffffffff813ddc40>] get_free_entries+0x2e0/0x300
Call Trace:
 [<ffffffff8150d9a3>] ? evdev_connect+0x1e3/0x240
 [<ffffffff813ddd0e>] gnttab_grant_foreign_access+0x2e/0x70
 [<ffffffffa0010081>] xenkbd_connect_backend+0x41/0x290 [xen_kbdfront]
 [<ffffffffa0010a12>] xenkbd_probe+0x2f2/0x324 [xen_kbdfront]
 [<ffffffff813e5757>] xenbus_dev_probe+0x77/0x130
 [<ffffffff813e7217>] xenbus_frontend_dev_probe+0x47/0x50
 [<ffffffff8145e9a9>] driver_probe_device+0x89/0x230
 [<ffffffff8145ebeb>] __driver_attach+0x9b/0xa0
 [<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
 [<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
 [<ffffffff8145cf1c>] bus_for_each_dev+0x8c/0xb0
 [<ffffffff8145e7d9>] driver_attach+0x19/0x20
 [<ffffffff8145e260>] bus_add_driver+0x1a0/0x220
 [<ffffffff8145f1ff>] driver_register+0x5f/0xf0
 [<ffffffff813e55c5>] xenbus_register_driver_common+0x15/0x20
 [<ffffffff813e76b3>] xenbus_register_frontend+0x23/0x40
 [<ffffffffa0015000>] ? 0xffffffffa0014fff
 [<ffffffffa001502b>] xenkbd_init+0x2b/0x1000 [xen_kbdfront]
 [<ffffffff81002049>] do_one_initcall+0x49/0x170

.. snip..

which is hardly nice. This patch fixes this by having each
PV driver check for:
 - if running in PV, then it is fine to execute (as that is their
   native environment).
 - if running in HVM, check if user wanted 'xen_emul_unplug=never',
   in which case bail out and don't load any PV drivers.
 - if running in HVM, and if PCI device 5853:0001 (xen_platform_pci)
   does not exist, then bail out and not load PV drivers.
 - (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=ide-disks',
   then bail out for all PV devices _except_ the block one.
   Ditto for the network one ('nics').
 - (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=unnecessary'
   then load block PV driver, and also setup the legacy IDE paths.
   In (v3) make it actually load PV drivers.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it
Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Reported-and-Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Add extra logic to handle the myrid ways 'xen_emul_unplug'
can be used per Ian and Stefano suggestion]
[v3: Make the unnecessary case work properly]
[v4: s/disks/ide-disks/ spotted by Fabio]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [for PCI parts]
CC: stable@vger.kernel.org
2014-01-03 14:54:18 -05:00
Linus Torvalds
eda670c626 Features:
- SWIOTLB has tracing added when doing bounce buffer.
  - Xen ARM/ARM64 can use Xen-SWIOTLB. This work allows Linux to
    safely program real devices for DMA operations when running as
    a guest on Xen on ARM, without IOMMU support.*1
  - xen_raw_printk works with PVHVM guests if needed.
 Bug-fixes:
  - Make memory ballooning work under HVM with large MMIO region.
  - Inform hypervisor of MCFG regions found in ACPI DSDT.
  - Remove deprecated IRQF_DISABLED.
  - Remove deprecated __cpuinit.
 
 [*1]:
 "On arm and arm64 all Xen guests, including dom0, run with second stage
 translation enabled. As a consequence when dom0 programs a device for a
 DMA operation is going to use (pseudo) physical addresses instead
 machine addresses. This work introduces two trees to track physical to
 machine and machine to physical mappings of foreign pages. Local pages
 are assumed mapped 1:1 (physical address == machine address).  It
 enables the SWIOTLB-Xen driver on ARM and ARM64, so that Linux can
 translate physical addresses to machine addresses for dma operations
 when necessary. " (Stefano).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQEcBAABAgAGBQJSgS86AAoJEFjIrFwIi8fJpY4H/R2gke1A1p9UvTwbkaDhgPs/
 u/mkI6aH+ktgvu5QZNprki660uydtc4Ck7y8leeLGYw+ed1Ys559SJhRc/x8jBYZ
 Hh2chnplld0LAjSpdIDTTePArE1xBo4Gz+fT0zc5cVh0leJwOXn92Kx8N5AWD/T3
 gwH4Ok4K1dzZBIls7imM2AM/L1xcApcx3Dl/QpNcoePQtR4yLuPWMUbb3LM8pbUY
 0B6ZVN4GOhtJ84z8HRKnh4uMnBYmhmky6laTlHVa6L+j1fv7aAPCdNbePjIt/Pvj
 HVYB1O/ht73yHw0zGfK6lhoGG8zlu+Q7sgiut9UsGZZfh34+BRKzNTypqJ3ezQo=
 =xc43
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.13-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen updates from Konrad Rzeszutek Wilk:
 "This has tons of fixes and two major features which are concentrated
  around the Xen SWIOTLB library.

  The short <blurb> is that the tracing facility (just one function) has
  been added to SWIOTLB to make it easier to track I/O progress.
  Additionally under Xen and ARM (32 & 64) the Xen-SWIOTLB driver
  "is used to translate physical to machine and machine to physical
  addresses of foreign[guest] pages for DMA operations" (Stefano) when
  booting under hardware without proper IOMMU.

  There are also bug-fixes, cleanups, compile warning fixes, etc.

  The commit times for some of the commits is a bit fresh - that is b/c
  we wanted to make sure we have the Ack's from the ARM folks - which
  with the string of back-to-back conferences took a bit of time.  Rest
  assured - the code has been stewing in #linux-next for some time.

  Features:
   - SWIOTLB has tracing added when doing bounce buffer.
   - Xen ARM/ARM64 can use Xen-SWIOTLB.  This work allows Linux to
     safely program real devices for DMA operations when running as a
     guest on Xen on ARM, without IOMMU support. [*1]
   - xen_raw_printk works with PVHVM guests if needed.

  Bug-fixes:
   - Make memory ballooning work under HVM with large MMIO region.
   - Inform hypervisor of MCFG regions found in ACPI DSDT.
   - Remove deprecated IRQF_DISABLED.
   - Remove deprecated __cpuinit.

  [*1]:
  "On arm and arm64 all Xen guests, including dom0, run with second
   stage translation enabled.  As a consequence when dom0 programs a
   device for a DMA operation is going to use (pseudo) physical
   addresses instead machine addresses.  This work introduces two trees
   to track physical to machine and machine to physical mappings of
   foreign pages.  Local pages are assumed mapped 1:1 (physical address
   == machine address).  It enables the SWIOTLB-Xen driver on ARM and
   ARM64, so that Linux can translate physical addresses to machine
   addresses for dma operations when necessary.  " (Stefano)"

* tag 'stable/for-linus-3.13-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (32 commits)
  xen/arm: pfn_to_mfn and mfn_to_pfn return the argument if nothing is in the p2m
  arm,arm64/include/asm/io.h: define struct bio_vec
  swiotlb-xen: missing include dma-direction.h
  pci-swiotlb-xen: call pci_request_acs only ifdef CONFIG_PCI
  arm: make SWIOTLB available
  xen: delete new instances of added __cpuinit
  xen/balloon: Set balloon's initial state to number of existing RAM pages
  xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved for MCFG areas.
  xen: remove deprecated IRQF_DISABLED
  x86/xen: remove deprecated IRQF_DISABLED
  swiotlb-xen: fix error code returned by xen_swiotlb_map_sg_attrs
  swiotlb-xen: static inline xen_phys_to_bus, xen_bus_to_phys, xen_virt_to_bus and range_straddles_page_boundary
  grant-table: call set_phys_to_machine after mapping grant refs
  arm,arm64: do not always merge biovec if we are running on Xen
  swiotlb: print a warning when the swiotlb is full
  swiotlb-xen: use xen_dma_map/unmap_page, xen_dma_sync_single_for_cpu/device
  xen: introduce xen_dma_map/unmap_page and xen_dma_sync_single_for_cpu/device
  tracing/events: Fix swiotlb tracepoint creation
  swiotlb-xen: use xen_alloc/free_coherent_pages
  xen: introduce xen_alloc/free_coherent_pages
  ...
2013-11-15 13:34:37 +09:00
Kirill A. Shutemov
49076ec2cc mm: dynamically allocate page->ptl if it cannot be embedded to struct page
If split page table lock is in use, we embed the lock into struct page
of table's page.  We have to disable split lock, if spinlock_t is too
big be to be embedded, like when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC
enabled.

This patch add support for dynamic allocation of split page table lock
if we can't embed it to struct page.

page->ptl is unsigned long now and we use it as spinlock_t if
sizeof(spinlock_t) <= sizeof(long), otherwise it's pointer to spinlock_t.

The spinlock_t allocated in pgtable_page_ctor() for PTE table and in
pgtable_pmd_page_ctor() for PMD table.  All other helpers converted to
support dynamically allocated page->ptl.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Kirill A. Shutemov
57c1ffcefb mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS
We're going to introduce split page table lock for PMD level.  Let's
rename existing split ptlock for PTE level to avoid confusion.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Konrad Rzeszutek Wilk
e1d8f62ad4 Merge remote-tracking branch 'stefano/swiotlb-xen-9.1' into stable/for-linus-3.13
* stefano/swiotlb-xen-9.1:
  swiotlb-xen: fix error code returned by xen_swiotlb_map_sg_attrs
  swiotlb-xen: static inline xen_phys_to_bus, xen_bus_to_phys, xen_virt_to_bus and range_straddles_page_boundary
  grant-table: call set_phys_to_machine after mapping grant refs
  arm,arm64: do not always merge biovec if we are running on Xen
  swiotlb: print a warning when the swiotlb is full
  swiotlb-xen: use xen_dma_map/unmap_page, xen_dma_sync_single_for_cpu/device
  xen: introduce xen_dma_map/unmap_page and xen_dma_sync_single_for_cpu/device
  swiotlb-xen: use xen_alloc/free_coherent_pages
  xen: introduce xen_alloc/free_coherent_pages
  arm64/xen: get_dma_ops: return xen_dma_ops if we are running as xen_initial_domain
  arm/xen: get_dma_ops: return xen_dma_ops if we are running as xen_initial_domain
  swiotlb-xen: introduce xen_swiotlb_set_dma_mask
  xen/arm,arm64: enable SWIOTLB_XEN
  xen: make xen_create_contiguous_region return the dma address
  xen/x86: allow __set_phys_to_machine for autotranslate guests
  arm/xen,arm64/xen: introduce p2m
  arm64: define DMA_ERROR_CODE
  arm: make SWIOTLB available

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Conflicts:
	arch/arm/include/asm/dma-mapping.h
	drivers/xen/swiotlb-xen.c

[Conflicts arose b/c "arm: make SWIOTLB available" v8 was in Stefano's
branch, while I had v9 + Ack from Russel. I also fixed up white-space
issues]
2013-11-08 16:10:48 -05:00
Stefano Stabellini
92c0fd17c0 pci-swiotlb-xen: call pci_request_acs only ifdef CONFIG_PCI
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-11-08 15:21:44 -05:00
Paul Gortmaker
3b284bde70 xen: delete new instances of added __cpuinit
commit 6efa20e49b
("xen: Support 64-bit PV guest receiving NMIs") and
commit cd9151e26d
( "xen/balloon: set a mapping for ballooned out pages")
added new instances of __cpuinit usage.

We removed this a couple versions ago; we now want to remove
the compat no-op stubs.  Introducing new users is not what
we want to see at this point in time, as it will break once
the stubs are gone.

Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-11-08 15:13:16 -05:00
Michael Opdenacker
9d71cee667 x86/xen: remove deprecated IRQF_DISABLED
This patch proposes to remove the IRQF_DISABLED flag from x86/xen
code. It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-11-06 15:31:01 -05:00
Frediano Ziglio
7cde9b27e7 xen: Fix possible user space selector corruption
Due to the way kernel is initialized under Xen is possible that the
ring1 selector used by the kernel for the boot cpu end up to be copied
to userspace leading to segmentation fault in the userspace.

Xen code in the kernel initialize no-boot cpus with correct selectors (ds
and es set to __USER_DS) but the boot one keep the ring1 (passed by Xen).
On task context switch (switch_to) we assume that ds, es and cs already
point to __USER_DS and __KERNEL_CSso these selector are not changed.

If processor is an Intel that support sysenter instruction sysenter/sysexit
is used so ds and es are not restored switching back from kernel to
userspace. In the case the selectors point to a ring1 instead of __USER_DS
the userspace code will crash on first memory access attempt (to be
precise Xen on the emulated iret used to do sysexit will detect and set ds
and es to zero which lead to GPF anyway).

Now if an userspace process call kernel using sysenter and get rescheduled
(for me it happen on a specific init calling wait4) could happen that the
ring1 selector is set to ds and es.

This is quite hard to detect cause after a while these selectors are fixed
(__USER_DS seems sticky).

Bisecting the code commit 7076aada10 appears
to be the first one that have this issue.

Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
2013-10-10 14:39:37 +00:00
Stefano Stabellini
1b65c4e5a9 swiotlb-xen: use xen_alloc/free_coherent_pages
Use xen_alloc_coherent_pages and xen_free_coherent_pages to allocate or
free coherent pages.

We need to be careful handling the pointer returned by
xen_alloc_coherent_pages, because on ARM the pointer is not equal to
phys_to_virt(*dma_handle). In fact virt_to_phys only works for kernel
direct mapped RAM memory.
In ARM case the pointer could be an ioremap address, therefore passing
it to virt_to_phys would give you another physical address that doesn't
correspond to it.

Make xen_create_contiguous_region take a phys_addr_t as start parameter to
avoid the virt_to_phys calls which would be incorrect.

Changes in v6:
- remove extra spaces.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-10-10 13:41:10 +00:00
Stefano Stabellini
69908907b0 xen: make xen_create_contiguous_region return the dma address
Modify xen_create_contiguous_region to return the dma address of the
newly contiguous buffer.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>


Changes in v4:
- use virt_to_machine instead of virt_to_bus.
2013-10-09 16:56:32 +00:00
Stefano Stabellini
2f558d4091 xen/x86: allow __set_phys_to_machine for autotranslate guests
Allow __set_phys_to_machine to be called for autotranslate guests.
It can be used to keep track of phys_to_machine changes, however we
don't do anything with the information at the moment.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2013-10-09 20:39:01 +00:00
Konrad Rzeszutek Wilk
b1922a519e xen/mmu: Correct PAT MST setting.
Jan Beulich spotted that the PAT MSR settings in the Xen public
document that "the first (PAT6) column was wrong across the
board, and the column for PAT7 was missing altogether."

This updates it to be in sync.

CC: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2013-09-27 09:04:45 -04:00
Linus Torvalds
4b97280675 Bug-fixes:
- Fix PV spinlocks triggering jump_label code bug
  - Remove extraneous code in the tpm front driver
  - Fix ballooning out of pages when non-preemptible
  - Fix deadlock when using a 32-bit initial domain with large amount of memory.
  - Add xen_nopvpsin parameter to the documentation
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSQvzCAAoJEFjIrFwIi8fJyCIIAMENABapdLhrOiRdQ1Y7T5v1
 4bogPDLwpVxHzwo/vnHcNpl35/dUZrC6wQa51Bkoqq0V8o1XmjFy3SY/EBGjEAvw
 hh4qxGY0p0NNi6hKrWC8mH9u2TcluZGm1uecabkXUhl9mrAB5oBsfJdbBZ5N69gO
 QXXt0j7Xwv1APwH86T0e1Lz+lulhdw2ItXP4osYkEbRYNSaaGnuwsd0Jxcb4DeMk
 qhKgP7QMn3C7zDDaapJo1axeYQRBNEtv5M8+0wwMleX4yX1+IBRZeQTsRfMr7RB/
 8FhssWiH15xU6Gmzgi/VR8xhTEIbQh5GWsVReGf6pqIYSxGSYTvvyhm0bVRH4JI=
 =c+7u
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen fixes from Konrad Rzeszutek Wilk:
 "Bug-fixes and one update to the kernel-paramters.txt documentation.

   - Fix PV spinlocks triggering jump_label code bug
   - Remove extraneous code in the tpm front driver
   - Fix ballooning out of pages when non-preemptible
   - Fix deadlock when using a 32-bit initial domain with large amount
     of memory
   - Add xen_nopvpsin parameter to the documentation"

* tag 'stable/for-linus-3.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/spinlock: Document the xen_nopvspin parameter.
  xen/p2m: check MFN is in range before using the m2p table
  xen/balloon: don't alloc page while non-preemptible
  xen: Do not enable spinlocks before jump_label_init() has executed
  tpm: xen-tpmfront: Remove the locality sysfs attribute
  tpm: xen-tpmfront: Fix default durations
2013-09-25 15:50:53 -07:00
David Vrabel
0160676bba xen/p2m: check MFN is in range before using the m2p table
On hosts with more than 168 GB of memory, a 32-bit guest may attempt
to grant map an MFN that is error cannot lookup in its mapping of the
m2p table.  There is an m2p lookup as part of m2p_add_override() and
m2p_remove_override().  The lookup falls off the end of the mapped
portion of the m2p and (because the mapping is at the highest virtual
address) wraps around and the lookup causes a fault on what appears to
be a user space address.

do_page_fault() (thinking it's a fault to a userspace address), tries
to lock mm->mmap_sem.  If the gntdev device is used for the grant map,
m2p_add_override() is called from from gnttab_mmap() with mm->mmap_sem
already locked.  do_page_fault() then deadlocks.

The deadlock would most commonly occur when a 64-bit guest is started
and xenconsoled attempts to grant map its console ring.

Introduce mfn_to_pfn_no_overrides() which checks the MFN is within the
mapped portion of the m2p table before accessing the table and use
this in m2p_add_override(), m2p_remove_override(), and mfn_to_pfn()
(which already had the correct range check).

All faults caused by accessing the non-existant parts of the m2p are
thus within the kernel address space and exception_fixup() is called
without trying to lock mm->mmap_sem.

This means that for MFNs that are outside the mapped range of the m2p
then mfn_to_pfn() will always look in the m2p overrides.  This is
correct because it must be a foreign MFN (and the PFN in the m2p in
this case is only relevant for the other domain).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
--
v3: check for auto_translated_physmap in mfn_to_pfn_no_overrides()
v2: in mfn_to_pfn() look in m2p_overrides if the MFN is out of
    range as it's probably foreign.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2013-09-25 09:00:03 -04:00
Konrad Rzeszutek Wilk
a945928ea2 xen: Do not enable spinlocks before jump_label_init() has executed
xen_init_spinlocks() currently calls static_key_slow_inc() before
jump_label_init() is invoked. When CONFIG_JUMP_LABEL is set (which usually is
the case) the effect of this static_key_slow_inc() is deferred until after
jump_label_init(). This is different from when CONFIG_JUMP_LABEL is not set, in
which case the key is set immediately. Thus, depending on the value of config
option, we may observe different behavior.

In addition, when we come to __jump_label_transform() from jump_label_init(),
the key (paravirt_ticketlocks_enabled) is already enabled. On processors where
ideal_nop is not the same as default_nop this will cause a BUG() since it is
expected that before a key is enabled the latter is replaced by the former
during initialization.

To address this problem we need to move
static_key_slow_inc(&paravirt_ticketlocks_enabled) so that it is called
after jump_label_init(). We also need to make sure that this is done before
other cpus start to boot. early_initcall appears to be  a good place to do so.
(Note that we cannot move whole xen_init_spinlocks() there since pv_lock_ops
need to be set before alternative_instructions() runs.)

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Added extra comments in the code]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
2013-09-24 16:22:26 -04:00
Linus Torvalds
a60d4b9874 Bug-fixes:
- Boot on ARM without using Xen unconditionally
  - On Xen ARM don't run cpuidle/cpufreq
  - Fix regression in balloon driver, preempt count warnings
  - Fixes to make PVHVM able to use pv ticketlock.
  - Revert Xen PVHVM disabling pv ticketlock (aka, re-enable pv ticketlocks)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSLhcPAAoJEFjIrFwIi8fJGq8IAIxI9zcnY9N6eaD3DepdZlz9
 AMT8/k7afau1rDMk5r3HaUjAkdeEvCgeWw8W6tJ+OK19AmFTVEvoO803MSzYkDol
 6XoknSoU9UnE+/w4FF1FttWmRxkZ8Op/hcs9435q7o+L0zlk9CbbkxFlzUKf5yVD
 KfvQED4D/ShmPj2f+jYLCtsIi1m/AJ36BsfaUtJo3QVKvJIFFbT6F1AJ4tlbmvC0
 FOaHYl9cTlPXfrpwviIP0+W8RVmcWreLqSOKsdHuWzB//MSvZDVmLGc7JPorblfe
 cuME/tF/Y5bnxHKp8Es2MczpdvS6yp/HNoe0g6AdLVPL7dvGPqKpf2uZZNXUawQ=
 =qpSg
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.12-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
 "This pull I usually do after rc1 is out but because we have a nice
  amount of fixes, some bootup related fixes for ARM, and it is early in
  the cycle we figured to do it now to help with tracking of potential
  regressions.

  The simple ones are the ARM ones - one of the patches fell through the
  cracks, other fixes a bootup issue (unconditionally using Xen
  functions).  Then a fix for a regression causing preempt count being
  off (patch causing this went in v3.12).

  Lastly are the fixes to make Xen PVHVM guests use PV ticketlocks (Xen
  PV already does).

  The enablement of that was supposed to be part of the x86 spinlock
  merge in commit 816434ec4a ("The biggest change here are
  paravirtualized ticket spinlocks (PV spinlocks), which bring a nice
  speedup on various benchmarks...") but unfortunatly it would cause
  hang when booting Xen PVHVM guests.  Yours truly got all of the bugs
  fixed last week and they (six of them) are included in this pull.

  Bug-fixes:
   - Boot on ARM without using Xen unconditionally
   - On Xen ARM don't run cpuidle/cpufreq
   - Fix regression in balloon driver, preempt count warnings
   - Fixes to make PVHVM able to use pv ticketlock.
   - Revert Xen PVHVM disabling pv ticketlock (aka, re-enable pv ticketlocks)"

* tag 'stable/for-linus-3.12-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/spinlock: Don't use __initdate for xen_pv_spin
  Revert "xen/spinlock: Disable IRQ spinlock (PV) allocation on PVHVM"
  xen/spinlock: Don't setup xen spinlock IPI kicker if disabled.
  xen/smp: Update pv_lock_ops functions before alternative code starts under PVHVM
  xen/spinlock: We don't need the old structure anymore
  xen/spinlock: Fix locking path engaging too soon under PVHVM.
  xen/arm: disable cpuidle and cpufreq when linux is running as dom0
  xen/p2m: Don't call get_balloon_scratch_page() twice, keep interrupts disabled for multicalls
  ARM: xen: only set pm function ptrs for Xen guests
2013-09-10 20:07:04 -07:00
Konrad Rzeszutek Wilk
c3b7cb1fd8 xen/spinlock: Don't use __initdate for xen_pv_spin
As we get compile warnings about .init.data being
used by non-init functions.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-09-09 13:08:49 -04:00
Konrad Rzeszutek Wilk
fb78e58c27 Revert "xen/spinlock: Disable IRQ spinlock (PV) allocation on PVHVM"
This reverts commit 70dd4998cb.

Now that the bugs have been resolved we can re-enable the
PV ticketlock implementation under PVHVM Xen guests.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
2013-09-09 12:06:45 -04:00
Konrad Rzeszutek Wilk
3310bbedac xen/spinlock: Don't setup xen spinlock IPI kicker if disabled.
There is no need to setup this kicker IPI if we are never going
to use the paravirtualized ticketlock mechanism.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
2013-09-09 12:06:38 -04:00
Konrad Rzeszutek Wilk
26a7999527 xen/smp: Update pv_lock_ops functions before alternative code starts under PVHVM
Before this patch we would patch all of the pv_lock_ops sites
using alternative assembler. Then later in the bootup cycle
change the unlock_kick and lock_spinning to the Xen specific -
without re patching.

That meant that for the core of the kernel we would be running
with the baremetal version of unlock_kick and lock_spinning while
for modules we would have the proper Xen specific slowpaths.

As most of the module uses some API from the core kernel that ended
up with slowpath lockers waiting forever to be kicked (b/c they
would be using the Xen specific slowpath logic). And the
kick never came b/c the unlock path that was taken was the
baremetal one.

On PV we do not have the problem as we initialise before the
alternative code kicks in.

The fix is to make the updating of the pv_lock_ops function
be done before the alternative code starts patching.

Note that this patch fixes issues discovered by commit
f10cd522c5.
("xen: disable PV spinlocks on HVM") wherein it mentioned

   PV spinlocks cannot possibly work with the current code because they are
   enabled after pvops patching has already been done, and because PV
   spinlocks use a different data structure than native spinlocks so we
   cannot switch between them dynamically.

The first problem is solved by this patch.

The second problem has been solved by commit
816434ec4a
(Merge branch 'x86-spinlocks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip)

P.S.
There is still the commit 70dd4998cb
(xen/spinlock: Disable IRQ spinlock (PV) allocation on PVHVM) to
revert but that can be done later after all other bugs have been
fixed.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
2013-09-09 12:06:31 -04:00
Konrad Rzeszutek Wilk
6055aaf87d xen/spinlock: We don't need the old structure anymore
As we are using the generic ticketlock structs and these
old structures are not needed anymore.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
2013-09-09 12:06:24 -04:00
Konrad Rzeszutek Wilk
1fb3a8b2cf xen/spinlock: Fix locking path engaging too soon under PVHVM.
The xen_lock_spinning has a check for the kicker interrupts
and if it is not initialized it will spin normally (not enter
the slowpath).

But for PVHVM case we would initialize the kicker interrupt
before the CPU came online. This meant that if the booting
CPU used a spinlock and went in the slowpath - it would
enter the slowpath and block forever. The forever part because
during bootup: the spinlock would be taken _before_ the CPU
sets itself to be online (more on this further), and we enter
to poll on the event channel forever.

The bootup CPU (see commit fc78d343fa
"xen/smp: initialize IPI vectors before marking CPU online"
for details) and the CPU that started the bootup consult
the cpu_online_mask to determine whether the booting CPU should
get an IPI. The booting CPU has to set itself in this mask via:

  set_cpu_online(smp_processor_id(), true);

However, if the spinlock is taken before this (and it is) and
it polls on an event channel - it will never be woken up as
the kernel will never send an IPI to an offline CPU.

Note that the PVHVM logic in sending IPIs is using the HVM
path which has numerous checks using the cpu_online_mask
and cpu_active_mask. See above mention git commit for details.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
2013-09-09 12:06:16 -04:00
Konrad Rzeszutek Wilk
65320fceda Linux 3.11-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJSGqS5AAoJEHm+PkMAQRiGFxEH/3VrqF6WAkcviNiW/0DCdO8k
 v6Wi7Sp5LxVkwzmOCHCV1tTHwLRlH3cB9YmJlGQ0kHCREaAuEQAB0xJXIW7dnyYj
 Qq7KoRZEMe3wizmjEsj8qsrhfMLzHjBw67hBz2znwW/4P7YdgzwD7KRiEat+yRC9
 ON3nNL2zIqpfk92RXvVrSVl4KMEM+WNbOfiffgBiEP24Ja1MJMFH1d4i6hNOaB0x
 9Pb3Lw8let92x+8Ao5jnjKdKMgVsoZWbN/TgQR8zZOHM38AGGiDgk18vMz+L+hpS
 jqfjckxj1m30jGq0qZ9ZbMZx3IGif4KccVr30MqNHJpwi6Q24qXvT3YfA3HkstM=
 =nAab
 -----END PGP SIGNATURE-----

Merge tag 'v3.11-rc7' into stable/for-linus-3.12

Linux 3.11-rc7

As we need the git commit 28817e9de4f039a1a8c1fe1df2fa2df524626b9e
Author: Chuck Anderson <chuck.anderson@oracle.com>
Date:   Tue Aug 6 15:12:19 2013 -0700

    xen/smp: initialize IPI vectors before marking CPU online

* tag 'v3.11-rc7': (443 commits)
  Linux 3.11-rc7
  ARC: [lib] strchr breakage in Big-endian configuration
  VFS: collect_mounts() should return an ERR_PTR
  bfs: iget_locked() doesn't return an ERR_PTR
  efs: iget_locked() doesn't return an ERR_PTR()
  proc: kill the extra proc_readfd_common()->dir_emit_dots()
  cope with potentially long ->d_dname() output for shmem/hugetlb
  usb: phy: fix build breakage
  USB: OHCI: add missing PCI PM callbacks to ohci-pci.c
  staging: comedi: bug-fix NULL pointer dereference on failed attach
  lib/lz4: correct the LZ4 license
  memcg: get rid of swapaccount leftovers
  nilfs2: fix issue with counting number of bio requests for BIO_EOPNOTSUPP error detection
  nilfs2: remove double bio_put() in nilfs_end_bio_write() for BIO_EOPNOTSUPP error
  drivers/platform/olpc/olpc-ec.c: initialise earlier
  ipv4: expose IPV4_DEVCONF
  ipv6: handle Redirect ICMP Message with no Redirected Header option
  be2net: fix disabling TX in be_close()
  Revert "ACPI / video: Always call acpi_video_init_brightness() on init"
  Revert "genetlink: fix family dump race"
  ...

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-09-09 12:05:37 -04:00
Konrad Rzeszutek Wilk
c3f31f6a6f Merge branch 'x86/spinlocks' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into stable/for-linus-3.12
* 'x86/spinlocks' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kvm/guest: Fix sparse warning: "symbol 'klock_waiting' was not declared as static"
  kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor
  kvm guest: Add configuration support to enable debug information for KVM Guests
  kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi
  xen, pvticketlock: Allow interrupts to be enabled while blocking
  x86, ticketlock: Add slowpath logic
  jump_label: Split jumplabel ratelimit
  x86, pvticketlock: When paravirtualizing ticket locks, increment by 2
  x86, pvticketlock: Use callee-save for lock_spinning
  xen, pvticketlocks: Add xen_nopvspin parameter to disable xen pv ticketlocks
  xen, pvticketlock: Xen implementation for PV ticket locks
  xen: Defer spinlock setup until boot CPU setup
  x86, ticketlock: Collapse a layer of functions
  x86, ticketlock: Don't inline _spin_unlock when using paravirt spinlocks
  x86, spinlock: Replace pv spinlocks with pv ticketlocks
2013-09-09 12:01:15 -04:00
Boris Ostrovsky
d7f8f48d1e xen/p2m: Don't call get_balloon_scratch_page() twice, keep interrupts disabled for multicalls
m2p_remove_override() calls get_balloon_scratch_page() in
MULTI_update_va_mapping() even though it already has pointer to this page from
the earlier call (in scratch_page). This second call doesn't have a matching
put_balloon_scratch_page() thus not restoring preempt count back. (Also, there
is no put_balloon_scratch_page() in the error path.)

In addition, the second multicall uses __xen_mc_entry() which does not disable
interrupts. Rearrange xen_mc_* calls to keep interrupts off while performing
multicalls.

This commit fixes a regression introduced by:

commit ee0726407f
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date:   Tue Jul 23 17:23:54 2013 +0000

    xen/m2p: use GNTTABOP_unmap_and_replace to reinstate the original mapping

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2013-09-09 10:50:52 +00:00
Linus Torvalds
cf39c8e535 Features:
- Xen Trusted Platform Module (TPM) frontend driver - with the backend in MiniOS.
  - Scalability improvements in event channel.
  - Two extra Xen co-maintainers (David, Boris) and one going away (Jeremy)
 Bug-fixes:
  - Make the 1:1 mapping work during early bootup on selective regions.
  - Add scratch page to balloon driver to deal with unexpected code still holding
    on stale pages.
  - Allow NMIs on PV guests (64-bit only)
  - Remove unnecessary TLB flush in M2P code.
  - Fixes duplicate callbacks in Xen granttable code.
  - Fixes in PRIVCMD_MMAPBATCH ioctls to allow retries
  - Fix for events being lost due to rescheduling on different VCPUs.
  - More documentation.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSJgGgAAoJEFjIrFwIi8fJ4asH+gKp0aauPEdHtmn7rLfZUUJ5
 uuvWBiXiVVYMFz81NXlZ1WoAMuDuVA45Eu785uPRb9oUHDi0W8LO4Dqr+9lJTrXJ
 KiMvTXmOLSfSdjRlDI4jCoxBdg8tpbT3oJkXsFcHnrd5d4oTFGb0uuo5nFYPDicZ
 BGogDclzcqtlYl/2LUb+6vUXUQd77n0oW7RQ4yAaw3Qdj381om3Dmoeat8QU9Kdo
 Q4dhsHS6YAGR5R+G0zPfVOoKvSGoGV0NUdXr19QpYArGxKXcmiPjrgAJ/NGLsxvm
 8AbPjmQzOFJmUclHiiej6kvBsh2ZTYAesJMSAFLWD7EndXii7zljyJv0PIJ//uQ=
 =hNDW
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.12-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen updates from Konrad Rzeszutek Wilk:
 "A couple of features and a ton of bug-fixes.  There is also some
  maintership changes.  Jeremy is enjoying the full-time work at the
  startup and as much as he would love to help - he can't find the time.
  I have a bunch of other things that I promised to work on - paravirt
  diet, get SWIOTLB working everywhere, etc, but haven't been able to
  find the time.

  As such both David Vrabel and Boris Ostrovsky have graciously
  volunteered to help with the maintership role.  They will keep the lid
  on regressions, bug-fixes, etc.  I will be in the background to help -
  but eventually there will be less of me doing the Xen GIT pulls and
  more of them.  Stefano is still doing the ARM/ARM64 and will continue
  on doing so.

  Features:
   - Xen Trusted Platform Module (TPM) frontend driver - with the
     backend in MiniOS.
   - Scalability improvements in event channel.
   - Two extra Xen co-maintainers (David, Boris) and one going away (Jeremy)

  Bug-fixes:
   - Make the 1:1 mapping work during early bootup on selective regions.
   - Add scratch page to balloon driver to deal with unexpected code
     still holding on stale pages.
   - Allow NMIs on PV guests (64-bit only)
   - Remove unnecessary TLB flush in M2P code.
   - Fixes duplicate callbacks in Xen granttable code.
   - Fixes in PRIVCMD_MMAPBATCH ioctls to allow retries
   - Fix for events being lost due to rescheduling on different VCPUs.
   - More documentation"

* tag 'stable/for-linus-3.12-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (23 commits)
  hvc_xen: Remove unnecessary __GFP_ZERO from kzalloc
  drivers/xen-tpmfront: Fix compile issue with missing option.
  xen/balloon: don't set P2M entry for auto translated guest
  xen/evtchn: double free on error
  Xen: Fix retry calls into PRIVCMD_MMAPBATCH*.
  xen/pvhvm: Initialize xen panic handler for PVHVM guests
  xen/m2p: use GNTTABOP_unmap_and_replace to reinstate the original mapping
  xen: fix ARM build after 6efa20e4
  MAINTAINERS: Remove Jeremy from the Xen subsystem.
  xen/events: document behaviour when scanning the start word for events
  x86/xen: during early setup, only 1:1 map the ISA region
  x86/xen: disable premption when enabling local irqs
  swiotlb-xen: replace dma_length with sg_dma_len() macro
  swiotlb: replace dma_length with sg_dma_len() macro
  xen/balloon: set a mapping for ballooned out pages
  xen/evtchn: improve scalability by using per-user locks
  xen/p2m: avoid unneccesary TLB flush in m2p_remove_override()
  MAINTAINERS: Add in two extra co-maintainers of the Xen tree.
  MAINTAINERS: Update the Xen subsystem's with proper mailing list.
  xen: replace strict_strtoul() with kstrtoul()
  ...
2013-09-04 17:45:39 -07:00
Linus Torvalds
816434ec4a Merge branch 'x86-spinlocks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 spinlock changes from Ingo Molnar:
 "The biggest change here are paravirtualized ticket spinlocks (PV
  spinlocks), which bring a nice speedup on various benchmarks.

  The KVM host side will come to you via the KVM tree"

* 'x86-spinlocks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kvm/guest: Fix sparse warning: "symbol 'klock_waiting' was not declared as static"
  kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor
  kvm guest: Add configuration support to enable debug information for KVM Guests
  kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi
  xen, pvticketlock: Allow interrupts to be enabled while blocking
  x86, ticketlock: Add slowpath logic
  jump_label: Split jumplabel ratelimit
  x86, pvticketlock: When paravirtualizing ticket locks, increment by 2
  x86, pvticketlock: Use callee-save for lock_spinning
  xen, pvticketlocks: Add xen_nopvspin parameter to disable xen pv ticketlocks
  xen, pvticketlock: Xen implementation for PV ticket locks
  xen: Defer spinlock setup until boot CPU setup
  x86, ticketlock: Collapse a layer of functions
  x86, ticketlock: Don't inline _spin_unlock when using paravirt spinlocks
  x86, spinlock: Replace pv spinlocks with pv ticketlocks
2013-09-04 11:55:10 -07:00
Linus Torvalds
05eebfb26b Merge branch 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt changes from Ingo Molnar:
 "Hypervisor signature detection cleanup and fixes - the goal is to make
  KVM guests run better on MS/Hyperv and to generalize and factor out
  the code a bit"

* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Correctly detect hypervisor
  x86, kvm: Switch to use hypervisor_cpuid_base()
  xen: Switch to use hypervisor_cpuid_base()
  x86: Introduce hypervisor_cpuid_base()
2013-09-04 11:05:13 -07:00
Linus Torvalds
2a475501b8 Merge branch 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/asmlinkage changes from Ingo Molnar:
 "As a preparation for Andi Kleen's LTO patchset (link time
  optimizations using GCC's -flto which build time optimization has
  steadily increased in quality over the past few years and might
  eventually be usable for the kernel too) this tree includes a handful
  of preparatory patches that make function calling convention
  annotations consistent again:

   - Mark every function without arguments (or 64bit only) that is used
     by assembly code with asmlinkage()

   - Mark every function with parameters or variables that is used by
     assembly code as __visible.

  For the vanilla kernel this has documentation, consistency and
  debuggability advantages, for the time being"

* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/asmlinkage: Fix warning in xen asmlinkage change
  x86, asmlinkage, vdso: Mark vdso variables __visible
  x86, asmlinkage, power: Make various symbols used by the suspend asm code visible
  x86, asmlinkage: Make dump_stack visible
  x86, asmlinkage: Make 64bit checksum functions visible
  x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt ops
  x86, asmlinkage, apm: Make APM data structure used from assembler visible
  x86, asmlinkage: Make syscall tables visible
  x86, asmlinkage: Make several variables used from assembler/linker script visible
  x86, asmlinkage: Make kprobes code visible and fix assembler code
  x86, asmlinkage: Make various syscalls asmlinkage
  x86, asmlinkage: Make 32bit/64bit __switch_to visible
  x86, asmlinkage: Make _*_start_kernel visible
  x86, asmlinkage: Make all interrupt handlers asmlinkage / __visible
  x86, asmlinkage: Change dotraplinkage into __visible on 32bit
  x86: Fix sys_call_table type in asm/syscall.h
2013-09-04 08:42:44 -07:00
Andi Kleen
eb86b5fd50 x86/asmlinkage: Fix warning in xen asmlinkage change
Current code uses asmlinkage for functions without arguments.
This adds an implicit regparm(0) which creates a warning
when assigning the function to pointers.

Use __visible for the functions without arguments.
This avoids having to add regparm(0) to function pointers.
Since they have no arguments it does not make any difference.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1377115662-4865-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-22 09:17:19 +02:00
Linus Torvalds
d936d2d452 Bug-fixes:
- On ARM did not have balanced calls to get/put_cpu.
  - Fix to make tboot + Xen + Linux correctly.
  - Fix events VCPU binding issues.
  - Fix a vCPU online race where IPIs are sent to not-yet-online vCPU.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSFMaJAAoJEFjIrFwIi8fJ+/0H/32rLj60FpKXcPDCvID+9p8T
 XDGnFNttsxyhuzEzetOAd0aLKYKGnUaTDZBHfgSNipGCxjMLYgz84phRmHAYEj8u
 kai1Ag1WjhZilCmImzFvdHFiUwtvKwkeBIL/cZtKr1BetpnuuFsoVnwbH9FVjMpr
 TCg6sUwFq7xRyD1azo/cTLZFeiUqq0aQLw8J72YaapdS3SztHPeDHXlPpmLUdb6+
 hiSYveJMYp2V0SW8g8eLKDJxVr2QdPEfl9WpBzpLlLK8GrNw8BEU6hSOSLzxB7z/
 hDATAuZ5iHiIEi1uGfVjOyDws2ngUhmBKUH5x5iVIZd2P5c/ffLh2ePDVWGO5RI=
 =yMuS
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
 - On ARM did not have balanced calls to get/put_cpu.
 - Fix to make tboot + Xen + Linux correctly.
 - Fix events VCPU binding issues.
 - Fix a vCPU online race where IPIs are sent to not-yet-online vCPU.

* tag 'stable/for-linus-3.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/smp: initialize IPI vectors before marking CPU online
  xen/events: mask events when changing their VCPU binding
  xen/events: initialize local per-cpu mask for all possible events
  x86/xen: do not identity map UNUSABLE regions in the machine E820
  xen/arm: missing put_cpu in xen_percpu_init
2013-08-21 16:38:33 -07:00
Vaughan Cao
669b0ae961 xen/pvhvm: Initialize xen panic handler for PVHVM guests
kernel use callback linked in panic_notifier_list to notice others when panic
happens.
NORET_TYPE void panic(const char * fmt, ...){
    ...
    atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
}
When Xen becomes aware of this, it will call xen_reboot(SHUTDOWN_crash) to
send out an event with reason code - SHUTDOWN_crash.

xen_panic_handler_init() is defined to register on panic_notifier_list but
we only call it in xen_arch_setup which only be called by PV, this patch is
necessary for PVHVM.

Without this patch, setting 'on_crash=coredump-restart' in PVHVM guest config
file won't lead a vmcore to be generate when the guest panics. It can be
reproduced with 'echo c > /proc/sysrq-trigger'.

Signed-off-by: Vaughan Cao <vaughan.cao@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Joe Jin <joe.jin@oracle.com>
2013-08-20 15:37:03 -04:00
Stefano Stabellini
ee0726407f xen/m2p: use GNTTABOP_unmap_and_replace to reinstate the original mapping
GNTTABOP_unmap_grant_ref unmaps a grant and replaces it with a 0
mapping instead of reinstating the original mapping.
Doing so separately would be racy.

To unmap a grant and reinstate the original mapping atomically we use
GNTTABOP_unmap_and_replace.
GNTTABOP_unmap_and_replace doesn't work with GNTMAP_contains_pte, so
don't use it for kmaps.  GNTTABOP_unmap_and_replace zeroes the mapping
passed in new_addr so we have to reinstate it, however that is a
per-cpu mapping only used for balloon scratch pages, so we can be sure that
it's not going to be accessed while the mapping is not valid.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: alex@alex.org.uk
CC: dcrisan@flexiant.com

[v1: Konrad fixed up the conflicts]
Conflicts:
	arch/x86/xen/p2m.c
2013-08-20 10:25:35 -04:00
Chuck Anderson
fc78d343fa xen/smp: initialize IPI vectors before marking CPU online
An older PVHVM guest (v3.0 based) crashed during vCPU hot-plug with:

	kernel BUG at drivers/xen/events.c:1328!

RCU has detected that a CPU has not entered a quiescent state within the
grace period.  It needs to send the CPU a reschedule IPI if it is not
offline.  rcu_implicit_offline_qs() does this check:

	/*
	 * If the CPU is offline, it is in a quiescent state.  We can
	 * trust its state not to change because interrupts are disabled.
	 */
	if (cpu_is_offline(rdp->cpu)) {
		rdp->offline_fqs++;
		return 1;
	}

	Else the CPU is online.  Send it a reschedule IPI.

The CPU is in the middle of being hot-plugged and has been marked online
(!cpu_is_offline()).  See start_secondary():

	set_cpu_online(smp_processor_id(), true);
	...
	per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;

start_secondary() then waits for the CPU bringing up the hot-plugged CPU to
mark it as active:

	/*
	 * Wait until the cpu which brought this one up marked it
	 * online before enabling interrupts. If we don't do that then
	 * we can end up waking up the softirq thread before this cpu
	 * reached the active state, which makes the scheduler unhappy
	 * and schedule the softirq thread on the wrong cpu. This is
	 * only observable with forced threaded interrupts, but in
	 * theory it could also happen w/o them. It's just way harder
	 * to achieve.
	 */
	while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
		cpu_relax();

	/* enable local interrupts */
	local_irq_enable();

The CPU being hot-plugged will be marked active after it has been fully
initialized by the CPU managing the hot-plug.  In the Xen PVHVM case
xen_smp_intr_init() is called to set up the hot-plugged vCPU's
XEN_RESCHEDULE_VECTOR.

The hot-plugging CPU is marked online, not marked active and does not have
its IPI vectors set up.  rcu_implicit_offline_qs() sees the hot-plugging
cpu is !cpu_is_offline() and tries to send it a reschedule IPI:
This will lead to:

	kernel BUG at drivers/xen/events.c:1328!

	xen_send_IPI_one()
	xen_smp_send_reschedule()
	rcu_implicit_offline_qs()
	rcu_implicit_dynticks_qs()
	force_qs_rnp()
	force_quiescent_state()
	__rcu_process_callbacks()
	rcu_process_callbacks()
	__do_softirq()
	call_softirq()
	do_softirq()
	irq_exit()
	xen_evtchn_do_upcall()

because xen_send_IPI_one() will attempt to use an uninitialized IRQ for
the XEN_RESCHEDULE_VECTOR.

There is at least one other place that has caused the same crash:

	xen_smp_send_reschedule()
	wake_up_idle_cpu()
	add_timer_on()
	clocksource_watchdog()
	call_timer_fn()
	run_timer_softirq()
	__do_softirq()
	call_softirq()
	do_softirq()
	irq_exit()
	xen_evtchn_do_upcall()
	xen_hvm_callback_vector()

clocksource_watchdog() uses cpu_online_mask to pick the next CPU to handle
a watchdog timer:

	/*
	 * Cycle through CPUs to check if the CPUs stay synchronized
	 * to each other.
	 */
	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
	if (next_cpu >= nr_cpu_ids)
		next_cpu = cpumask_first(cpu_online_mask);
	watchdog_timer.expires += WATCHDOG_INTERVAL;
	add_timer_on(&watchdog_timer, next_cpu);

This resulted in an attempt to send an IPI to a hot-plugging CPU that
had not initialized its reschedule vector. One option would be to make
the RCU code check to not check for CPU offline but for CPU active.
As becoming active is done after a CPU is online (in older kernels).

But Srivatsa pointed out that "the cpu_active vs cpu_online ordering has been
completely reworked - in the online path, cpu_active is set *before* cpu_online,
and also, in the cpu offline path, the cpu_active bit is reset in the CPU_DYING
notification instead of CPU_DOWN_PREPARE." Drilling in this the bring-up
path: "[brought up CPU].. send out a CPU_STARTING notification, and in response
to that, the scheduler sets the CPU in the cpu_active_mask. Again, this mask
is better left to the scheduler alone, since it has the intelligence to use it
judiciously."

The conclusion was that:
"
1. At the IPI sender side:

   It is incorrect to send an IPI to an offline CPU (cpu not present in
   the cpu_online_mask). There are numerous places where we check this
   and warn/complain.

2. At the IPI receiver side:

   It is incorrect to let the world know of our presence (by setting
   ourselves in global bitmasks) until our initialization steps are complete
   to such an extent that we can handle the consequences (such as
   receiving interrupts without crashing the sender etc.)
" (from Srivatsa)

As the native code enables the interrupts at some point we need to be
able to service them. In other words a CPU must have valid IPI vectors
if it has been marked online.

It doesn't need to handle the IPI (interrupts may be disabled) but needs
to have valid IPI vectors because another CPU may find it in cpu_online_mask
and attempt to send it an IPI.

This patch will change the order of the Xen vCPU bring-up functions so that
Xen vectors have been set up before start_secondary() is called.
It also will not continue to bring up a Xen vCPU if xen_smp_intr_init() fails
to initialize it.

Orabug 13823853
Signed-off-by Chuck Anderson <chuck.anderson@oracle.com>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-08-20 10:13:05 -04:00
David Vrabel
e201bfcc5c x86/xen: during early setup, only 1:1 map the ISA region
During early setup, when the reserved regions and MMIO holes are being
setup as 1:1 in the p2m, clear any mappings instead of making them 1:1
(execept for the ISA region which is expected to be mapped).

This fixes a regression introduced in 3.5 by 83d51ab473 (xen/setup:
update VA mapping when releasing memory during setup) which caused
hosts with tboot to fail to boot.

tboot marks a region in the e820 map as unusable and the dom0 kernel
would attempt to map this region and Xen does not permit unusable
regions to be mapped by guests.

(XEN)  0000000000000000 - 0000000000060000 (usable)
(XEN)  0000000000060000 - 0000000000068000 (reserved)
(XEN)  0000000000068000 - 000000000009e000 (usable)
(XEN)  0000000000100000 - 0000000000800000 (usable)
(XEN)  0000000000800000 - 0000000000972000 (unusable)

tboot marked this region as unusable.

(XEN)  0000000000972000 - 00000000cf200000 (usable)
(XEN)  00000000cf200000 - 00000000cf38f000 (reserved)
(XEN)  00000000cf38f000 - 00000000cf3ce000 (ACPI data)
(XEN)  00000000cf3ce000 - 00000000d0000000 (reserved)
(XEN)  00000000e0000000 - 00000000f0000000 (reserved)
(XEN)  00000000fe000000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000630000000 (usable)

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-08-20 10:12:29 -04:00
David Vrabel
fb58e30091 x86/xen: disable premption when enabling local irqs
If CONFIG_PREEMPT is enabled then xen_enable_irq() (and
xen_restore_fl()) could be preempted and rescheduled on a different
VCPU in between the clear of the mask and the check for pending
events.  This may result in events being lost as the upcall will check
for pending events on the wrong VCPU.

Fix this by disabling preemption around the unmask and check for
events.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-08-20 10:02:03 -04:00
David Vrabel
3bc38cbceb x86/xen: do not identity map UNUSABLE regions in the machine E820
If there are UNUSABLE regions in the machine memory map, dom0 will
attempt to map them 1:1 which is not permitted by Xen and the kernel
will crash.

There isn't anything interesting in the UNUSABLE region that the dom0
kernel needs access to so we can avoid making the 1:1 mapping and
treat it as RAM.

We only do this for dom0, as that is where tboot case shows up.
A PV domU could have an UNUSABLE region in its pseudo-physical map
and would need to be handled in another patch.

This fixes a boot failure on hosts with tboot.

tboot marks a region in the e820 map as unusable and the dom0 kernel
would attempt to map this region and Xen does not permit unusable
regions to be mapped by guests.

  (XEN)  0000000000000000 - 0000000000060000 (usable)
  (XEN)  0000000000060000 - 0000000000068000 (reserved)
  (XEN)  0000000000068000 - 000000000009e000 (usable)
  (XEN)  0000000000100000 - 0000000000800000 (usable)
  (XEN)  0000000000800000 - 0000000000972000 (unusable)

tboot marked this region as unusable.

  (XEN)  0000000000972000 - 00000000cf200000 (usable)
  (XEN)  00000000cf200000 - 00000000cf38f000 (reserved)
  (XEN)  00000000cf38f000 - 00000000cf3ce000 (ACPI data)
  (XEN)  00000000cf3ce000 - 00000000d0000000 (reserved)
  (XEN)  00000000e0000000 - 00000000f0000000 (reserved)
  (XEN)  00000000fe000000 - 0000000100000000 (reserved)
  (XEN)  0000000100000000 - 0000000630000000 (usable)

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
[v1: Altered the patch and description with domU's with UNUSABLE regions]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-08-20 09:46:06 -04:00
David Vrabel
65a45fa2f6 xen/p2m: avoid unneccesary TLB flush in m2p_remove_override()
In m2p_remove_override() when removing the grant map from the kernel
mapping and replacing with a mapping to the original page, the grant
unmap will already have flushed the TLB and it is not necessary to do
it again after updating the mapping.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2013-08-09 11:06:44 -04:00
Konrad Rzeszutek Wilk
6efa20e49b xen: Support 64-bit PV guest receiving NMIs
This is based on a patch that Zhenzhong Duan had sent - which
was missing some of the remaining pieces. The kernel has the
logic to handle Xen-type-exceptions using the paravirt interface
in the assembler code (see PARAVIRT_ADJUST_EXCEPTION_FRAME -
pv_irq_ops.adjust_exception_frame and and INTERRUPT_RETURN -
pv_cpu_ops.iret).

That means the nmi handler (and other exception handlers) use
the hypervisor iret.

The other changes that would be neccessary for this would
be to translate the NMI_VECTOR to one of the entries on the
ipi_vector and make xen_send_IPI_mask_allbutself use different
events.

Fortunately for us commit 1db01b4903
(xen: Clean up apic ipi interface) implemented this and we piggyback
on the cleanup such that the apic IPI interface will pass the right
vector value for NMI.

With this patch we can trigger NMIs within a PV guest (only tested
x86_64).

For this to work with normal PV guests (not initial domain)
we need the domain to be able to use the APIC ops - they are
already implemented to use the Xen event channels. For that
to be turned on in a PV domU we need to remove the masking
of X86_FEATURE_APIC.

Incidentally that means kgdb will also now work within
a PV guest without using the 'nokgdbroundup' workaround.

Note that the 32-bit version is different and this patch
does not enable that.

CC: Lisa Nguyen <lisa@xenapiadmin.com>
CC: Ben Guthro <benjamin.guthro@citrix.com>
CC: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up per David Vrabel comments]
Reviewed-by: Ben Guthro <benjamin.guthro@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
2013-08-09 10:55:47 -04:00
Jeremy Fitzhardinge
1ed7bf5f52 xen, pvticketlock: Allow interrupts to be enabled while blocking
If interrupts were enabled when taking the spinlock, we can leave them
enabled while blocking to get the lock.

If we can enable interrupts while waiting for the lock to become
available, and we take an interrupt before entering the poll,
and the handler takes a spinlock which ends up going into
the slow state (invalidating the per-cpu "lock" and "want" values),
then when the interrupt handler returns the event channel will
remain pending so the poll will return immediately, causing it to
return out to the main spinlock loop.

Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Link: http://lkml.kernel.org/r/1376058122-8248-12-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09 07:54:03 -07:00
Jeremy Fitzhardinge
96f853eaa8 x86, ticketlock: Add slowpath logic
Maintain a flag in the LSB of the ticket lock tail which indicates
whether anyone is in the lock slowpath and may need kicking when
the current holder unlocks.  The flags are set when the first locker
enters the slowpath, and cleared when unlocking to an empty queue (ie,
no contention).

In the specific implementation of lock_spinning(), make sure to set
the slowpath flags on the lock just before blocking.  We must do
this before the last-chance pickup test to prevent a deadlock
with the unlocker:

Unlocker			Locker
				test for lock pickup
					-> fail
unlock
test slowpath
	-> false
				set slowpath flags
				block

Whereas this works in any ordering:

Unlocker			Locker
				set slowpath flags
				test for lock pickup
					-> fail
				block
unlock
test slowpath
	-> true, kick

If the unlocker finds that the lock has the slowpath flag set but it is
actually uncontended (ie, head == tail, so nobody is waiting), then it
clears the slowpath flag.

The unlock code uses a locked add to update the head counter.  This also
acts as a full memory barrier so that its safe to subsequently
read back the slowflag state, knowing that the updated lock is visible
to the other CPUs.  If it were an unlocked add, then the flag read may
just be forwarded from the store buffer before it was visible to the other
CPUs, which could result in a deadlock.

Unfortunately this means we need to do a locked instruction when
unlocking with PV ticketlocks.  However, if PV ticketlocks are not
enabled, then the old non-locked "add" is the only unlocking code.

Note: this code relies on gcc making sure that unlikely() code is out of
line of the fastpath, which only happens when OPTIMIZE_SIZE=n.  If it
doesn't the generated code isn't too bad, but its definitely suboptimal.

Thanks to Srivatsa Vaddagiri for providing a bugfix to the original
version of this change, which has been folded in.
Thanks to Stephan Diestelhorst for commenting on some code which relied
on an inaccurate reading of the x86 memory ordering rules.

Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Link: http://lkml.kernel.org/r/1376058122-8248-11-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stephan Diestelhorst <stephan.diestelhorst@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09 07:54:00 -07:00
Jeremy Fitzhardinge
354714dd26 x86, pvticketlock: Use callee-save for lock_spinning
Although the lock_spinning calls in the spinlock code are on the
uncommon path, their presence can cause the compiler to generate many
more register save/restores in the function pre/postamble, which is in
the fast path.  To avoid this, convert it to using the pvops callee-save
calling convention, which defers all the save/restores until the actual
function is called, keeping the fastpath clean.

Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Link: http://lkml.kernel.org/r/1376058122-8248-8-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Attilio Rao <attilio.rao@citrix.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09 07:53:44 -07:00
Jeremy Fitzhardinge
b8fa70b51a xen, pvticketlocks: Add xen_nopvspin parameter to disable xen pv ticketlocks
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Link: http://lkml.kernel.org/r/1376058122-8248-7-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09 07:53:37 -07:00
Jeremy Fitzhardinge
80bd58fef4 xen, pvticketlock: Xen implementation for PV ticket locks
Replace the old Xen implementation of PV spinlocks with and implementation
of xen_lock_spinning and xen_unlock_kick.

xen_lock_spinning simply registers the cpu in its entry in lock_waiting,
adds itself to the waiting_cpus set, and blocks on an event channel
until the channel becomes pending.

xen_unlock_kick searches the cpus in waiting_cpus looking for the one
which next wants this lock with the next ticket, if any.  If found,
it kicks it by making its event channel pending, which wakes it up.

We need to make sure interrupts are disabled while we're relying on the
contents of the per-cpu lock_waiting values, otherwise an interrupt
handler could come in, try to take some other lock, block, and overwrite
our values.

Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Link: http://lkml.kernel.org/r/1376058122-8248-6-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 [ Raghavendra:  use function + enum instead of macro, cmpxchg for zero status reset
Reintroduce break since we know the exact vCPU to send IPI as suggested by Konrad.]
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09 07:53:23 -07:00
Jeremy Fitzhardinge
bf7aab3ad4 xen: Defer spinlock setup until boot CPU setup
There's no need to do it at very early init, and doing it there
makes it impossible to use the jump_label machinery.

Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Link: http://lkml.kernel.org/r/1376058122-8248-5-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09 07:53:18 -07:00
Jeremy Fitzhardinge
545ac13892 x86, spinlock: Replace pv spinlocks with pv ticketlocks
Rather than outright replacing the entire spinlock implementation in
order to paravirtualize it, keep the ticket lock implementation but add
a couple of pvops hooks on the slow patch (long spin on lock, unlocking
a contended lock).

Ticket locks have a number of nice properties, but they also have some
surprising behaviours in virtual environments.  They enforce a strict
FIFO ordering on cpus trying to take a lock; however, if the hypervisor
scheduler does not schedule the cpus in the correct order, the system can
waste a huge amount of time spinning until the next cpu can take the lock.

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

To address this, we add two hooks:
 - __ticket_spin_lock which is called after the cpu has been
   spinning on the lock for a significant number of iterations but has
   failed to take the lock (presumably because the cpu holding the lock
   has been descheduled).  The lock_spinning pvop is expected to block
   the cpu until it has been kicked by the current lock holder.
 - __ticket_spin_unlock, which on releasing a contended lock
   (there are more cpus with tail tickets), it looks to see if the next
   cpu is blocked and wakes it if so.

When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
functions causes all the extra code to go away.

Results:
=======
setup: 32 core machine with 32 vcpu KVM guest (HT off)  with 8GB RAM
base = 3.11-rc
patched = base + pvspinlock V12

+-----------------+----------------+--------+
 dbench (Throughput in MB/sec. Higher is better)
+-----------------+----------------+--------+
|   base (stdev %)|patched(stdev%) | %gain  |
+-----------------+----------------+--------+
| 15035.3   (0.3) |15150.0   (0.6) |   0.8  |
|  1470.0   (2.2) | 1713.7   (1.9) |  16.6  |
|   848.6   (4.3) |  967.8   (4.3) |  14.0  |
|   652.9   (3.5) |  685.3   (3.7) |   5.0  |
+-----------------+----------------+--------+

pvspinlock shows benefits for overcommit ratio > 1 for PLE enabled cases,
and undercommits results are flat

Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Link: http://lkml.kernel.org/r/1376058122-8248-2-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Attilio Rao <attilio.rao@citrix.com>
[ Raghavendra: Changed SPIN_THRESHOLD, fixed redefinition of arch_spinlock_t]
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-09 07:53:05 -07:00
Andi Kleen
9a55fdbe94 x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt ops
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1375740170-7446-13-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-06 14:20:56 -07:00
Jason Wang
9df56f19a5 x86: Correctly detect hypervisor
We try to handle the hypervisor compatibility mode by detecting hypervisor
through a specific order. This is not robust, since hypervisors may implement
each others features.

This patch tries to handle this situation by always choosing the last one in the
CPUID leaves. This is done by letting .detect() return a priority instead of
true/false and just re-using the CPUID leaf where the signature were found as
the priority (or 1 if it was found by DMI). Then we can just pick hypervisor who
has the highest priority. Other sophisticated detection method could also be
implemented on top.

Suggested by H. Peter Anvin and Paolo Bonzini.

Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Doug Covelli <dcovelli@vmware.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Hecht <dhecht@vmware.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: http://lkml.kernel.org/r/1374742475-2485-4-git-send-email-jasowang@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-08-05 06:35:33 -07:00
Paul Gortmaker
148f9bb877 x86: delete __cpuinit usage from all x86 files
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
are flagged as __cpuinit  -- so if we remove the __cpuinit from
arch specific callers, we will also get section mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings.  In any case, they are temporary and harmless.

This removes all the arch/x86 uses of the __cpuinit macros from
all C files.  x86 only had the one __CPUINIT used in assembly files,
and it wasn't paired off with a .previous or a __FINIT, so we can
delete it directly w/o any corresponding additional change there.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-07-14 19:36:56 -04:00
Linus Torvalds
21884a83b2 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
 "The timer changes contain:

   - posix timer code consolidation and fixes for odd corner cases

   - sched_clock implementation moved from ARM to core code to avoid
     duplication by other architectures

   - alarm timer updates

   - clocksource and clockevents unregistration facilities

   - clocksource/events support for new hardware

   - precise nanoseconds RTC readout (Xen feature)

   - generic support for Xen suspend/resume oddities

   - the usual lot of fixes and cleanups all over the place

  The parts which touch other areas (ARM/XEN) have been coordinated with
  the relevant maintainers.  Though this results in an handful of
  trivial to solve merge conflicts, which we preferred over nasty cross
  tree merge dependencies.

  The patches which have been committed in the last few days are bug
  fixes plus the posix timer lot.  The latter was in akpms queue and
  next for quite some time; they just got forgotten and Frederic
  collected them last minute."

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
  hrtimer: Remove unused variable
  hrtimers: Move SMP function call to thread context
  clocksource: Reselect clocksource when watchdog validated high-res capability
  posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting
  posix_timers: fix racy timer delta caching on task exit
  posix-timers: correctly get dying task time sample in posix_cpu_timer_schedule()
  selftests: add basic posix timers selftests
  posix_cpu_timers: consolidate expired timers check
  posix_cpu_timers: consolidate timer list cleanups
  posix_cpu_timer: consolidate expiry time type
  tick: Sanitize broadcast control logic
  tick: Prevent uncontrolled switch to oneshot mode
  tick: Make oneshot broadcast robust vs. CPU offlining
  x86: xen: Sync the CMOS RTC as well as the Xen wallclock
  x86: xen: Sync the wallclock when the system time is set
  timekeeping: Indicate that clock was set in the pvclock gtod notifier
  timekeeping: Pass flags instead of multiple bools to timekeeping_update()
  xen: Remove clock_was_set() call in the resume path
  hrtimers: Support resuming with two or more CPUs online (but stopped)
  timer: Fix jiffies wrap behavior of round_jiffies_common()
  ...
2013-07-06 14:09:38 -07:00
Thomas Gleixner
2b0f89317e Merge branch 'timers/posix-cpu-timers-for-tglx' of
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core

Frederic sayed: "Most of these patches have been hanging around for
several month now, in -mmotm for a significant chunk. They already
missed a few releases."

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-07-04 23:11:22 +02:00
Linus Torvalds
3e34131a65 Bug-fixes:
* Fix memory leak when CPU hotplugging.
 * Compile bugs with various #ifdefs
 * Fix state changes in Xen PCI front not dealing well with new toolstack.
 * Cleanups in code (use pr_*, fix 80 characters splits, etc)
 * Long standing bug in double-reporting the steal time
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQEcBAABAgAGBQJR0Xc1AAoJEFjIrFwIi8fJo68H/jZaJJmytDI7exHTyq8fSGXQ
 5OERw5YZeM5jZQG55YC5hmGS5oIpKEdBt+aAEpuofYUhrR/ZFqDr0j+QEiqC36bl
 cl0/IAnMBGnyyO6FYY4Sut2H+S5BGYQNbwo9YAtgKtZANr2eLABxYUfMU44I/jCW
 M7DAojME9OZLBW3ORYZTGf1A0T8hJINhxZIWhtLMrkckCb9AZMieKdMOHJEWq2jl
 aPxx78U+2CLTdLquOLIiBEiTO3vAzx2Wt4prD+uCizWna45H9gBj7GFwBYH7p9Ry
 TFyzmDc5PufThTilfDyQW1y4yiNLpFDC/67a1wpwObEBn87hstHgwHQQ5INzqwY=
 =Ej7M
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.11-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen bugfixes from Konrad Rzeszutek Wilk:
 - Fix memory leak when CPU hotplugging.
 - Compile bugs with various #ifdefs
 - Fix state changes in Xen PCI front not dealing well with new
   toolstack.
 - Cleanups in code (use pr_*, fix 80 characters splits, etc)
 - Long standing bug in double-reporting the steal time

* tag 'stable/for-linus-3.11-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/time: remove blocked time accounting from xen "clockchip"
  xen: Convert printks to pr_<level>
  xen: ifdef CONFIG_HIBERNATE_CALLBACKS xen_*_suspend
  xen/pcifront: Deal with toolstack missing 'XenbusStateClosing' state.
  xen/time: Free onlined per-cpu data structure if we want to online it again.
  xen/time: Check that the per_cpu data structure has data before freeing.
  xen/time: Don't leak interrupt name when offlining.
  xen/time: Encapsulate the struct clock_event_device in another structure.
  xen/spinlock: Don't leak interrupt name when offlining.
  xen/smp: Don't leak interrupt name when offlining.
  xen/smp: Set the per-cpu IRQ number to a valid default.
  xen/smp: Introduce a common structure to contain the IRQ name and interrupt line.
  xen/smp: Coalesce the free_irq calls in one function.
  xen-pciback: fix error return code in pcistub_irq_handler_switch()
2013-07-03 13:12:42 -07:00
Linus Torvalds
d652df0b2f Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU changes from Ingo Molnar:
 "There are two bigger changes in this tree:

   - Add an [early-use-]safe static_cpu_has() variant and other
     robustness improvements, including the new X86_DEBUG_STATIC_CPU_HAS
     configurable debugging facility, motivated by recent obscure FPU
     code bugs, by Borislav Petkov

   - Reimplement FPU detection code in C and drop the old asm code, by
     Peter Anvin."

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, fpu: Use static_cpu_has_safe before alternatives
  x86: Add a static_cpu_has_safe variant
  x86: Sanity-check static_cpu_has usage
  x86, cpu: Add a synthetic, always true, cpu feature
  x86: Get rid of ->hard_math and all the FPU asm fu
2013-07-02 16:26:44 -07:00
David Vrabel
47433b8c9d x86: xen: Sync the CMOS RTC as well as the Xen wallclock
Adjustments to Xen's persistent clock via update_persistent_clock()
don't actually persist, as the Xen wallclock is a software only clock
and modifications to it do not modify the underlying CMOS RTC.

The x86_platform.set_wallclock hook is there to keep the hardware RTC
synchronized. On a guest this is pointless.

On Dom0 we can use the native implementaion which actually updates the
hardware RTC, but we still need to keep the software emulation of RTC
for the guests up to date. The subscription to the pvclock_notifier
allows us to emulate this easily. The notifier is called at every tick
and when the clock was set.

Right now we only use that notifier when the clock was set, but due to
the fact that it is called periodically from the timekeeping update
code, we can utilize it to emulate the NTP driven drift compensation
of update_persistant_clock() for the Xen wall (software) clock.

Add a 11 minutes periodic update to the pvclock_gtod notifier callback
to achieve that. The static variable 'next' which maintains that 11
minutes update cycle is protected by the core code serialization so
there is no need to add a Xen specific serialization mechanism.

[ tglx: Massaged changelog and added a few comments ]

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/1372329348-20841-6-git-send-email-david.vrabel@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-28 23:15:07 +02:00
David Vrabel
5584880e44 x86: xen: Sync the wallclock when the system time is set
Currently the Xen wallclock is only updated every 11 minutes if NTP is
synchronized to its clock source (using the sync_cmos_clock() work).
If a guest is started before NTP is synchronized it may see an
incorrect wallclock time.

Use the pvclock_gtod notifier chain to receive a notification when the
system time has changed and update the wallclock to match.

This chain is called on every timer tick and we want to avoid an extra
(expensive) hypercall on every tick.  Because dom0 has historically
never provided a very accurate wallclock and guests do not expect one,
we can do this simply: the wallclock is only updated if the clock was
set.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: <xen-devel@lists.xen.org>
Link: http://lkml.kernel.org/r/1372329348-20841-5-git-send-email-david.vrabel@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-06-28 23:15:06 +02:00
Laszlo Ersek
0b0c002c34 xen/time: remove blocked time accounting from xen "clockchip"
... because the "clock_event_device framework" already accounts for idle
time through the "event_handler" function pointer in
xen_timer_interrupt().

The patch is intended as the completion of [1]. It should fix the double
idle times seen in PV guests' /proc/stat [2]. It should be orthogonal to
stolen time accounting (the removed code seems to be isolated).

The approach may be completely misguided.

[1] https://lkml.org/lkml/2011/10/6/10
[2] http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01068.html

John took the time to retest this patch on top of v3.10 and reported:
"idle time is correctly incremented for pv and hvm for the normal
case, nohz=off and nohz=idle." so lets put this patch in.

CC: stable@vger.kernel.org
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-28 12:11:39 -04:00
Konrad Rzeszutek Wilk
09e99da766 xen/time: Free onlined per-cpu data structure if we want to online it again.
If the per-cpu time data structure has been onlined already and
we are trying to online it again, then free the previous copy
before blindly over-writting it.

A developer naturally should not call this function multiple times
but just in case.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:37 -04:00
Konrad Rzeszutek Wilk
a05e2c371f xen/time: Check that the per_cpu data structure has data before freeing.
We don't check whether the per_cpu data structure has actually
been freed in the past. This checks it and if it has been freed
in the past then just continues on without double-freeing.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:36 -04:00
Konrad Rzeszutek Wilk
c9d76a24a2 xen/time: Don't leak interrupt name when offlining.
When the user does:
    echo 0 > /sys/devices/system/cpu/cpu1/online
    echo 1 > /sys/devices/system/cpu/cpu1/online

kmemleak reports:
kmemleak: 7 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

One of the leaks is from xen/time:

unreferenced object 0xffff88003fa51280 (size 32):
  comm "swapper/0", pid 1, jiffies 4294667339 (age 1027.789s)
  hex dump (first 32 bytes):
    74 69 6d 65 72 31 00 00 00 00 00 00 00 00 00 00  timer1..........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81660721>] kmemleak_alloc+0x21/0x50
    [<ffffffff81190aac>] __kmalloc_track_caller+0xec/0x2a0
    [<ffffffff812fe1bb>] kvasprintf+0x5b/0x90
    [<ffffffff812fe228>] kasprintf+0x38/0x40
    [<ffffffff81041ec1>] xen_setup_timer+0x51/0xf0
    [<ffffffff8166339f>] xen_cpu_up+0x5f/0x3e8
    [<ffffffff8166bbf5>] _cpu_up+0xd1/0x14b
    [<ffffffff8166bd48>] cpu_up+0xd9/0xec
    [<ffffffff81ae6e4a>] smp_init+0x4b/0xa3
    [<ffffffff81ac4981>] kernel_init_freeable+0xdb/0x1e6
    [<ffffffff8165ce39>] kernel_init+0x9/0xf0
    [<ffffffff8167edfc>] ret_from_fork+0x7c/0xb0
    [<ffffffffffffffff>] 0xffffffffffffffff

This patch fixes it by stashing away the 'name' in the per-cpu
data structure and freeing it when offlining the CPU.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:35 -04:00
Konrad Rzeszutek Wilk
31620a198c xen/time: Encapsulate the struct clock_event_device in another structure.
We don't do any code movement. We just encapsulate the struct clock_event_device
in a new structure which contains said structure and a pointer to
a char *name. The 'name' will be used in 'xen/time: Don't leak interrupt
name when offlining'.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:34 -04:00
Konrad Rzeszutek Wilk
354e7b7619 xen/spinlock: Don't leak interrupt name when offlining.
When the user does:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

kmemleak reports:
kmemleak: 7 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

unreferenced object 0xffff88003fa51260 (size 32):
  comm "swapper/0", pid 1, jiffies 4294667339 (age 1027.789s)
  hex dump (first 32 bytes):
    73 70 69 6e 6c 6f 63 6b 31 00 00 00 00 00 00 00  spinlock1.......
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81660721>] kmemleak_alloc+0x21/0x50
    [<ffffffff81190aac>] __kmalloc_track_caller+0xec/0x2a0
    [<ffffffff812fe1bb>] kvasprintf+0x5b/0x90
    [<ffffffff812fe228>] kasprintf+0x38/0x40
    [<ffffffff81663789>] xen_init_lock_cpu+0x61/0xbe
    [<ffffffff816633a6>] xen_cpu_up+0x66/0x3e8
    [<ffffffff8166bbf5>] _cpu_up+0xd1/0x14b
    [<ffffffff8166bd48>] cpu_up+0xd9/0xec
    [<ffffffff81ae6e4a>] smp_init+0x4b/0xa3
    [<ffffffff81ac4981>] kernel_init_freeable+0xdb/0x1e6
    [<ffffffff8165ce39>] kernel_init+0x9/0xf0
    [<ffffffff8167edfc>] ret_from_fork+0x7c/0xb0
    [<ffffffffffffffff>] 0xffffffffffffffff

Instead of doing it like the "xen/smp: Don't leak interrupt name when offlining"
patch did (which has a per-cpu structure which contains both the
IRQ number and char*) we use a per-cpu pointers to a *char.

The reason is that the "__this_cpu_read(lock_kicker_irq);" macro
blows up with "__bad_size_call_parameter()" as the size of the
returned structure is not within the parameters of what it expects
and optimizes for.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:33 -04:00
Konrad Rzeszutek Wilk
b85fffec7f xen/smp: Don't leak interrupt name when offlining.
When the user does:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

kmemleak reports:
kmemleak: 7 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

unreferenced object 0xffff88003fa51240 (size 32):
  comm "swapper/0", pid 1, jiffies 4294667339 (age 1027.789s)
  hex dump (first 32 bytes):
    72 65 73 63 68 65 64 31 00 00 00 00 00 00 00 00  resched1........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81660721>] kmemleak_alloc+0x21/0x50
    [<ffffffff81190aac>] __kmalloc_track_caller+0xec/0x2a0
    [<ffffffff812fe1bb>] kvasprintf+0x5b/0x90
    [<ffffffff812fe228>] kasprintf+0x38/0x40
    [<ffffffff81047ed1>] xen_smp_intr_init+0x41/0x2c0
    [<ffffffff816636d3>] xen_cpu_up+0x393/0x3e8
    [<ffffffff8166bbf5>] _cpu_up+0xd1/0x14b
    [<ffffffff8166bd48>] cpu_up+0xd9/0xec
    [<ffffffff81ae6e4a>] smp_init+0x4b/0xa3
    [<ffffffff81ac4981>] kernel_init_freeable+0xdb/0x1e6
    [<ffffffff8165ce39>] kernel_init+0x9/0xf0
    [<ffffffff8167edfc>] ret_from_fork+0x7c/0xb0
    [<ffffffffffffffff>] 0xffffffffffffffff

This patch fixes some of it by using the 'struct xen_common_irq->name'
field to stash away the char so that it can be freed when
the interrupt line is destroyed.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:32 -04:00
Konrad Rzeszutek Wilk
ee336e10d5 xen/smp: Set the per-cpu IRQ number to a valid default.
When we free it we want to make sure to set it to a default
value of -1 so that we don't double-free it (in case somebody
calls us twice).

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:31 -04:00
Konrad Rzeszutek Wilk
9547689fcd xen/smp: Introduce a common structure to contain the IRQ name and interrupt line.
This patch adds a new structure to contain the common two things
that each of the per-cpu interrupts need:
 - an interrupt number,
 - and the name of the interrupt (to be added in 'xen/smp: Don't leak
   interrupt name when offlining').

This allows us to carry the tuple of the per-cpu interrupt data structure
and expand it as we need in the future.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:30 -04:00
Konrad Rzeszutek Wilk
53b94fdc8f xen/smp: Coalesce the free_irq calls in one function.
There are two functions that do a bunch of 'free_irq' on
the per_cpu IRQ. Instead of having duplicate code just move
it to one function.

This is just code movement.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-10 08:43:29 -04:00
H. Peter Anvin
60e019eb37 x86: Get rid of ->hard_math and all the FPU asm fu
Reimplement FPU detection code in C and drop old, not-so-recommended
detection method in asm. Move all the relevant stuff into i387.c where
it conceptually belongs. Finally drop cpuinfo_x86.hard_math.

[ hpa: huge thanks to Borislav for taking my original concept patch
  and productizing it ]

[ Boris, note to self: do not use static_cpu_has before alternatives! ]

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1367244262-29511-2-git-send-email-bp@alien8.de
Link: http://lkml.kernel.org/r/1365436666-9837-2-git-send-email-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-06-06 14:32:04 -07:00
Konrad Rzeszutek Wilk
466318a87f xen/smp: Fixup NOHZ per cpu data when onlining an offline CPU.
The xen_play_dead is an undead function. When the vCPU is told to
offline it ends up calling xen_play_dead wherin it calls the
VCPUOP_down hypercall which offlines the vCPU. However, when the
vCPU is onlined back, it resumes execution right after
VCPUOP_down hypercall.

That was OK (albeit the API for play_dead assumes that the CPU
stays dead and never returns) but with commit 4b0c0f294
(tick: Cleanup NOHZ per cpu data on cpu down) that is no longer safe
as said commit resets the ts->inidle which at the start of the
cpu_idle loop was set.

The net effect is that we get this warn:

Broke affinity for irq 16
installing Xen timer for CPU 1
cpu 1 spinlock event irq 48
------------[ cut here ]------------
WARNING: at /home/konrad/linux-linus/kernel/time/tick-sched.c:935 tick_nohz_idle_exit+0x195/0x1b0()
Modules linked in: dm_multipath dm_mod xen_evtchn iscsi_boot_sysfs
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-rc3upstream-00068-gdcdbe33 #1
Hardware name: BIOSTAR Group N61PB-M2S/N61PB-M2S, BIOS 6.00 PG 09/03/2009
 ffffffff8193b448 ffff880039da5e60 ffffffff816707c8 ffff880039da5ea0
 ffffffff8108ce8b ffff880039da4010 ffff88003fa8e500 ffff880039da4010
 0000000000000001 ffff880039da4000 ffff880039da4010 ffff880039da5eb0
Call Trace:
 [<ffffffff816707c8>] dump_stack+0x19/0x1b
 [<ffffffff8108ce8b>] warn_slowpath_common+0x6b/0xa0
 [<ffffffff8108ced5>] warn_slowpath_null+0x15/0x20
 [<ffffffff810e4745>] tick_nohz_idle_exit+0x195/0x1b0
 [<ffffffff810da755>] cpu_startup_entry+0x205/0x250
 [<ffffffff81661070>] cpu_bringup_and_idle+0x13/0x15
---[ end trace 915c8c486004dda1 ]---

b/c ts_inidle is set to zero. Thomas suggested that we just add a workaround
to call tick_nohz_idle_enter before returning from xen_play_dead() - and
that is what this patch does and fixes the issue.

We also add the stable part b/c git commit 4b0c0f294 is on the stable
tree.

CC: stable@vger.kernel.org
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-06-04 09:03:18 -04:00
Stefan Bader
1db01b4903 xen: Clean up apic ipi interface
Commit f447d56d36 introduced the
implementation of the PV apic ipi interface. But there were some
odd things (it seems none of which cause really any issue but
maybe they should be cleaned up anyway):
 - xen_send_IPI_mask_allbutself (and by that xen_send_IPI_allbutself)
   ignore the passed in vector and only use the CALL_FUNCTION_SINGLE
   vector. While xen_send_IPI_all and xen_send_IPI_mask use the vector.
 - physflat_send_IPI_allbutself is declared unnecessarily. It is never
   used.

This patch tries to clean up those things.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-29 09:04:21 -04:00
David Vrabel
3565184ed0 x86: Increase precision of x86_platform.get/set_wallclock()
All the virtualized platforms (KVM, lguest and Xen) have persistent
wallclocks that have more than one second of precision.

read_persistent_wallclock() and update_persistent_wallclock() allow
for nanosecond precision but their implementation on x86 with
x86_platform.get/set_wallclock() only allows for one second precision.
This means guests may see a wallclock time that is off by up to 1
second.

Make set_wallclock() and get_wallclock() take a struct timespec
parameter (which allows for nanosecond precision) so KVM and Xen
guests may start with a more accurate wallclock time and a Xen dom0
can maintain a more accurate wallclock for guests.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-05-28 14:00:59 -07:00
Linus Torvalds
607eeb0b83 Bug-fixes:
- More fixes in the vCPU PVHVM hotplug path.
  - Add more documentation.
  - Fix various ARM related issues in the Xen generic drivers.
  - Updates in the xen-pciback driver per Bjorn's updates.
  - Mask the x2APIC feature for PV guests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQEcBAABAgAGBQJRjPL5AAoJEFjIrFwIi8fJdlIIANXawH+B+aFbqsFSKOOh76XN
 smgICU78SVzKpW9WAPYK7YFqSdNN4AleGC2Mn2lSkiaqgciRyDb9Yt+OSMMts2Xn
 ZVbFkGhEKR+DtZfTKo9YgsGatul/McTiVEkuuli+aN5dql3WXDLAaA+/b9bO3ohh
 TCWtWNuSCGmlfDoJET2je+J6CgKvCErH3fvzKNxgYxytcGhAvxoVK/lC4d3pnq/m
 wQUAIcF8XYENqC2m1WDR0OGveAB0Me0j9g+UkQS+TzqA8GPmxC4aptjkroFYhOz6
 8nZp+LanimmTI6olVioWEXkCr5+dxb058jQKwncQfonFpl58RS0qUrz5zoe3etU=
 =h9SX
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.10-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
 - More fixes in the vCPU PVHVM hotplug path.
 - Add more documentation.
 - Fix various ARM related issues in the Xen generic drivers.
 - Updates in the xen-pciback driver per Bjorn's updates.
 - Mask the x2APIC feature for PV guests.

* tag 'stable/for-linus-3.10-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/pci: Used cached MSI-X capability offset
  xen/pci: Use PCI_MSIX_TABLE_BIR, not PCI_MSIX_FLAGS_BIRMASK
  xen: clear IRQ_NOAUTOEN and IRQ_NOREQUEST
  xen: mask x2APIC feature in PV
  xen: SWIOTLB is only used on x86
  xen/spinlock: Fix check from greater than to be also be greater or equal to.
  xen/smp/pvhvm: Don't point per_cpu(xen_vpcu, 33 and larger) to shared_info
  xen/vcpu: Document the xen_vcpu_info and xen_vcpu
  xen/vcpu/pvhvm: Fix vcpu hotplugging hanging.
2013-05-11 16:19:30 -07:00
Zhenzhong Duan
4ea9b9aca9 xen: mask x2APIC feature in PV
On x2apic enabled pvm, doing sysrq+l, got NULL pointer dereference as below.

    SysRq : Show backtrace of all active CPUs
    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: [<ffffffff8125e3cb>] memcpy+0xb/0x120
    Call Trace:
     [<ffffffff81039633>] ? __x2apic_send_IPI_mask+0x73/0x160
     [<ffffffff8103973e>] x2apic_send_IPI_all+0x1e/0x20
     [<ffffffff8103498c>] arch_trigger_all_cpu_backtrace+0x6c/0xb0
     [<ffffffff81501be4>] ? _raw_spin_lock_irqsave+0x34/0x50
     [<ffffffff8131654e>] sysrq_handle_showallcpus+0xe/0x10
     [<ffffffff8131616d>] __handle_sysrq+0x7d/0x140
     [<ffffffff81316230>] ? __handle_sysrq+0x140/0x140
     [<ffffffff81316287>] write_sysrq_trigger+0x57/0x60
     [<ffffffff811ca996>] proc_reg_write+0x86/0xc0
     [<ffffffff8116dd8e>] vfs_write+0xce/0x190
     [<ffffffff8116e3e5>] sys_write+0x55/0x90
     [<ffffffff8150a242>] system_call_fastpath+0x16/0x1b

That's because apic points to apic_x2apic_cluster or apic_x2apic_phys
but the basic element like cpumask isn't initialized.

Mask x2APIC feature in pvm to avoid overwrite of apic pointer,
update commit message per Konrad's suggestion.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Tested-by: Tamon Shiose <tamon.shiose@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-08 08:38:11 -04:00
Konrad Rzeszutek Wilk
cb91f8f44c xen/spinlock: Fix check from greater than to be also be greater or equal to.
During review of git commit cb9c6f15f3
("xen/spinlock:  Check against default value of -1 for IRQ line.")
Stefano pointed out a bug in the patch. Unfortunatly due to vacation
timing the fix was not applied and this patch fixes it up.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-08 08:38:09 -04:00
Konrad Rzeszutek Wilk
d5b17dbff8 xen/smp/pvhvm: Don't point per_cpu(xen_vpcu, 33 and larger) to shared_info
As it will point to some data, but not event channel data (the
shared_info has an array limited to 32).

This means that for PVHVM guests with more than 32 VCPUs without
the usage of VCPUOP_register_info any interrupts to VCPUs
larger than 32 would have gone unnoticed during early bootup.

That is OK, as during early bootup, in smp_init we end up calling
the hotplug mechanism (xen_hvm_cpu_notify) which makes the
VCPUOP_register_vcpu_info call for all VCPUs and we can receive
interrupts on VCPUs 33 and further.

This is just a cleanup.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-08 08:38:08 -04:00
Konrad Rzeszutek Wilk
a520996ae2 xen/vcpu: Document the xen_vcpu_info and xen_vcpu
They are important structures and it is not clear at first
look what they are for.

The xen_vcpu is a pointer. By default it points to the shared_info
structure (at the CPU offset location). However if the
VCPUOP_register_vcpu_info hypercall is implemented we can make the
xen_vcpu pointer point to a per-CPU location.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v1: Added comments from Ian Campbell]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-07 10:05:34 -04:00
Konrad Rzeszutek Wilk
7f1fc268c4 xen/vcpu/pvhvm: Fix vcpu hotplugging hanging.
If a user did:

	echo 0 > /sys/devices/system/cpu/cpu1/online
	echo 1 > /sys/devices/system/cpu/cpu1/online

we would (this a build with DEBUG enabled) get to:
smpboot: ++++++++++++++++++++=_---CPU UP  1
.. snip..
smpboot: Stack at about ffff880074c0ff44
smpboot: CPU1: has booted.

and hang. The RCU mechanism would kick in an try to IPI the CPU1
but the IPIs (and all other interrupts) would never arrive at the
CPU1. At first glance at least. A bit digging in the hypervisor
trace shows that (using xenanalyze):

[vla] d4v1 vec 243 injecting
   0.043163027 --|x d4v1 intr_window vec 243 src 5(vector) intr f3
]  0.043163639 --|x d4v1 vmentry cycles 1468
]  0.043164913 --|x d4v1 vmexit exit_reason PENDING_INTERRUPT eip ffffffff81673254
   0.043164913 --|x d4v1 inj_virq vec 243  real
  [vla] d4v1 vec 243 injecting
   0.043164913 --|x d4v1 intr_window vec 243 src 5(vector) intr f3
]  0.043165526 --|x d4v1 vmentry cycles 1472
]  0.043166800 --|x d4v1 vmexit exit_reason PENDING_INTERRUPT eip ffffffff81673254
   0.043166800 --|x d4v1 inj_virq vec 243  real
  [vla] d4v1 vec 243 injecting

there is a pending event (subsequent debugging shows it is the IPI
from the VCPU0 when smpboot.c on VCPU1 has done
"set_cpu_online(smp_processor_id(), true)") and the guest VCPU1 is
interrupted with the callback IPI (0xf3 aka 243) which ends up calling
__xen_evtchn_do_upcall.

The __xen_evtchn_do_upcall seems to do *something* but not acknowledge
the pending events. And the moment the guest does a 'cli' (that is the
ffffffff81673254 in the log above) the hypervisor is invoked again to
inject the IPI (0xf3) to tell the guest it has pending interrupts.
This repeats itself forever.

The culprit was the per_cpu(xen_vcpu, cpu) pointer. At the bootup
we set each per_cpu(xen_vcpu, cpu) to point to the
shared_info->vcpu_info[vcpu] but later on use the VCPUOP_register_vcpu_info
to register per-CPU  structures (xen_vcpu_setup).
This is used to allow events for more than 32 VCPUs and for performance
optimizations reasons.

When the user performs the VCPU hotplug we end up calling the
the xen_vcpu_setup once more. We make the hypercall which returns
-EINVAL as it does not allow multiple registration calls (and
already has re-assigned where the events are being set). We pick
the fallback case and set per_cpu(xen_vcpu, cpu) to point to the
shared_info->vcpu_info[vcpu] (which is a good fallback during bootup).
However the hypervisor is still setting events in the register
per-cpu structure (per_cpu(xen_vcpu_info, cpu)).

As such when the events are set by the hypervisor (such as timer one),
and when we iterate in __xen_evtchn_do_upcall we end up reading stale
events from the shared_info->vcpu_info[vcpu] instead of the
per_cpu(xen_vcpu_info, cpu) structures. Hence we never acknowledge the
events that the hypervisor has set and the hypervisor keeps on reminding
us to ack the events which we never do.

The fix is simple. Don't on the second time when xen_vcpu_setup is
called over-write the per_cpu(xen_vcpu, cpu) if it points to
per_cpu(xen_vcpu_info).

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-05-06 11:29:17 -04:00
Linus Torvalds
1e2f5b598a Merge branch 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt update from Ingo Molnar:
 "Various paravirtualization related changes - the biggest one makes
  guest support optional via CONFIG_HYPERVISOR_GUEST"

* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, wakeup, sleep: Use pvops functions for changing GDT entries
  x86, xen, gdt: Remove the pvops variant of store_gdt.
  x86-32, gdt: Store/load GDT for ACPI S3 or hibernation/resume path is not needed
  x86-64, gdt: Store/load GDT for ACPI S3 or hibernate/resume path is not needed.
  x86: Make Linux guest support optional
  x86, Kconfig: Move PARAVIRT_DEBUG into the paravirt menu
2013-04-30 08:41:21 -07:00
Linus Torvalds
01c7cd0ef5 Merge branch 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perparatory x86 kasrl changes from Ingo Molnar:
 "This contains changes from the ongoing KASLR work, by Kees Cook.

  The main changes are the use of a read-only IDT on x86 (which
  decouples the userspace visible virtual IDT address from the physical
  address), and a rework of ELF relocation support, in preparation of
  random, boot-time kernel image relocation."

* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, relocs: Refactor the relocs tool to merge 32- and 64-bit ELF
  x86, relocs: Build separate 32/64-bit tools
  x86, relocs: Add 64-bit ELF support to relocs tool
  x86, relocs: Consolidate processing logic
  x86, relocs: Generalize ELF structure names
  x86: Use a read-only IDT alias on all CPUs
2013-04-30 08:37:24 -07:00
Linus Torvalds
8700c95adb Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP/hotplug changes from Ingo Molnar:
 "This is a pretty large, multi-arch series unifying and generalizing
  the various disjunct pieces of idle routines that architectures have
  historically copied from each other and have grown in random, wildly
  inconsistent and sometimes buggy directions:

   101 files changed, 455 insertions(+), 1328 deletions(-)

  this went through a number of review and test iterations before it was
  committed, it was tested on various architectures, was exposed to
  linux-next for quite some time - nevertheless it might cause problems
  on architectures that don't read the mailing lists and don't regularly
  test linux-next.

  This cat herding excercise was motivated by the -rt kernel, and was
  brought to you by Thomas "the Whip" Gleixner."

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  idle: Remove GENERIC_IDLE_LOOP config switch
  um: Use generic idle loop
  ia64: Make sure interrupts enabled when we "safe_halt()"
  sparc: Use generic idle loop
  idle: Remove unused ARCH_HAS_DEFAULT_IDLE
  bfin: Fix typo in arch_cpu_idle()
  xtensa: Use generic idle loop
  x86: Use generic idle loop
  unicore: Use generic idle loop
  tile: Use generic idle loop
  tile: Enter idle with preemption disabled
  sh: Use generic idle loop
  score: Use generic idle loop
  s390: Use generic idle loop
  powerpc: Use generic idle loop
  parisc: Use generic idle loop
  openrisc: Use generic idle loop
  mn10300: Use generic idle loop
  mips: Use generic idle loop
  microblaze: Use generic idle loop
  ...
2013-04-30 07:50:17 -07:00