kvm_set_irq() has an internal buffer of three irq routing entries, allowing
connecting a GSI to three IRQ chips or on MSI. However setup_routing_entry()
does not properly enforce this, allowing three irqchip routes followed by
an MSI route to overflow the buffer.
Fix by ensuring that an MSI entry is added to an empty list.
Signed-off-by: Avi Kivity <avi@redhat.com>
lpage_info is created for each large level even when the memory slot is
not for RAM. This means that when we add one slot for a PCI device, we
end up allocating at least KVM_NR_PAGE_SIZES - 1 pages by vmalloc().
To make things worse, there is an increasing number of devices which
would result in more pages being wasted this way.
This patch mitigates this problem by using kvm_kvzalloc().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Will be used for lpage_info allocation later.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements the directed yield hypercall found on other
System z hypervisors. It delegates execution time to the virtual cpu
specified in the instruction's parameter.
Useful to avoid long spinlock waits in the guest.
Christian Borntraeger: moved common code in virt/kvm/
Signed-off-by: Konstantin Weitz <WEITZKON@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently, MSI messages can only be injected to in-kernel irqchips by
defining a corresponding IRQ route for each message. This is not only
unhandy if the MSI messages are generated "on the fly" by user space,
IRQ routes are a limited resource that user space has to manage
carefully.
By providing a direct injection path, we can both avoid using up limited
resources and simplify the necessary steps for user land.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Merge reason: development work has dependency on kvm patches merged
upstream.
Conflicts:
Documentation/feature-removal-schedule.txt
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
As pointed out by Jason Baron, when assigning a device to a guest
we first set the iommu domain pointer, which enables mapping
and unmapping of memory slots to the iommu. This leaves a window
where this path is enabled, but we haven't synchronized the iommu
mappings to the existing memory slots. Thus a slot being removed
at that point could send us down unexpected code paths removing
non-existent pinnings and iommu mappings. Take the slots_lock
around creating the iommu domain and initial mappings as well as
around iommu teardown to avoid this race.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Intel spec says that TMR needs to be set/cleared
when IRR is set, but kvm also clears it on EOI.
I did some tests on a real (AMD based) system,
and I see same TMR values both before
and after EOI, so I think it's a minor bug in kvm.
This patch fixes TMR to be set/cleared on IRR set
only as per spec.
And now that we don't clear TMR, we can save
an atomic read of TMR on EOI that's not propagated
to ioapic, by checking whether ioapic needs
a specific vector first and calculating
the mode afterwards.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We've been adding new mappings, but not destroying old mappings.
This can lead to a page leak as pages are pinned using
get_user_pages, but only unpinned with put_page if they still
exist in the memslots list on vm shutdown. A memslot that is
destroyed while an iommu domain is enabled for the guest will
therefore result in an elevated page reference count that is
never cleared.
Additionally, without this fix, the iommu is only programmed
with the first translation for a gpa. This can result in
peer-to-peer errors if a mapping is destroyed and replaced by a
new mapping at the same gpa as the iommu will still be pointing
to the original, pinned memory address.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Now that we do neither double buffering nor heuristic selection of the
write protection method these are not needed anymore.
Note: some drivers have their own implementation of set_bit_le() and
making it generic needs a bit of work; so we use test_and_set_bit_le()
and will later replace it with generic set_bit_le().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
S390's kvm_vcpu_stat does not contain halt_wakeup member.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The kvm_vcpu_kick function performs roughly the same funcitonality on
most all architectures, so we shouldn't have separate copies.
PowerPC keeps a pointer to interchanging waitqueues on the vcpu_arch
structure and to accomodate this special need a
__KVM_HAVE_ARCH_VCPU_GET_WQ define and accompanying function
kvm_arch_vcpu_wq have been defined. For all other architectures this
is a generic inline that just returns &vcpu->wq;
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch makes the kvm_io_range array can be resized dynamically.
Signed-off-by: Amos Kong <akong@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
As kvm_notify_acked_irq calls kvm_assigned_dev_ack_irq under
rcu_read_lock, we cannot use a mutex in the latter function. Switch to a
spin lock to address this.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Using 'int' type is not suitable for a 'long' object. So, correct it.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share legacy IRQs of such devices with other host devices
when passing them to a guest.
The new IRQ sharing feature introduced here is optional, user space has
to request it explicitly. Moreover, user space can inform us about its
view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
interrupt and signaling it if the guest masked it via the virtualized
PCI config space.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If some vcpus are created before KVM_CREATE_IRQCHIP, then
irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading
to potential NULL pointer dereferences.
Fix by:
- ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called
- ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP
This is somewhat long winded because vcpu->arch.apic is created without
kvm->lock held.
Based on earlier patch by Michael Ellerman.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
Other threads may process the same page in that small window and skip
TLB flush and then return before these functions do flush.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Some members of kvm_memory_slot are not used by every architecture.
This patch is the first step to make this difference clear by
introducing kvm_memory_slot::arch; lpage_info is moved into it.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Narrow down the controlled text inside the conditional so that it will
include lpage_info and rmap stuff only.
For this we change the way we check whether the slot is being created
from "if (npages && !new.rmap)" to "if (npages && !old.npages)".
We also stop checking if lpage_info is NULL when we create lpage_info
because we do it from inside the slot creation code block.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This makes it easy to make lpage_info architecture specific.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch cleans up the code and removes the "(void)level;" warning
suppressor.
Note that we can also use this for PT_PAGE_TABLE_LEVEL to treat every
level uniformly later.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This moves __gfn_to_memslot() and search_memslots() from kvm_main.c to
kvm_host.h to reduce the code duplication caused by the need for
non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c to call
gfn_to_memslot() in real mode.
Rather than putting gfn_to_memslot() itself in a header, which would
lead to increased code size, this puts __gfn_to_memslot() in a header.
Then, the non-modular uses of gfn_to_memslot() are changed to call
__gfn_to_memslot() instead. This way there is only one place in the
source code that needs to be changed should the gfn_to_memslot()
implementation need to be modified.
On powerpc, the Book3S HV style of KVM has code that is called from
real mode which needs to call gfn_to_memslot() and thus needs this.
(Module code is allocated in the vmalloc region, which can't be
accessed in real mode.)
With this, we can remove builtin_gfn_to_memslot() from book3s_hv_rm_mmu.c.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
find_index_from_host_irq returns 0 on error
but callers assume < 0 on error. This should
not matter much: an out of range irq should never happen since
irq handler was registered with this irq #,
and even if it does we get a spurious msix irq in guest
and typically nothing terrible happens.
Still, better to make it consistent.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This adds an smp_wmb in kvm_mmu_notifier_invalidate_range_end() and an
smp_rmb in mmu_notifier_retry() so that mmu_notifier_retry() will give
the correct answer when called without kvm->mmu_lock being held.
PowerPC Book3S HV KVM wants to use a bitlock per guest page rather than
a single global spinlock in order to improve the scalability of updates
to the guest MMU hashed page table, and so needs this.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch exports the s390 SIE hardware control block to userspace
via the mapping of the vcpu file descriptor. In order to do so,
a new arch callback named kvm_arch_vcpu_fault is introduced for all
architectures. It allows to map architecture specific pages.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch introduces a new config option for user controlled kernel
virtual machines. It introduces a parameter to KVM_CREATE_VM that
allows to set bits that alter the capabilities of the newly created
virtual machine.
The parameter is passed to kvm_arch_init_vm for all architectures.
The only valid modifier bit for now is KVM_VM_S390_UCONTROL.
This requires CAP_SYS_ADMIN privileges and creates a user controlled
virtual machine on s390 architectures.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It is possible that the __set_bit() in mark_page_dirty() is called
simultaneously on the same region of memory, which may result in only
one bit being set, because some callers do not take mmu_lock before
mark_page_dirty().
This problem is hard to produce because when we reach mark_page_dirty()
beginning from, e.g., tdp_page_fault(), mmu_lock is being held during
__direct_map(): making kvm-unit-tests' dirty log api test write to two
pages concurrently was not useful for this reason.
So we have confirmed that there can actually be race condition by
checking if some callers really reach there without holding mmu_lock
using spin_is_locked(): probably they were from kvm_write_guest_page().
To fix this race, this patch changes the bit operation to the atomic
version: note that nr_dirty_pages also suffers from the race but we do
not need exactly correct numbers for now.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.
It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
by checking the return value from kvm_init_debug, we
can ensure that the entries under debugfs for KVM have
been created correctly.
Signed-off-by: Yang Bai <hamo.by@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Drop bsp_vcpu pointer from kvm struct since its only use is incorrect
anyway.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Switch to using memdup_user when possible. This makes code more
smaller and compact, and prevents errors.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Switch to kmemdup() in two places to shorten the code and avoid possible bugs.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
consistent with ioapic_mmio_read, which also allows byte and word accesses.
Signed-off-by: Julian Stecklina <js@alien8.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
The operation of getting dirty log is frequent when framebuffer-based
displays are used(for example, Xwindow), so, we introduce a mapping table
to speed up id_to_memslot()
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Sort memslots base on its size and use line search to find it, so that the
larger memslots have better fit
The idea is from Avi
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce id_to_memslot to get memslot by slot id
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce kvm_for_each_memslot to walk all valid memslot
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce update_memslots to update slot which will be update to
kvm->memslots
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Needed for the next patch which uses this number to decide how to write
protect a slot.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use kmemdup rather than duplicating its implementation
The semantic patch that makes this change is available
in scripts/coccinelle/api/memdup.cocci.
More information about semantic patching is available at
http://coccinelle.lip6.fr/
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
My testing version of Smatch complains that addr and len come from
the user and they can wrap. The path is:
-> kvm_vm_ioctl()
-> kvm_vm_ioctl_unregister_coalesced_mmio()
-> coalesced_mmio_in_range()
I don't know what the implications are of wrapping here, but we may
as well fix it, if only to silence the warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Only allow KVM device assignment to attach to devices which:
- Are not bridges
- Have BAR resources (assume others are special devices)
- The user has permissions to use
Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers. We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain. We determine "permission to use" by testing whether the user
has access to the PCI sysfs resource files. By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.
[Yang Bai: add missing #include]
[avi: fix comment style]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yang Bai <hamo.by@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This option has no users and it exposes a security hole that we
can allow devices to be assigned without iommu protection. Make
KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When mapping a memory region, split it to page sizes as supported
by the iommu hardware. Always prefer bigger pages, when possible,
in order to reduce the TLB pressure.
The logic to do that is now added to the IOMMU core, so neither the iommu
drivers themselves nor users of the IOMMU API have to duplicate it.
This allows a more lenient granularity of mappings; traditionally the
IOMMU API took 'order' (of a page) as a mapping size, and directly let
the low level iommu drivers handle the mapping, but now that the IOMMU
core can split arbitrary memory regions into pages, we can remove this
limitation, so users don't have to split those regions by themselves.
Currently the supported page sizes are advertised once and they then
remain static. That works well for OMAP and MSM but it would probably
not fly well with intel's hardware, where the page size capabilities
seem to have the potential to be different between several DMA
remapping devices.
register_iommu() currently sets a default pgsize behavior, so we can convert
the IOMMU drivers in subsequent patches. After all the drivers
are converted, the temporary default settings will be removed.
Mainline users of the IOMMU API (kvm and omap-iovmm) are adopted
to deal with bytes instead of page order.
Many thanks to Joerg Roedel <Joerg.Roedel@amd.com> for significant review!
Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
Cc: David Brown <davidb@codeaurora.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <Joerg.Roedel@amd.com>
Cc: Stepan Moskovchenko <stepanm@codeaurora.org>
Cc: KyongHo Cho <pullip.cho@samsung.com>
Cc: Hiroshi DOYU <hdoyu@nvidia.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This file has things like module_param_named() and MODULE_PARM_DESC()
so it needs the full module.h header present. Without it, you'll get:
CC arch/x86/kvm/../../../virt/kvm/iommu.o
virt/kvm/iommu.c:37: error: expected ‘)’ before ‘bool’
virt/kvm/iommu.c:39: error: expected ‘)’ before string constant
make[3]: *** [arch/x86/kvm/../../../virt/kvm/iommu.o] Error 1
make[2]: *** [arch/x86/kvm] Error 2
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This was coming in via an implicit module.h (and its sub-includes)
before, but we'll be cleaning that up shortly. Call out the stat.h
include requirement in advance.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
iommu/core: Remove global iommu_ops and register_iommu
iommu/msm: Use bus_set_iommu instead of register_iommu
iommu/omap: Use bus_set_iommu instead of register_iommu
iommu/vt-d: Use bus_set_iommu instead of register_iommu
iommu/amd: Use bus_set_iommu instead of register_iommu
iommu/core: Use bus->iommu_ops in the iommu-api
iommu/core: Convert iommu_found to iommu_present
iommu/core: Add bus_type parameter to iommu_domain_alloc
Driver core: Add iommu_ops to bus_type
iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
iommu/amd: Fix wrong shift direction
iommu/omap: always provide iommu debug code
iommu/core: let drivers know if an iommu fault handler isn't installed
iommu/core: export iommu_set_fault_handler()
iommu/omap: Fix build error with !IOMMU_SUPPORT
iommu/omap: Migrate to the generic fault report mechanism
iommu/core: Add fault reporting mechanism
iommu/core: Use PAGE_SIZE instead of hard-coded value
iommu/core: use the existing IS_ALIGNED macro
iommu/msm: ->unmap() should return order of unmapped page
...
Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to
dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config
options" just happened to touch lines next to each other.
* 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (75 commits)
KVM: SVM: Keep intercepting task switching with NPT enabled
KVM: s390: implement sigp external call
KVM: s390: fix register setting
KVM: s390: fix return value of kvm_arch_init_vm
KVM: s390: check cpu_id prior to using it
KVM: emulate lapic tsc deadline timer for guest
x86: TSC deadline definitions
KVM: Fix simultaneous NMIs
KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode
KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode
KVM: x86 emulator: streamline decode of segment registers
KVM: x86 emulator: simplify OpMem64 decode
KVM: x86 emulator: switch src decode to decode_operand()
KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack
KVM: x86 emulator: switch OpImmUByte decode to decode_imm()
KVM: x86 emulator: free up some flag bits near src, dst
KVM: x86 emulator: switch src2 to generic decode_operand()
KVM: x86 emulator: expand decode flags to 64 bits
KVM: x86 emulator: split dst decode to a generic decode_operand()
KVM: x86 emulator: move memop, memopp into emulation context
...
With per-bus iommu_ops the iommu_found function needs to
work on a bus_type too. This patch adds a bus_type parameter
to that function and converts all call-places.
The function is also renamed to iommu_present because the
function now checks if an iommu is present for a given bus
and does not check for a global iommu anymore.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This is necessary to store a pointer to the bus-specific
iommu_ops in the iommu-domain structure. It will be used
later to call into bus-specific iommu-ops.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The threaded IRQ handler for MSI-X has almost nothing in common with the
INTx/MSI handler. Move its code into a dedicated handler.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We only perform work in kvm_assigned_dev_ack_irq if the guest IRQ is of
INTx type. This completely avoids the callback invocation in non-INTx
cases by registering the IRQ ack notifier only for INTx.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently the method of dealing with an IO operation on a bus (PIO/MMIO)
is to call the read or write callback for each device registered
on the bus until we find a device which handles it.
Since the number of devices on a bus can be significant due to ioeventfds
and coalesced MMIO zones, this leads to a lot of overhead on each IO
operation.
Instead of registering devices, we now register ranges which points to
a device. Lookup is done using an efficient bsearch instead of a linear
search.
Performance test was conducted by comparing exit count per second with
200 ioeventfds created on one byte and the guest is trying to access a
different byte continuously (triggering usermode exits).
Before the patch the guest has achieved 259k exits per second, after the
patch the guest does 274k exits per second.
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch changes coalesced mmio to create one mmio device per
zone instead of handling all zones in one device.
Doing so enables us to take advantage of existing locking and prevents
a race condition between coalesced mmio registration/unregistration
and lookups.
Suggested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Move the check whether there are available entries to within the spinlock.
This allows working with larger amount of VCPUs and reduces premature
exits when using a large number of VCPUs.
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Device drivers that create and destroy SR-IOV virtual functions via
calls to pci_enable_sriov() and pci_disable_sriov can cause catastrophic
failures if they attempt to destroy VFs while they are assigned to
guest virtual machines. By adding a flag for use by the KVM module
to indicate that a device is assigned a device driver can check that
flag and avoid destroying VFs while they are assigned and avoid system
failures.
CC: Ian Campbell <ijc@hellion.org.uk>
CC: Konrad Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
IOMMU interrupt remapping support provides a further layer of
isolation for device assignment by preventing arbitrary interrupt
block DMA writes by a malicious guest from reaching the host. By
default, we should require that the platform provides interrupt
remapping support, with an opt-in mechanism for existing behavior.
Both AMD IOMMU and Intel VT-d2 hardware support interrupt
remapping, however we currently only have software support on
the Intel side. Users wishing to re-enable device assignment
when interrupt remapping is not supported on the platform can
use the "allow_unsafe_assigned_interrupts=1" module option.
[avi: break long lines]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The idea is from Avi:
| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)
When the page fault is caused by mmio, we cache the info in the shadow page
table, and also set the reserved bits in the shadow page table, so if the mmio
is caused again, we can quickly identify it and emulate it directly
Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
can be reduced by this feature, and also avoid walking guest page table for
soft mmu.
[jan: fix operator precedence issue]
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the page fault is caused by mmio, the gfn can not be found in memslots, and
'bad_pfn' is returned on gfn_to_hva path, so we can use 'bad_pfn' to identify
the mmio page fault.
And, to clarify the meaning of mmio pfn, we return fault page instead of bad
page when the gfn is not allowd to prefetch
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce kvm_read_guest_cached() function in addition to write one we
already have.
[ by glauber: export function signature in kvm header ]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Eric Munson <emunson@mgebm.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM_MAX_MSIX_PER_DEV implies that up to that many MSI-X entries can be
requested. But the kernel so far rejected already the upper limit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM has an ioctl to define which signal mask should be used while running
inside VCPU_RUN. At least for big endian systems, this mask is different
on 32-bit and 64-bit systems (though the size is identical).
Add a compat wrapper that converts the mask to whatever the kernel accepts,
allowing 32-bit kvm user space to set signal masks.
This patch fixes qemu with --enable-io-thread on ppc64 hosts when running
32-bit user land.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if
it fails. Move this confusing resonsibility back into the hands of
kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected,
all other archs cannot fail.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Simply use __copy_to_user/__clear_user to write guest page since we have
already verified the user address when the memslot is set
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* 'kvm-updates/3.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Initialize kvm before registering the mmu notifier
KVM: x86: use proper port value when checking io instruction permission
KVM: add missing void __user * cast to access_ok() call
It doesn't make sense to ever see a half-initialized kvm structure on
mmu notifier callbacks. Previously, 85722cda changed the ordering to
ensure that the mmu_lock was initialized before mmu notifier
registration, but there is still a race where the mmu notifier could
come in and try accessing other portions of struct kvm before they are
intialized.
Solve this by moving the mmu notifier registration to occur after the
structure is completely initialized.
Google-Bug-Id: 452199
Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
fa3d315a "KVM: Validate userspace_addr of memslot when registered" introduced
this new warning onn s390:
kvm_main.c: In function '__kvm_set_memory_region':
kvm_main.c:654:7: warning: passing argument 1 of '__access_ok' makes pointer from integer without a cast
arch/s390/include/asm/uaccess.h:53:19: note: expected 'const void *' but argument is of type '__u64'
Add the missing cast to get rid of it again...
Cc: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (27 commits)
PCI: Don't use dmi_name_in_vendors in quirk
PCI: remove unused AER functions
PCI/sysfs: move bus cpuaffinity to class dev_attrs
PCI: add rescan to /sys/.../pci_bus/.../
PCI: update bridge resources to get more big ranges when allocating space (again)
KVM: Use pci_store/load_saved_state() around VM device usage
PCI: Add interfaces to store and load the device saved state
PCI: Track the size of each saved capability data area
PCI/e1000e: Add and use pci_disable_link_state_locked()
x86/PCI: derive pcibios_last_bus from ACPI MCFG
PCI: add latency tolerance reporting enable/disable support
PCI: add OBFF enable/disable support
PCI: add ID-based ordering enable/disable support
PCI hotplug: acpiphp: assume device is in state D0 after powering on a slot.
PCI: Set PCIE maxpayload for card during hotplug insertion
PCI/ACPI: Report _OSC control mask returned on failure to get control
x86/PCI: irq and pci_ids patch for Intel Panther Point DeviceIDs
PCI: handle positive error codes
PCI: check pci_vpd_pci22_wait() return
PCI: Use ICH6_GPIO_EN in ich6_lpc_acpi_gpio
...
Fix up trivial conflicts in include/linux/pci_ids.h: commit a6e5e2be44
moved the intel SMBUS ID definitons to the i2c-i801.c driver.
This way, we can avoid checking the user space address many times when
we read the guest memory.
Although we can do the same for write if we check which slots are
writable, we do not care write now: reading the guest memory happens
more often than writing.
[avi: change VERIFY_READ to VERIFY_WRITE]
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Function ioapic_debug() in the ioapic_deliver() misnames
one filed by reference. This patch correct it.
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Store the device saved state so that we can reload the device back
to the original state when it's unassigned. This has the benefit
that the state survives across pci_reset_function() calls via
the PCI sysfs reset interface while the VM is using the device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
We can get memslot id from memslot->id directly
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If asynchronous hva_to_pfn() is requested call GUP with FOLL_NOWAIT to
avoid sleeping on IO. Check for hwpoison is done at the same time,
otherwise check_user_page_hwpoison() will call GUP again and will put
vcpu to sleep.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
irqfd in kvm used flush_work incorrectly: it assumed that work scheduled
previously can't run after flush_work, but since kvm uses a non-reentrant
workqueue (by means of schedule_work) we need flush_work_sync to get that
guarantee.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Jean-Philippe Menil <jean-philippe.menil@univ-nantes.fr>
Tested-by: Jean-Philippe Menil <jean-philippe.menil@univ-nantes.fr>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'syscore' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
Introduce ARCH_NO_SYSDEV_OPS config option (v2)
cpufreq: Use syscore_ops for boot CPU suspend/resume (v2)
KVM: Use syscore_ops instead of sysdev class and sysdev
PCI / Intel IOMMU: Use syscore_ops instead of sysdev class and sysdev
timekeeping: Use syscore_ops instead of sysdev class and sysdev
x86: Use syscore_ops instead of sysdev classes and sysdevs
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
asm-generic/bitops/le.h is only intended to be included directly from
asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h
which implements generic ext2 or minix bit operations.
This stops including asm-generic/bitops/le.h directly and use ext2
non-atomic bit operations instead.
It seems odd to use ext2_set_bit() on kvm, but it will replaced with
__set_bit_le() after introducing little endian bit operations for all
architectures. This indirect step is necessary to maintain bisectability
for some architectures which have their own little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KVM uses a sysdev class and a sysdev for executing kvm_suspend()
after interrupts have been turned off on the boot CPU (during system
suspend) and for executing kvm_resume() before turning on interrupts
on the boot CPU (during system resume). However, since both of these
functions ignore their arguments, the entire mechanism may be
replaced with a struct syscore_ops object which is simpler.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Avi Kivity <avi@redhat.com>
The RCU use in kvm_irqfd_deassign is tricky: we have rcu_assign_pointer
but no synchronize_rcu: synchronize_rcu is done by kvm_irq_routing_update
which we share a spinlock with.
Fix up a comment in an attempt to make this clearer.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to get
another VCPU in the same KVM guest to run sooner.
This seems to give a 10-15% speedup in certain workloads.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Keep track of which task is running a KVM vcpu. This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.
Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single "run" of the vcpu.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
is_hwpoison_address only checks whether the page table entry is
hwpoisoned, regardless the memory page mapped. While __get_user_pages
will check both.
QEMU will clear the poisoned page table entry (via unmap/map) to make
it possible to allocate a new memory page for the virtual address
across guest rebooting. But it is also possible that the underlying
memory page is kept poisoned even after the corresponding page table
entry is cleared, that is, a new memory page can not be allocated.
__get_user_pages can catch these situations.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Now, we have 'vcpu->mode' to judge whether need to send ipi to other
cpus, this way is very exact, so checking request bit is needless,
then we can drop the spinlock let it's collateral
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently we keep track of only two states: guest mode and host
mode. This patch adds an "exiting guest mode" state that tells
us that an IPI will happen soon, so unless we need to wait for the
IPI, we can avoid it completely.
Also
1: No need atomically to read/write ->mode in vcpu's thread
2: reorganize struct kvm_vcpu to make ->mode and ->requests
in the same cache line explicitly
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Get rid of this warning:
CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
arch/s390/kvm/../../../virt/kvm/kvm_main.c:596:12: warning: 'kvm_create_dirty_bitmap' defined but not used
The only caller of the function is within a !CONFIG_S390 section, so add the
same ifdef around kvm_create_dirty_bitmap() as well.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Instead, drop large mappings, which were the reason we dropped shadow.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cleanup some code with common compound_trans_head helper.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For GRU and EPT, we need gup-fast to set referenced bit too (this is why
it's correct to return 0 when shadow_access_mask is zero, it requires
gup-fast to set the referenced bit). qemu-kvm access already sets the
young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow
paging EPT minor fault we relay on gup-fast to signal the page is in
use...
We also need to check the young bits on the secondary pagetables for NPT
and not nested shadow mmu as the data may never get accessed again by the
primary pte.
Without this closer accuracy, we'd have to remove the heuristic that
avoids collapsing hugepages in hugepage virtual regions that have not even
a single subpage in use.
->test_young is full backwards compatible with GRU and other usages that
don't have young bits in pagetables set by the hardware and that should
nuke the secondary mmu mappings when ->clear_flush_young runs just like
EPT does.
Removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page completely isn't so bad either probably but
I thought it was worth it and this makes it reliable.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This should work for both hugetlbfs and transparent hugepages.
[akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since vmx blocks INIT signals, we disable virtualization extensions during
reboot. This leads to virtualization instructions faulting; we trap these
faults and spin while the reboot continues.
Unfortunately spinning on a non-preemptible kernel may block a task that
reboot depends on; this causes the reboot to hang.
Fix by skipping over the instruction and hoping for the best.
Signed-off-by: Avi Kivity <avi@redhat.com>
Quote from Avi:
| I don't think we need to flush immediately; set a "tlb dirty" bit somewhere
| that is cleareded when we flush the tlb. kvm_mmu_notifier_invalidate_page()
| can consult the bit and force a flush if set.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>