Add an MMIO handling framework to the VGIC emulation:
Each register is described by its offset, size (or number of bits per
IRQ, if applicable) and the read/write handler functions. We provide
initialization macros to describe each GIC register later easily.
Separate dispatch functions for read and write accesses are connected
to the kvm_io_bus framework and binary-search for the responsible
register handler based on the offset address within the region.
We convert the incoming data (referenced by a pointer) to the host's
endianess and use pass-by-value to hand the data over to the actual
handler functions.
The register handler prototype and the endianess conversion are
courtesy of Christoffer Dall.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Tell KVM whether a particular VCPU has an IRQ that needs handling
in the guest. This is used to decide whether a VCPU is runnable.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
As the GICv3 virtual interface registers differ from their GICv2
siblings, we need different handlers for processing maintenance
interrupts and reading/writing to the LRs.
Implement the respective handler functions and connect them to
existing code to be called if the host is using a GICv3.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Processing maintenance interrupts and accessing the list registers
are dependent on the host's GIC version.
Introduce vgic-v2.c to contain GICv2 specific functions.
Implement the GICv2 specific code for syncing the emulation state
into the VGIC registers.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Implement the framework for syncing IRQs between our emulation and
the list registers, which represent the guest's view of IRQs.
This is done in kvm_vgic_flush_hwstate and kvm_vgic_sync_hwstate,
which gets called on guest entry and exit.
The code talking to the actual GICv2/v3 hardware is added in the
following patches.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Adds the sorting function to cover the case where you have more IRQs
to consider than you have LRs. We now consider priorities.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Provide a vgic_queue_irq_unlock() function which decides whether a
given IRQ needs to be queued to a VCPU's ap_list.
This should be called whenever an IRQ becomes pending or enabled,
either as a result of userspace injection, from in-kernel emulated
devices like the architected timer or from MMIO accesses to the
distributor emulation.
Also provides the necessary functions to allow userland to inject an
IRQ to a guest.
Since this is the first code that starts using our locking mechanism, we
add some (hopefully) clear documentation of our locking strategy and
requirements along with this patch.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
The new VGIC implementation centers around a struct vgic_irq instance
per virtual IRQ.
Provide a function to retrieve the right instance for a given IRQ
number and (in case of private interrupts) the right VCPU.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
As (some) GICv3 hosts can emulate a GICv2, some GICv2 specific masks
for the list register definition also apply to GICv3 LRs.
At the moment we have those definitions in the KVM VGICv3
implementation, so let's move them into the GICv3 header file to
have them automatically defined.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Currently the PMU uses a member of the struct vgic_dist directly,
which not only breaks abstraction, but will fail with the new VGIC.
Abstract this access in the VGIC header file and refactor the validity
check in the PMU code.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
When the kernel was handling a guest MMIO read access internally, we
need to copy the emulation result into the run->mmio structure in order
for the kvm_handle_mmio_return() function to pick it up and inject the
result back into the guest.
Currently the only user of kvm_io_bus for ARM is the VGIC, which did
this copying itself, so this was not causing issues so far.
But with the upcoming new vgic implementation we need this done
properly.
Update the kvm_handle_mmio_return description and cleanup the code to
only perform a single copying when needed.
Code and commit message inspired by Andre Przywara.
Reported-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
The number of list registers is a property of the underlying system, not
of emulated VGIC CPU interface.
As we are about to move this variable to global state in the new vgic
for clarity, move it from the legacy implementation as well to make the
merge of the new code easier.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
We are about to modify the VGIC to allocate all data structures
dynamically and store mapped IRQ information on a per-IRQ struct, which
is indeed allocated dynamically at init time.
Therefore, we cannot record the mapped IRQ info from the timer at timer
reset time like it's done now, because VCPU reset happens before timer
init.
A possible later time to do this is on the first run of a per VCPU, it
just requires us to move the enable state to be a per-VCPU state and do
the lookup of the physical IRQ number when we are about to run the VCPU.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Now that the virtual arch timer does not care about the irq_phys_map
anymore, let's rework kvm_vgic_map_phys_irq() to return an error
value instead. Any reference to that mapping can later be done by
passing the correct combination of VCPU and virtual IRQ number.
This makes the irq_phys_map handling completely private to the
VGIC code.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Now that the interface between the arch timer and the VGIC does not
require passing the irq_phys_map entry pointer anymore, let's remove
it from the virtual arch timer and use the virtual IRQ number instead
directly.
The remaining pointer returned by kvm_vgic_map_phys_irq() will be
removed in the following patch.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
The communication of a Linux IRQ number from outside the VGIC to the
vgic was a leftover from the day when the vgic code cared about how a
particular device injects virtual interrupts mapped to a physical
interrupt.
We can safely remove this notion, leaving all physical IRQ handling to
be done in the device driver (the arch timer in this case), which makes
room for a saner API for the new VGIC.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
kvm_vgic_unmap_phys_irq() only needs the virtual IRQ number, so let's
just pass that between the arch timer and the VGIC to get rid of
the irq_phys_map pointer.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
For getting the active state of a mapped IRQ, we actually only need
the virtual IRQ number, not the pointer to the mapping entry.
Pass the virtual IRQ number from the arch timer to the VGIC directly.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
When we want to inject a hardware mapped IRQ into a guest, we actually
only need the virtual IRQ number from the irq_phys_map.
So let's pass this number directly from the arch timer to the VGIC
to avoid using the map as a parameter.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
We actually don't use the irq_phys_map parameter in
vgic_update_irq_pending(), so let's just remove it.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
AVIC has a use for kvm_vcpu_wake_up.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
commit 3491caf275 ("KVM: halt_polling: provide a way to qualify
wakeups during poll") added more aggressive shrinking of the
polling interval if the wakeup did not match some criteria. This
still allows to keep polling enabled if the polling time was
smaller that the current max poll time (block_ns <= vcpu->halt_poll_ns).
Performance measurement shows that even more aggressive shrinking
(shrink polling on any invalid wakeup) reduces absolute and relative
(to the workload) CPU usage even further.
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.
For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.
This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.
This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If we don't support a mechanism for bypassing IRQs, don't register as
a consumer. This eliminates meaningless dev_info()s when the connect
fails between producer and consumer, such as on AMD systems where
kvm_x86_ops->update_pi_irte is not implemented
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A NULL token is meaningless and can only lead to unintended problems.
Error on registration with a NULL token, ignore de-registrations with
a NULL token.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
also the upper limit for vCPU ids. This is okay for all archs except PowerPC
which can have higher ids, depending on the cpu/core/thread topology. In the
worst case (single threaded guest, host with 8 threads per core), it limits
the maximum number of vCPUS to KVM_MAX_VCPUS / 8.
This patch separates the vCPU numbering from the total number of vCPUs, with
the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
plus one.
The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
before passing them to KVM_CREATE_VCPU.
This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
Other archs continue to return KVM_MAX_VCPUS instead.
Suggested-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the firmware tables are parsed 2 times: once in the GIC
drivers, the other time when initializing the vGIC. It means code
duplication and make more tedious to add the support for another
firmware table (like ACPI).
Use the recently introduced helper gic_get_kvm_info() to get
information about the virtual GIC.
With this change, the virtual GIC becomes agnostic to the firmware
table and KVM will be able to initialize the vGIC on ACPI.
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The firmware table is currently parsed by the virtual timer code in
order to retrieve the virtual timer interrupt. However, this is already
done by the arch timer driver.
To avoid code duplication, use the newly function arch_timer_get_kvm_info()
which return all the information required by the virtual timer code.
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
On a host that runs NTP, corrections can have a direct impact on
the background timer that we program on the behalf of a vcpu.
In particular, NTP performing a forward correction will result in
a timer expiring sooner than expected from a guest point of view.
Not a big deal, we kick the vcpu anyway.
But on wake-up, the vcpu thread is going to perform a check to
find out whether or not it should block. And at that point, the
timer check is going to say "timer has not expired yet, go back
to sleep". This results in the timer event being lost forever.
There are multiple ways to handle this. One would be record that
the timer has expired and let kvm_cpu_has_pending_timer return
true in that case, but that would be fairly invasive. Another is
to check for the "short sleep" condition in the hrtimer callback,
and restart the timer for the remaining time when the condition
is detected.
This patch implements the latter, with a bit of refactoring in
order to avoid too much code duplication.
Cc: <stable@vger.kernel.org>
Reported-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The kernel is written in C, not python, so we need braces around
multi-line if statements. GCC 6 actually warns about this, thanks to the
fantastic new "-Wmisleading-indentation" flag:
| virt/kvm/arm/pmu.c: In function ‘kvm_pmu_overflow_status’:
| virt/kvm/arm/pmu.c:198:3: warning: statement is indented as if it were guarded by... [-Wmisleading-indentation]
| reg &= vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
| ^~~
| arch/arm64/kvm/../../../virt/kvm/arm/pmu.c:196:2: note: ...this ‘if’ clause, but it is not
| if ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E))
| ^~
As it turns out, this particular case is harmless (we just do some &=
operations with 0), but worth fixing nonetheless.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
smp_load_acquire() is enough here and it's cheaper than smp_mb().
Adding a comment about reusing memory barrier of kvm_make_all_cpus_request()
here to keep order between modifications to the page tables and reading mode.
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Moving the initialization earlier is needed in 4.6 because
kvm_arch_init_vm is now using mmu_lock, causing lockdep to
complain:
[ 284.440294] INFO: trying to register non-static key.
[ 284.445259] the code is fine but needs lockdep annotation.
[ 284.450736] turning off the locking correctness validator.
...
[ 284.528318] [<ffffffff810aecc3>] lock_acquire+0xd3/0x240
[ 284.533733] [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
[ 284.541467] [<ffffffff81715581>] _raw_spin_lock+0x41/0x80
[ 284.546960] [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
[ 284.554707] [<ffffffffa0305aa0>] kvm_page_track_register_notifier+0x20/0x60 [kvm]
[ 284.562281] [<ffffffffa02ece70>] kvm_mmu_init_vm+0x20/0x30 [kvm]
[ 284.568381] [<ffffffffa02dbf7a>] kvm_arch_init_vm+0x1ea/0x200 [kvm]
[ 284.574740] [<ffffffffa02bff3f>] kvm_dev_ioctl+0xbf/0x4d0 [kvm]
However, it also helps fixing a preexisting problem, which is why this
patch is also good for stable kernels: kvm_create_vm was incrementing
current->mm->mm_count but not decrementing it at the out_err label (in
case kvm_init_mmu_notifier failed). The new initialization order makes
it possible to add the required mmdrop without adding a new error label.
Cc: stable@vger.kernel.org
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86 protection key support from Ingo Molnar:
"This tree adds support for a new memory protection hardware feature
that is available in upcoming Intel CPUs: 'protection keys' (pkeys).
There's a background article at LWN.net:
https://lwn.net/Articles/643797/
The gist is that protection keys allow the encoding of
user-controllable permission masks in the pte. So instead of having a
fixed protection mask in the pte (which needs a system call to change
and works on a per page basis), the user can map a (handful of)
protection mask variants and can change the masks runtime relatively
cheaply, without having to change every single page in the affected
virtual memory range.
This allows the dynamic switching of the protection bits of large
amounts of virtual memory, via user-space instructions. It also
allows more precise control of MMU permission bits: for example the
executable bit is separate from the read bit (see more about that
below).
This tree adds the MM infrastructure and low level x86 glue needed for
that, plus it adds a high level API to make use of protection keys -
if a user-space application calls:
mmap(..., PROT_EXEC);
or
mprotect(ptr, sz, PROT_EXEC);
(note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
this special case, and will set a special protection key on this
memory range. It also sets the appropriate bits in the Protection
Keys User Rights (PKRU) register so that the memory becomes unreadable
and unwritable.
So using protection keys the kernel is able to implement 'true'
PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
PROT_READ as well. Unreadable executable mappings have security
advantages: they cannot be read via information leaks to figure out
ASLR details, nor can they be scanned for ROP gadgets - and they
cannot be used by exploits for data purposes either.
We know about no user-space code that relies on pure PROT_EXEC
mappings today, but binary loaders could start making use of this new
feature to map binaries and libraries in a more secure fashion.
There is other pending pkeys work that offers more high level system
call APIs to manage protection keys - but those are not part of this
pull request.
Right now there's a Kconfig that controls this feature
(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
(like most x86 CPU feature enablement code that has no runtime
overhead), but it's not user-configurable at the moment. If there's
any serious problem with this then we can make it configurable and/or
flip the default"
* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
mm/core, x86/mm/pkeys: Add execute-only protection keys support
x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
x86/mm/pkeys: Allow kernel to modify user pkey rights register
x86/fpu: Allow setting of XSAVE state
x86/mm: Factor out LDT init from context init
mm/core, x86/mm/pkeys: Add arch_validate_pkey()
mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
x86/mm/pkeys: Add Kconfig prompt to existing config option
x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
x86/mm/pkeys: Dump PKRU with other kernel registers
mm/core, x86/mm/pkeys: Differentiate instruction fetches
x86/mm/pkeys: Optimize fault handling in access_error()
mm/core: Do not enforce PKEY permissions on remote mm access
um, pkeys: Add UML arch_*_access_permitted() methods
mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
x86/mm/gup: Simplify get_user_pages() PTE bit handling
...
but lots of architecture-specific changes.
* ARM:
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- various optimizations to the vgic save/restore code.
* PPC:
- enabled KVM-VFIO integration ("VFIO device")
- optimizations to speed up IPIs between vcpus
- in-kernel handling of IOMMU hypercalls
- support for dynamic DMA windows (DDW).
* s390:
- provide the floating point registers via sync regs;
- separated instruction vs. data accesses
- dirty log improvements for huge guests
- bugfixes and documentation improvements.
* x86:
- Hyper-V VMBus hypercall userspace exit
- alternative implementation of lowest-priority interrupts using vector
hashing (for better VT-d posted interrupt support)
- fixed guest debugging with nested virtualizations
- improved interrupt tracking in the in-kernel IOAPIC
- generic infrastructure for tracking writes to guest memory---currently
its only use is to speedup the legacy shadow paging (pre-EPT) case, but
in the future it will be used for virtual GPUs as well
- much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJW5r3BAAoJEL/70l94x66D2pMH/jTSWWwdTUJMctrDjPVzKzG0
yOzHW5vSLFoFlwEOY2VpslnXzn5TUVmCAfrdmFNmQcSw6hGb3K/xA/ZX/KLwWhyb
oZpr123ycahga+3q/ht/dFUBCCyWeIVMdsLSFwpobEBzPL0pMgc9joLgdUC6UpWX
tmN0LoCAeS7spC4TTiTTpw3gZ/L+aB0B6CXhOMjldb9q/2CsgaGyoVvKA199nk9o
Ngu7ImDt7l/x1VJX4/6E/17VHuwqAdUrrnbqerB/2oJ5ixsZsHMGzxQ3sHCmvyJx
WG5L00ubB1oAJAs9fBg58Y/MdiWX99XqFhdEfxq4foZEiQuCyxygVvq3JwZTxII=
=OUZZ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"One of the largest releases for KVM... Hardly any generic
changes, but lots of architecture-specific updates.
ARM:
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- various optimizations to the vgic save/restore code.
PPC:
- enabled KVM-VFIO integration ("VFIO device")
- optimizations to speed up IPIs between vcpus
- in-kernel handling of IOMMU hypercalls
- support for dynamic DMA windows (DDW).
s390:
- provide the floating point registers via sync regs;
- separated instruction vs. data accesses
- dirty log improvements for huge guests
- bugfixes and documentation improvements.
x86:
- Hyper-V VMBus hypercall userspace exit
- alternative implementation of lowest-priority interrupts using
vector hashing (for better VT-d posted interrupt support)
- fixed guest debugging with nested virtualizations
- improved interrupt tracking in the in-kernel IOAPIC
- generic infrastructure for tracking writes to guest
memory - currently its only use is to speedup the legacy shadow
paging (pre-EPT) case, but in the future it will be used for
virtual GPUs as well
- much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
KVM: x86: disable MPX if host did not enable MPX XSAVE features
arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
arm64: KVM: vgic-v3: Reset LRs at boot time
arm64: KVM: vgic-v3: Do not save an LR known to be empty
arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
arm64: KVM: vgic-v3: Avoid accessing ICH registers
KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
KVM: arm/arm64: vgic-v2: Reset LRs at boot time
KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
KVM: s390: allocate only one DMA page per VM
KVM: s390: enable STFLE interpretation only if enabled for the guest
KVM: s390: wake up when the VCPU cpu timer expires
KVM: s390: step the VCPU timer while in enabled wait
KVM: s390: protect VCPU cpu timer with a seqcount
KVM: s390: step VCPU cpu timer during kvm_run ioctl
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- Make schedstats a runtime tunable (disabled by default) and
optimize it via static keys.
As most distributions enable CONFIG_SCHEDSTATS=y due to its
instrumentation value, this is a nice performance enhancement.
(Mel Gorman)
- Implement 'simple waitqueues' (swait): these are just pure
waitqueues without any of the more complex features of full-blown
waitqueues (callbacks, wake flags, wake keys, etc.). Simple
waitqueues have less memory overhead and are faster.
Use simple waitqueues in the RCU code (in 4 different places) and
for handling KVM vCPU wakeups.
(Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
Marcelo Tosatti)
- sched/numa enhancements (Rik van Riel)
- NOHZ performance enhancements (Rik van Riel)
- Various sched/deadline enhancements (Steven Rostedt)
- Various fixes (Peter Zijlstra)
- ... and a number of other fixes, cleanups and smaller enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
sched/cputime: Fix steal_account_process_tick() to always return jiffies
sched/deadline: Remove dl_new from struct sched_dl_entity
Revert "kbuild: Add option to turn incompatible pointer check into error"
sched/deadline: Remove superfluous call to switched_to_dl()
sched/debug: Fix preempt_disable_ip recording for preempt_disable()
sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
time, acct: Drop irq save & restore from __acct_update_integrals()
acct, time: Change indentation in __acct_update_integrals()
sched, time: Remove non-power-of-two divides from __acct_update_integrals()
sched/rt: Kick RT bandwidth timer immediately on start up
sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
sched/debug: Move sched_domain_sysctl to debug.c
sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
sched/rt: Fix PI handling vs. sched_setscheduler()
sched/core: Remove duplicated sched_group_set_shares() prototype
sched/fair: Consolidate nohz CPU load update code
sched/fair: Avoid using decay_load_missed() with a negative value
sched/deadline: Always calculate end of period on sched_yield()
sched/cgroup: Fix cgroup entity load tracking tear-down
rcu: Use simple wait queues where possible in rcutree
...
When growing halt-polling, there is no check that the poll time exceeds
the limit. It's possible for vcpu->halt_poll_ns grow once past
halt_poll_ns, and stay there until a halt which takes longer than
vcpu->halt_poll_ns. For example, booting a Linux guest with
halt_poll_ns=11000:
... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000)
... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0)
... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000)
Signed-off-by: David Matlack <dmatlack@google.com>
Fixes: aca6ff29c4
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- Various optimizations to the vgic save/restore code
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW36xjAAoJECPQ0LrRPXpDGQkQAMDppzcTOixT3e8VPdHAX09a
Z5PO0gyTMVV7Jyz5Ul3pedPJA2GSK9mxOCwqvIFbdxLAR6ZB00juO5FrTHkSdI91
1XLPj4bKoMWcVvhL/g5A4Glp/pVMW1k/9Yq8zZAtYlsLRlqG5rLOutSadcqHcYaJ
cTD/pFf7b2oPtkTPyoFml75KgHBT/8uvAvFDOWA66Id2z6T11+PsBT/6XnGDiwKg
tpGTNzx3kPIKIzOAOHqVW6UBxFOeabebXLT8wUz3VwNn/UbG6gkumMNApMAyF2q1
zU0nAh8+7Ek6Dr4OFWE6BfW6sgg/l7i1lA8XoAmqG7ZTrSptCc59fvaZJxPruG+Q
dMsU6QgR77JJjbZTinf9a1jReZ/liZrx2gZXedVKdILrjmDSq0UnGcxjUOEDZOGy
2/dbrlJhv+LhpcJtuPpxPCfoqbW5L0ynzmuYuXRdRz3lTHiOWIRx5gugrhO+wH4D
4gvZhbw3XCiYfpYHYhl8A1EH5kanKgdXDocz9yIm7mZm89gngufF/HkeXS3ZU25T
yThyBGulGjqN4FCdgf1HolkTfFjnfSx4qJovJ58eHga+HNLXRkTecZZcbFy2OOHv
8Bx0PIlwj4RgSaRLWQUudAhdhKS2g22DKDDljxFwhkMPNghvqkYMJCRDKLu6GBXQ
4YsLKM+TaShHFjSpx+ao
=rpvb
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/ARM updates for 4.6
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- Various optimizations to the vgic save/restore code
Conflicts:
include/uapi/linux/kvm.h
In order to let the GICv3 code be more lazy in the way it
accesses the LRs, it is necessary to start with a clean slate.
Let's reset the LRs on each CPU when the vgic is probed (which
includes a round trip to EL2...).
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Just like on GICv2, we're a bit hammer-happy with GICv3, and access
them more often than we should.
Adopt a policy similar to what we do for GICv2, only save/restoring
the minimal set of registers. As we don't access the registers
linearly anymore (we may skip some), the convoluted accessors become
slightly simpler, and we can drop the ugly indexing macro that
tended to confuse the reviewers.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The GICD_SGIR register lives a long way from the beginning of
the handler array, which is searched linearly. As this is hit
pretty often, let's move it up. This saves us some precious
cycles when the guest is generating IPIs.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
So far, we're always writing all possible LRs, setting the empty
ones with a zero value. This is obvious doing a lot of work for
nothing, and we're better off clearing those we've actually
dirtied on the exit path (it is very rare to inject more than one
interrupt at a time anyway).
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
In order to let make the GICv2 code more lazy in the way it
accesses the LRs, it is necessary to start with a clean slate.
Let's reset the LRs on each CPU when the vgic is probed.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
On exit, any empty LR will be signaled in GICH_ELRSR*. Which
means that we do not have to save it, and we can just clear
its state in the in-memory copy.
Take this opportunity to move the LR saving code into its
own function.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
In order to make the saving path slightly more readable and
prepare for some more optimizations, let's move the GICH_ELRSR
saving to its own function.
No functional change.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Next on our list of useless accesses is the maintenance interrupt
status registers (GICH_MISR, GICH_EISR{0,1}).
It is pointless to save them if we haven't asked for a maintenance
interrupt the first place, which can only happen for two reasons:
- Underflow: GICH_HCR_UIE will be set,
- EOI: GICH_LR_EOI will be set.
These conditions can be checked on the in-memory copies of the regs.
Should any of these two condition be valid, we must read GICH_MISR.
We can then check for GICH_MISR_EOI, and only when set read
GICH_EISR*.
This means that in most case, we don't have to save them at all.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
GICv2 registers are *slow*. As in "terrifyingly slow". Which is bad.
But we're equaly bad, as we make a point in accessing them even if
we don't have any interrupt in flight.
A good solution is to first find out if we have anything useful to
write into the GIC, and if we don't, to simply not do it. This
involves tracking which LRs actually have something valid there.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
For the kvm_is_error_hva, ubsan complains if the uninitialized writable
is passed to __direct_map, even though the value itself is not used
(__direct_map goes to mmu_set_spte->set_spte->set_mmio_spte but never
looks at that argument).
Ensuring that __gfn_to_pfn_memslot initializes *writable is cheap and
avoids this kind of issue.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Programming the active state in the (re)distributor can be an
expensive operation so it makes some sense to try and reduce
the number of accesses as much as possible. So far, we
program the active state on each VM entry, but there is some
opportunity to do less.
An obvious solution is to cache the active state in memory,
and only program it in the HW when conditions change. But
because the HW can also change things under our feet (the active
state can transition from 1 to 0 when the guest does an EOI),
some precautions have to be taken, which amount to only caching
an "inactive" state, and always programing it otherwise.
With this in place, we observe a reduction of around 700 cycles
on a 2GHz GICv2 platform for a NULL hypercall.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
To configure the virtual PMUv3 overflow interrupt number, we use the
vcpu kvm_device ioctl, encapsulating the KVM_ARM_VCPU_PMU_V3_IRQ
attribute within the KVM_ARM_VCPU_PMU_V3_CTRL group.
After configuring the PMUv3, call the vcpu ioctl with attribute
KVM_ARM_VCPU_PMU_V3_INIT to initialize the PMUv3.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
To support guest PMUv3, use one bit of the VCPU INIT feature array.
Initialize the PMU when initialzing the vcpu with that bit and PMU
overflow interrupt set.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When KVM frees VCPU, it needs to free the perf_event of PMU.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When resetting vcpu, it needs to reset the PMU state to initial status.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When calling perf_event_create_kernel_counter to create perf_event,
assign a overflow handler. Then when the perf event overflows, set the
corresponding bit of guest PMOVSSET register. If this counter is enabled
and its interrupt is enabled as well, kick the vcpu to sync the
interrupt.
On VM entry, if there is counter overflowed and interrupt level is
changed, inject the interrupt with corresponding level. On VM exit, sync
the interrupt level as well if it has been changed.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
According to ARMv8 spec, when writing 1 to PMCR.E, all counters are
enabled by PMCNTENSET, while writing 0 to PMCR.E, all counters are
disabled. When writing 1 to PMCR.P, reset all event counters, not
including PMCCNTR, to zero. When writing 1 to PMCR.C, reset PMCCNTR to
zero.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Add access handler which emulates writing and reading PMSWINC
register and add support for creating software increment event.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Since the reset value of PMOVSSET and PMOVSCLR is UNKNOWN, use
reset_unknown for its reset handler. Add a handler to emulate writing
PMOVSSET or PMOVSCLR register.
When writing non-zero value to PMOVSSET, the counter and its interrupt
is enabled, kick this vcpu to sync PMU interrupt.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When we use tools like perf on host, perf passes the event type and the
id of this event type category to kernel, then kernel will map them to
hardware event number and write this number to PMU PMEVTYPER<n>_EL0
register. When getting the event number in KVM, directly use raw event
type to create a perf_event for it.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Since the reset value of PMCNTENSET and PMCNTENCLR is UNKNOWN, use
reset_unknown for its reset handler. Add a handler to emulate writing
PMCNTENSET or PMCNTENCLR register.
When writing to PMCNTENSET, call perf_event_enable to enable the perf
event. When writing to PMCNTENCLR, call perf_event_disable to disable
the perf event.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
These kind of registers include PMEVCNTRn, PMCCNTR and PMXEVCNTR which
is mapped to PMEVCNTRn.
The access handler translates all aarch32 register offsets to aarch64
ones and uses vcpu_sys_reg() to access their values to avoid taking care
of big endian.
When reading these registers, return the sum of register value and the
value perf event counts.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
We already have virt/kvm/arm/ containing timer and vgic stuff.
Add yet another subdirectory to contain the hyp-specific files
(timer and vgic again).
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The problem:
On -rt, an emulated LAPIC timer instances has the following path:
1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled
This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.
The solution:
Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.
Normal waitqueues make use of spinlocks, which on -RT
are sleepable locks. Therefore, waking up a waitqueue
waiter involves locking a sleeping lock, which
is not allowed from hard interrupt context.
cyclictest command line:
This patch reduces the average latency in my tests from 14us to 11us.
Daniel writes:
Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
benchmark on mainline. The test was run 1000 times on
tip/sched/core 4.4.0-rc8-01134-g0905f04:
./x86-run x86/tscdeadline_latency.flat -cpu host
with idle=poll.
The test seems not to deliver really stable numbers though most of
them are smaller. Paolo write:
"Anything above ~10000 cycles means that the host went to C1 or
lower---the number means more or less nothing in that case.
The mean shows an improvement indeed."
Before:
min max mean std
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 5162.596000 2019270.084000 5824.491541 20681.645558
std 75.431231 622607.723969 89.575700 6492.272062
min 4466.000000 23928.000000 5537.926500 585.864966
25% 5163.000000 1613252.750000 5790.132275 16683.745433
50% 5175.000000 2281919.000000 5834.654000 23151.990026
75% 5190.000000 2382865.750000 5861.412950 24148.206168
max 5228.000000 4175158.000000 6254.827300 46481.048691
After
min max mean std
count 1000.000000 1000.00000 1000.000000 1000.000000
mean 5143.511000 2076886.10300 5813.312474 21207.357565
std 77.668322 610413.09583 86.541500 6331.915127
min 4427.000000 25103.00000 5529.756600 559.187707
25% 5148.000000 1691272.75000 5784.889825 17473.518244
50% 5160.000000 2308328.50000 5832.025000 23464.837068
75% 5172.000000 2393037.75000 5853.177675 24223.969976
max 5222.000000 3922458.00000 6186.720500 42520.379830
[Patch was originaly based on the swait implementation found in the -rt
tree. Daniel ported it to mainline's version and gathered the
benchmark numbers for tscdeadline_latency test.]
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
- Fix per-vcpu vgic bitmap allocation
- Do not give copy random memory on MMIO read
- Fix GICv3 APR register restore order
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWzefIAAoJECPQ0LrRPXpDHkEQAIPGVBEilV0Na9QfIcQBiSxK
IGwFSXAIa2sScfjAyDPbSME/R912XTzdfXLgvYjoUhP8WUX3g/2dRD7OcYOh33zB
MgUg6QRSIUIyXj6HzOsFnt/rOWlEchjGXzcyVzlQTRjJhIRyHnFprUJsVbPk1Wc8
NJSXlyAYc3dHmJB29NjAgWRhZGmBx9SddRPfHFYLv9DoVkFGpD+TYL6XMeyfY8Eh
PVjGipi8K8kl4DJb/pc5kOhtqoXu30JqVlgvpUAQEPSbYQSBbdmjRpd1Ol7M73b1
sX1+UQmuIk3wcij/YpD3Ep70N5pfjgGAqms1vzBvTk6PKTXKtrjj15uOYcWgx38Z
W9llAnlzOY5+1htirxiIdfy44gxChcWb5XTykxnJXKEaEQdVHx5E8Yc9Nf3TbNMr
cLJh5CX9KowOxjW/HmbXXKrL2VNyb0XaecH0VWUV/QNeVqvbY/o38VRgTU0EMuoJ
nY1QeP3DOQfpq44UHhhzY9gx3myxW4MBr/C/vcbsNi3KiHwP1BIDygenf1cq+FID
4t/qXEJ+7ScEcDeiw+dTRPodD+6BwL4SH67aGbrxYE2yU9vugdkq2EtP3i5Z0iga
cKPdzAcFoBJJF4OKcTjdk34dEzGiSVcDdNXhAmIzpHL6xqMwyYNIYUyrHivu3QwP
8Ctb1ReLpiF574/JhDMo
=/Pum
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM fixes for 4.5-rc6
- Fix per-vcpu vgic bitmap allocation
- Do not give copy random memory on MMIO read
- Fix GICv3 APR register restore order
When we allocate bitmaps in vgic_vcpu_init_maps, we divide the number of
bits we need by 8 to figure out how many bytes to allocate. However,
bitmap elements are always accessed as unsigned longs, and if we didn't
happen to allocate a size such that size % sizeof(unsigned long) == 0,
bitmap accesses may go past the end of the allocation.
When using KASAN (which does byte-granular access checks), this results
in a continuous stream of BUGs whenever these bitmaps are accessed:
=============================================================================
BUG kmalloc-128 (Tainted: G B ): kasan: bad access detected
-----------------------------------------------------------------------------
INFO: Allocated in vgic_init.part.25+0x55c/0x990 age=7493 cpu=3 pid=1730
INFO: Slab 0xffffffbde6d5da40 objects=16 used=15 fp=0xffffffc935769700 flags=0x4000000000000080
INFO: Object 0xffffffc935769500 @offset=1280 fp=0x (null)
Bytes b4 ffffffc9357694f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Object ffffffc935769500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Object ffffffc935769510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Object ffffffc935769520: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Object ffffffc935769530: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Object ffffffc935769540: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Object ffffffc935769550: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Object ffffffc935769560: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Object ffffffc935769570: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Padding ffffffc9357695b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Padding ffffffc9357695c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Padding ffffffc9357695d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Padding ffffffc9357695e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Padding ffffffc9357695f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
CPU: 3 PID: 1740 Comm: kvm-vcpu-0 Tainted: G B 4.4.0+ #17
Hardware name: ARM Juno development board (r1) (DT)
Call trace:
[<ffffffc00008e770>] dump_backtrace+0x0/0x280
[<ffffffc00008ea04>] show_stack+0x14/0x20
[<ffffffc000726360>] dump_stack+0x100/0x188
[<ffffffc00030d324>] print_trailer+0xfc/0x168
[<ffffffc000312294>] object_err+0x3c/0x50
[<ffffffc0003140fc>] kasan_report_error+0x244/0x558
[<ffffffc000314548>] __asan_report_load8_noabort+0x48/0x50
[<ffffffc000745688>] __bitmap_or+0xc0/0xc8
[<ffffffc0000d9e44>] kvm_vgic_flush_hwstate+0x1bc/0x650
[<ffffffc0000c514c>] kvm_arch_vcpu_ioctl_run+0x2ec/0xa60
[<ffffffc0000b9a6c>] kvm_vcpu_ioctl+0x474/0xa68
[<ffffffc00036b7b0>] do_vfs_ioctl+0x5b8/0xcb0
[<ffffffc00036bf34>] SyS_ioctl+0x8c/0xa0
[<ffffffc000086cb0>] el0_svc_naked+0x24/0x28
Memory state around the buggy address:
ffffffc935769400: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffffffc935769480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffffffc935769500: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffffffc935769580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffffffc935769600: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Fix the issue by always allocating a multiple of sizeof(unsigned long),
as we do elsewhere in the vgic code.
Fixes: c1bfb577a ("arm/arm64: KVM: vgic: switch to dynamic allocation")
Cc: stable@vger.kernel.org
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
To make the intention clearer, use list_first_entry instead of
list_entry.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Right now halt_poll_ns can be change during runtime. The
grow and shrink factors can only be set during module load.
Lets fix several aspects of grow shrink:
- make grow/shrink changeable by root
- make all variables unsigned int
- read the variables once to prevent races
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called. For now,
we allow the old-style calls, but warn when they are used.
(implemented in previous patch)
This patch switches all callers of:
get_user_pages()
get_user_pages_unlocked()
get_user_pages_locked()
to stop passing tsk/mm so they will no longer see the warnings.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For protection keys, we need to understand whether protections
should be enforced in software or not. In general, we enforce
protections when working on our own task, but not when on others.
We call these "current" and "remote" operations.
This patch introduces a new get_user_pages() variant:
get_user_pages_remote()
Which is a replacement for when get_user_pages() is called on
non-current tsk/mm.
We also introduce a new gup flag: FOLL_REMOTE which can be used
for the "__" gup variants to get this new behavior.
The uprobes is_trap_at_addr() location holds mmap_sem and
calls get_user_pages(current->mm) on an instruction address. This
makes it a pretty unique gup caller. Being an instruction access
and also really originating from the kernel (vs. the app), I opted
to consider this a 'remote' access where protection keys will not
be enforced.
Without protection keys, this patch should not change any behavior.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 4b4b4512da ("arm/arm64: KVM: Rework the arch timer to use
level-triggered semantics") brought the virtual architected timer
closer to the VGIC. There is one occasion were we don't properly
check for the VGIC actually having been initialized before, but
instead go on to check the active state of some IRQ number.
If userland hasn't instantiated a virtual GIC, we end up with a
kernel NULL pointer dereference:
=========
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ffffffc9745c5000
[00000000] *pgd=00000009f631e003, *pud=00000009f631e003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#2] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 2144 Comm: kvm_simplest-ar Tainted: G D 4.5.0-rc2+ #1300
Hardware name: ARM Juno development board (r1) (DT)
task: ffffffc976da8000 ti: ffffffc976e28000 task.ti: ffffffc976e28000
PC is at vgic_bitmap_get_irq_val+0x78/0x90
LR is at kvm_vgic_map_is_active+0xac/0xc8
pc : [<ffffffc0000b7e28>] lr : [<ffffffc0000b972c>] pstate: 20000145
....
=========
Fix this by bailing out early of kvm_timer_flush_hwstate() if we don't
have a VGIC at all.
Reported-by: Cosmin Gorgovan <cosmin@linux-geek.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: <stable@vger.kernel.org> # 4.4.x
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
support of 248 VCPUs.
* ARM: rewrite of the arm64 world switch in C, support for
16-bit VM identifiers. Performance counter virtualization
missed the boat.
* x86: Support for more Hyper-V features (synthetic interrupt
controller), MMU cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWlSKwAAoJEL/70l94x66DY0UIAK5vp4zfQoQOJC4KP4Xgxwdu
kpnK2Boz3/74o1b0y5+eJZoUZCsXCVLtmP5uhmMxUYWDgByFG2X8ZDhPFwB5FYLT
2dN+Lr4tsolgIfRdHZtrT6Svp9SDL039bWTdscnbR6l37/j9FRWvpKdhI3orloFD
/i4CSW2dVIq1/9Xctwu/rtcOEesEx4Cad+6YV3/530eVAXFzE908nXfmqJNZTocY
YCGcmrMVCOu0ng5QM4xSzmmYjKMLUcRs+QzZWkVBzdJtTgwZUr09yj7I2dZ1yj/i
cxYrJy6shSwE74XkXsmvG+au3C5u3vX4tnXjBFErnPJ99oqzHatVnFWNRhj4dLQ=
=PIj1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"PPC changes will come next week.
- s390: Support for runtime instrumentation within guests, support of
248 VCPUs.
- ARM: rewrite of the arm64 world switch in C, support for 16-bit VM
identifiers. Performance counter virtualization missed the boat.
- x86: Support for more Hyper-V features (synthetic interrupt
controller), MMU cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (115 commits)
kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
kvm/x86: Hyper-V SynIC timers tracepoints
kvm/x86: Hyper-V SynIC tracepoints
kvm/x86: Update SynIC timers on guest entry only
kvm/x86: Skip SynIC vector check for QEMU side
kvm/x86: Hyper-V fix SynIC timer disabling condition
kvm/x86: Reorg stimer_expiration() to better control timer restart
kvm/x86: Hyper-V unify stimer_start() and stimer_restart()
kvm/x86: Drop stimer_stop() function
kvm/x86: Hyper-V timers fix incorrect logical operation
KVM: move architecture-dependent requests to arch/
KVM: renumber vcpu->request bits
KVM: document which architecture uses each request bit
KVM: Remove unused KVM_REQ_KICK to save a bit in vcpu->requests
kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock
KVM: s390: implement the RI support of guest
kvm/s390: drop unpaired smp_mb
kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa
KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table()
arm/arm64: KVM: Detect vGIC presence at runtime
...
Since the numbers now overlap, it makes sense to enumerate
them in asm/kvm_host.h rather than linux/kvm_host.h. Functions
that refer to architecture-specific requests are also moved
to arch/.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Complete rewrite of the arm64 world switch in C, hopefully
paving the way for more sharing with the 32bit code, better
maintainability and easier integration of new features.
Also smaller and slightly faster in some cases...
- Support for 16bit VM identifiers
- Various cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWe80IAAoJECPQ0LrRPXpDdt8P/ittxzklIT7rsZxdOiIY6vLQ
i2hWGo1KdZR+8rsNyQEeGyg2Ocdja0Vld9ciBKgXKeKtc6x6AHfq0x6eyGRbF2jJ
Wdkd2a9lLJVJJIf1LBhOQuwjNiEvAgvqO5nwXL77s/rqx3Ur5OlyohSvRFBy7Pqp
8cdnCV/43I/y7k0iGhitFVrEC9TL0cfeJmM7YhXkt8IcpkcpCDfgdI7wgIb0ntvv
dqvrRmfp+Q3/hJ6SVRsy6uzOrjFjRE8hIIG6TiqfRd/FgI5x2xvGkSAE3Wx2YdRM
myPDiAuY2wOyALZpn9zD7qFMOfI2wX8kaX5S/ctnbvLelkmQblI39/zfYuxJ36xC
Mo2yMKcvT37AIMz2fxx3mGnIR7NZBNXVQGJmv/1p9vMQ8RbUXUhT0w6hP5SH9S7m
RDoOkfd37wQugQ7bgI7cqg4hodMRlybGPq8QaKp80y0Ej3cPblM+y0fbR153SSbS
6nCwYceFLdWJEV+tTFpKD5cvxOGeYfoC/8LcVRYRcWg6nlr4+qo61MHevyCe/Qxw
Pw+z9wFpoVKumRT72KmzFUxFQqRnTshE3KJqNJdsqMPM8ZuW5TJ/MtUa+JdAWmSH
dEAqd7Hy5W1CqgGEk+los+QlKw+5uFepcZ7SOuQ2ehME/Ae3xI8s9+UE6yqeHx6Y
PV2s5lyJRCPfm3qOu7TS
=wuir
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-next
KVM/ARM changes for Linux v4.5
- Complete rewrite of the arm64 world switch in C, hopefully
paving the way for more sharing with the 32bit code, better
maintainability and easier integration of new features.
Also smaller and slightly faster in some cases...
- Support for 16bit VM identifiers
- Various cleanups
Having the system register numbers as #defines has been a pain
since day one, as the ordering is pretty fragile, and moving
things around leads to renumbering and epic conflict resolutions.
Now that we're mostly acessing the sysreg file in C, an enum is
a much better type to use, and we can clean things up a bit.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
We store GICv3 LRs in reverse order so that the CPU can save/restore
them in rever order as well (don't ask why, the design is crazy),
and yet generate memory traffic that doesn't completely suck.
We need this macro to be available to the C version of save/restore.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
vgic_io_ops is only referenced within vgic.c, so it can be declared
static.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
External inputs to the vgic from time to time need to poke into the
state of a virtual interrupt, the prime example is the architected timer
code.
Since the IRQ's active state can be represented in two places; the LR or
the distributor, we first loop over the LRs but if not active in the LRs
we just return if *any* IRQ is active on the VCPU in question.
This is of course bogus, as we should check if the specific IRQ in
quesiton is active on the distributor instead.
Reported-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
KVM creates debugfs files to export VM statistics to userland. To be
able to remove them on kvm exit it tracks the files' dentries.
Since their parent directory is also tracked and since each parent
direntry knows its children we can easily remove them by using
debugfs_remove_recursive(kvm_debugfs_dir). Therefore we don't
need the extra tracking in the kvm_stats_debugfs_item anymore.
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-By: Sascha Silbe <silbe@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Let's reuse the new common function for VPCU lookup by id.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[split out the new function into a separate patch]
This patch makes kvm_is_visible_gfn return bool due to this particular
function only using either one or zero as its return value.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kmem_cache_destroy() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Actually kvm_arch_irq_routing_update() should be
kvm_arch_post_irq_routing_update() as it's called at the end
of irq routing update.
This renaming frees kvm_arch_irq_routing_update function name.
kvm_arch_irq_routing_update() weak function which will be used
to update mappings for arch-specific irq routing entries
(in particular, the upcoming Hyper-V synthetic interrupts).
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We were probing the physial distributor state for the active state of a
HW virtual IRQ, because we had seen evidence that the LR state was not
cleared when the guest deactivated a virtual interrupted.
However, this issue turned out to be a software bug in the GIC, which
was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29
(KVM: arm/arm64: arch_timer: Preserve physical dist. active
state on LR.active, 2015-11-24)
Therefore, get rid of the complexities and just look at the LR.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We were incorrectly removing the active state from the physical
distributor on the timer interrupt when the timer output level was
deasserted. We shouldn't be doing this without considering the virtual
interrupt's active state, because the architecture requires that when an
LR has the HW bit set and the pending or active bits set, then the
physical interrupt must also have the corresponding bits set.
This addresses an issue where we have been observing an inconsistency
between the LR state and the physical distributor state where the LR
state was active and the physical distributor was not active, which
shouldn't happen.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
handling.
PPC: Mostly bug fixes.
ARM: No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite for
IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86: quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new component (in
virt/lib/) that connects VFIO and KVM together. The same infrastructure
will be used for ARM interrupt forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic interrupt
controller will have to wait for 4.5. These will let KVM expose Hyper-V
devices.
- nested virtualization now supports VPID (same as PCID but for vCPUs)
which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for clflushopt,
clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel + IOAPIC/PIC/PIT in
userspace, which reduces the attack surface of the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten to not
require help from the hypervisor.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWO2IQAAoJEL/70l94x66D/K0H/3AovAgYmJQToZlimsktMk6a
f2xhdIqfU5lIQQh5uNBCfL3o9o8H9Py1ym7aEw3fmztPHHJYc91oTatt2UEKhmEw
VtZHp/dFHt3hwaIdXmjRPEXiYctraKCyrhaUYdWmUYkoKi7lW5OL5h+S7frG2U6u
p/hFKnHRZfXHr6NSgIqvYkKqtnc+C0FWY696IZMzgCksOO8jB1xrxoSN3tANW3oJ
PDV+4og0fN/Fr1capJUFEc/fejREHneANvlKrLaa8ht0qJQutoczNADUiSFLcMPG
iHljXeDsv5eyjMtUuIL8+MPzcrIt/y4rY41ZPiKggxULrXc6H+JJL/e/zThZpXc=
=iv2z
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.4.
s390:
A bunch of fixes and optimizations for interrupt and time handling.
PPC:
Mostly bug fixes.
ARM:
No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite
for IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86:
Quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new
component (in virt/lib/) that connects VFIO and KVM together.
The same infrastructure will be used for ARM interrupt
forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic
interrupt controller will have to wait for 4.5. These will let
KVM expose Hyper-V devices.
- nested virtualization now supports VPID (same as PCID but for
vCPUs) which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for
clflushopt, clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel +
IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten
to not require help from the hypervisor"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
KVM: VMX: Fix commit which broke PML
KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
KVM: x86: allow RSM from 64-bit mode
KVM: VMX: fix SMEP and SMAP without EPT
KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
KVM: device assignment: remove pointless #ifdefs
KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
KVM: x86: zero apic_arb_prio on reset
drivers/hv: share Hyper-V SynIC constants with userspace
KVM: x86: handle SMBASE as physical address in RSM
KVM: x86: add read_phys to x86_emulate_ops
KVM: x86: removing unused variable
KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
KVM: arm/arm64: Optimize away redundant LR tracking
KVM: s390: use simple switch statement as multiplexer
KVM: s390: drop useless newline in debugging data
KVM: s390: SCA must not cross page boundaries
KVM: arm: Do not indent the arguments of DECLARE_BITMAP
...
We do not want to do too much work in atomic context, in particular
not walking all the VCPUs of the virtual machine. So we want
to distinguish the architecture-specific injection function for irqfd
from kvm_set_msi. Since it's still empty, reuse the newly added
kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Includes a number of fixes for the arch-timer, introducing proper
level-triggered semantics for the arch-timers, a series of patches to
synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint
improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups
getting rid of redundant state, and finally a stylistic change that gets rid of
some ctags warnings.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWOhgAAAoJEEtpOizt6ddyKS8H/2ZHMTPjo6yChnrusNWy4Qbr
6laPDlzL+g45oMQRwNL7GnM1deRftaxvT2Wi+X84D/6Y/BD6MPds4HgtBfuWcSZ1
CyRJ0Ot/zrxenucSuJuOjq+a9gdizdAczkbB1MfYDULJH8fb6D+7RYLo3zgh4Xo4
pla3L9U6gSWe+YopBjZtZH43m3fwiwSM/v+uHOTIcXrsbR+fEgx/EFSKmA/DUCuo
P5cFO/ceUGu7nATCexu5V82TgR2hvurrsR7mqfwY8YcF6HRM+NEOoS29xWC77v5S
u/F08TKuKQLv0YTEFTyLETI/oEeuC0cHtrRQBNf4+9kXEOzKyXaae0wR/I6X2Ss=
=GMNk
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/ARM Changes for v4.4-rc1
Includes a number of fixes for the arch-timer, introducing proper
level-triggered semantics for the arch-timers, a series of patches to
synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint
improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups
getting rid of redundant state, and finally a stylistic change that gets rid of
some ctags warnings.
Conflicts:
arch/x86/include/asm/kvm_host.h
Now we see that vgic_set_lr() and vgic_sync_lr_elrsr() are always used
together. Merge them into one function, saving from second vgic_ops
dereferencing every time.
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
1. Remove unnecessary 'irq' argument, because irq number can be retrieved
from the LR.
2. Since cff9211eb1
("arm/arm64: KVM: Fix arch timer behavior for disabled interrupts ")
LR_STATE_PENDING is queued back by vgic_retire_lr() itself. Also, it
clears vlr.state itself. Therefore, we remove the same, now duplicated,
check with all accompanying bit manipulations from vgic_unqueue_irqs().
3. vgic_retire_lr() is always accompanied by vgic_irq_clear_queued(). Since
it already does more than just clearing the LR, move
vgic_irq_clear_queued() inside of it.
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently we use vgic_irq_lr_map in order to track which LRs hold which
IRQs, and lr_used bitmap in order to track which LRs are used or free.
vgic_irq_lr_map is actually used only for piggy-back optimization, and
can be easily replaced by iteration over lr_used. This is good because in
future, when LPI support is introduced, number of IRQs will grow up to at
least 16384, while numbers from 1024 to 8192 are never going to be used.
This would be a huge memory waste.
In its turn, lr_used is also completely redundant since
ae705930fc ("arm/arm64: KVM: Keep elrsr/aisr
in sync with software model"), because together with lr_used we also update
elrsr. This allows to easily replace lr_used with elrsr, inverting all
conditions (because in elrsr '1' means 'free').
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Pull irq updates from Thomas Gleixner:
"The irq departement delivers:
- Rework the irqdomain core infrastructure to accomodate ACPI based
systems. This is required to support ARM64 without creating
artificial device tree nodes.
- Sanitize the ACPI based ARM GIC initialization by making use of the
new firmware independent irqdomain core
- Further improvements to the generic MSI management
- Generalize the irq migration on CPU hotplug
- Improvements to the threaded interrupt infrastructure
- Allow the migration of "chained" low level interrupt handlers
- Allow optional force masking of interrupts in disable_irq[_nosysnc]
- Support for two new interrupt chips - Sigh!
- A larger set of errata fixes for ARM gicv3
- The usual pile of fixes, updates, improvements and cleanups all
over the place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
Document that IRQ_NONE should be returned when IRQ not actually handled
PCI/MSI: Allow the MSI domain to be device-specific
PCI: Add per-device MSI domain hook
of/irq: Use the msi-map property to provide device-specific MSI domain
of/irq: Split of_msi_map_rid to reuse msi-map lookup
irqchip/gic-v3-its: Parse new version of msi-parent property
PCI/MSI: Use of_msi_get_domain instead of open-coded "msi-parent" parsing
of/irq: Use of_msi_get_domain instead of open-coded "msi-parent" parsing
of/irq: Add support code for multi-parent version of "msi-parent"
irqchip/gic-v3-its: Add handling of PCI requester id.
PCI/MSI: Add helper function pci_msi_domain_get_msi_rid().
of/irq: Add new function of_msi_map_rid()
Docs: dt: Add PCI MSI map bindings
irqchip/gic-v2m: Add support for multiple MSI frames
irqchip/gic-v3: Fix translation of LPIs after conversion to irq_fwspec
irqchip/mxs: Add Alphascale ASM9260 support
irqchip/mxs: Prepare driver for hardware with different offsets
irqchip/mxs: Panic if ioremap or domain creation fails
irqdomain: Documentation updates
irqdomain/msi: Use fwnode instead of of_node
...
The VGIC and timer code for KVM arm/arm64 doesn't have any tracepoints
or tracepoint infrastructure defined. Rewriting some of the timer code
handling showed me how much we need this, so let's add these simple
trace points once and for all and we can easily expand with additional
trace points in these files as we go along.
Cc: Wei Huang <wei@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We mark edge-triggered interrupts with the HW bit set as queued to
prevent the VGIC code from injecting LRs with both the Active and
Pending bits set at the same time while also setting the HW bit,
because the hardware does not support this.
However, this means that we must also clear the queued flag when we sync
back a LR where the state on the physical distributor went from active
to inactive because the guest deactivated the interrupt. At this point
we must also check if the interrupt is pending on the distributor, and
tell the VGIC to queue it again if it is.
Since these actions on the sync path are extremely close to those for
level-triggered interrupts, rename process_level_irq to
process_queued_irq, allowing it to cater for both cases.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The arch timer currently uses edge-triggered semantics in the sense that
the line is never sampled by the vgic and lowering the line from the
timer to the vgic doesn't have any effect on the pending state of
virtual interrupts in the vgic. This means that we do not support a
guest with the otherwise valid behavior of (1) disable interrupts (2)
enable the timer (3) disable the timer (4) enable interrupts. Such a
guest would validly not expect to see any interrupts on real hardware,
but will see interrupts on KVM.
This patch fixes this shortcoming through the following series of
changes.
First, we change the flow of the timer/vgic sync/flush operations. Now
the timer is always flushed/synced before the vgic, because the vgic
samples the state of the timer output. This has the implication that we
move the timer operations in to non-preempible sections, but that is
fine after the previous commit getting rid of hrtimer schedules on every
entry/exit.
Second, we change the internal behavior of the timer, letting the timer
keep track of its previous output state, and only lower/raise the line
to the vgic when the state changes. Note that in theory this could have
been accomplished more simply by signalling the vgic every time the
state *potentially* changed, but we don't want to be hitting the vgic
more often than necessary.
Third, we get rid of the use of the map->active field in the vgic and
instead simply set the interrupt as active on the physical distributor
whenever the input to the GIC is asserted and conversely clear the
physical active state when the input to the GIC is deasserted.
Fourth, and finally, we now initialize the timer PPIs (and all the other
unused PPIs for now), to be level-triggered, and modify the sync code to
sample the line state on HW sync and re-inject a new interrupt if it is
still pending at that time.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We currently initialize the SGIs to be enabled in the VGIC code, but we
use the VGIC_NR_PPIS define for this purpose, instead of the the more
natural VGIC_NR_SGIS. Change this slightly confusing use of the
defines.
Note: This should have no functional change, as both names are defined
to the number 16.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The GICD_ICFGR allows the bits for the SGIs and PPIs to be read only.
We currently simulate this behavior by writing a hardcoded value to the
register for the SGIs and PPIs on every write of these bits to the
register (ignoring what the guest actually wrote), and by writing the
same value as the reset value to the register.
This is a bit counter-intuitive, as the register is RO for these bits,
and we can just implement it that way, allowing us to control the value
of the bits purely in the reset code.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently vgic_process_maintenance() processes dealing with a completed
level-triggered interrupt directly, but we are soon going to reuse this
logic for level-triggered mapped interrupts with the HW bit set, so
move this logic into a separate static function.
Probably the most scary part of this commit is convincing yourself that
the current flow is safe compared to the old one. In the following I
try to list the changes and why they are harmless:
Move vgic_irq_clear_queued after kvm_notify_acked_irq:
Harmless because the only potential effect of clearing the queued
flag wrt. kvm_set_irq is that vgic_update_irq_pending does not set
the pending bit on the emulated CPU interface or in the
pending_on_cpu bitmask if the function is called with level=1.
However, the point of kvm_notify_acked_irq is to call kvm_set_irq
with level=0, and we set the queued flag again in
__kvm_vgic_sync_hwstate later on if the level is stil high.
Move vgic_set_lr before kvm_notify_acked_irq:
Also, harmless because the LR are cpu-local operations and
kvm_notify_acked only affects the dist
Move vgic_dist_irq_clear_soft_pend after kvm_notify_acked_irq:
Also harmless, because now we check the level state in the
clear_soft_pend function and lower the pending bits if the level is
low.
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We currently schedule a soft timer every time we exit the guest if the
timer did not expire while running the guest. This is really not
necessary, because the only work we do in the timer work function is to
kick the vcpu.
Kicking the vcpu does two things:
(1) If the vpcu thread is on a waitqueue, make it runnable and remove it
from the waitqueue.
(2) If the vcpu is running on a different physical CPU from the one
doing the kick, it sends a reschedule IPI.
The second case cannot happen, because the soft timer is only ever
scheduled when the vcpu is not running. The first case is only relevant
when the vcpu thread is on a waitqueue, which is only the case when the
vcpu thread has called kvm_vcpu_block().
Therefore, we only need to make sure a timer is scheduled for
kvm_vcpu_block(), which we do by encapsulating all calls to
kvm_vcpu_block() with kvm_timer_{un}schedule calls.
Additionally, we only schedule a soft timer if the timer is enabled and
unmasked, since it is useless otherwise.
Note that theoretically userspace can use the SET_ONE_REG interface to
change registers that should cause the timer to fire, even if the vcpu
is blocked without a scheduled timer, but this case was not supported
before this patch and we leave it for future work for now.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Some times it is useful for architecture implementations of KVM to know
when the VCPU thread is about to block or when it comes back from
blocking (arm/arm64 needs to know this to properly implement timers, for
example).
Therefore provide a generic architecture callback function in line with
what we do elsewhere for KVM generic-arch interactions.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We currently do a single update of the vgic state when the distributor
enable/disable control register is accessed and then bypass updating the
state for as long as the distributor remains disabled.
This is incorrect, because updating the state does not consider the
distributor enable bit, and this you can end up in a situation where an
interrupt is marked as pending on the CPU interface, but not pending on
the distributor, which is an impossible state to be in, and triggers a
warning. Consider for example the following sequence of events:
1. An interrupt is marked as pending on the distributor
- the interrupt is also forwarded to the CPU interface
2. The guest turns off the distributor (it's about to do a reboot)
- we stop updating the CPU interface state from now on
3. The guest disables the pending interrupt
- we remove the pending state from the distributor, but don't touch
the CPU interface, see point 2.
Since the distributor disable bit really means that no interrupts should
be forwarded to the CPU interface, we modify the code to keep updating
the internal VGIC state, but always set the CPU interface pending bits
to zero when the distributor is disabled.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
When a guest reboots or offlines/onlines CPUs, it is not uncommon for it
to clear the pending and active states of an interrupt through the
emulated VGIC distributor. However, since the architected timers are
defined by the architecture to be level triggered and the guest
rightfully expects them to be that, but we emulate them as
edge-triggered, we have to mimic level-triggered behavior for an
edge-triggered virtual implementation.
We currently do not signal the VGIC when the map->active field is true,
because it indicates that the guest has already been signalled of the
interrupt as required. Normally this field is set to false when the
guest deactivates the virtual interrupt through the sync path.
We also need to catch the case where the guest deactivates the interrupt
through the emulated distributor, again allowing guests to boot even if
the original virtual timer signal hit before the guest's GIC
initialization sequence is run.
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We have an interesting issue when the guest disables the timer interrupt
on the VGIC, which happens when turning VCPUs off using PSCI, for
example.
The problem is that because the guest disables the virtual interrupt at
the VGIC level, we never inject interrupts to the guest and therefore
never mark the interrupt as active on the physical distributor. The
host also never takes the timer interrupt (we only use the timer device
to trigger a guest exit and everything else is done in software), so the
interrupt does not become active through normal means.
The result is that we keep entering the guest with a programmed timer
that will always fire as soon as we context switch the hardware timer
state and run the guest, preventing forward progress for the VCPU.
Since the active state on the physical distributor is really part of the
timer logic, it is the job of our virtual arch timer driver to manage
this state.
The timer->map->active boolean field indicates whether we have signalled
this interrupt to the vgic and if that interrupt is still pending or
active. As long as that is the case, the hardware doesn't have to
generate physical interrupts and therefore we mark the interrupt as
active on the physical distributor.
We also have to restore the pending state of an interrupt that was
queued to an LR but was retired from the LR for some reason, while
remaining pending in the LR.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Reported-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
When lowering a level-triggered line from userspace, we forgot to lower
the pending bit on the emulated CPU interface and we also did not
re-compute the pending_on_cpu bitmap for the CPU affected by the change.
Update vgic_update_irq_pending() to fix the two issues above and also
raise a warning in vgic_quue_irq_to_lr if we encounter an interrupt
pending on a CPU which is neither marked active nor pending.
[ Commit text reworked completely - Christoffer ]
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Allow for arch-specific interrupt types to be set. For that, add
kvm_arch_set_irq() which takes interrupt type-specific action if it
recognizes the interrupt type given, and -EWOULDBLOCK otherwise.
The default implementation always returns -EWOULDBLOCK.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Factor out kvm_notify_acked_gsi() helper to iterate over EOI listeners
and notify those matching the given gsi.
It will be reused in the upcoming Hyper-V SynIC implementation.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The loop(for) inside irqfd_update() is unnecessary
because any other value for irq_entry.type will just trigger
schedule_work(&irqfd->inject) in irqfd_wakeup.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
async_pf_execute() seems to be missing a memory barrier which might
cause the waker to not notice the waiter and miss sending a wake_up as
in the following figure.
async_pf_execute kvm_vcpu_block
------------------------------------------------------------------------
spin_lock(&vcpu->async_pf.lock);
if (waitqueue_active(&vcpu->wq))
/* The CPU might reorder the test for
the waitqueue up here, before
prior writes complete */
prepare_to_wait(&vcpu->wq, &wait,
TASK_INTERRUPTIBLE);
/*if (kvm_vcpu_check_block(vcpu) < 0) */
/*if (kvm_arch_vcpu_runnable(vcpu)) { */
...
return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
!vcpu->arch.apf.halted)
|| !list_empty_careful(&vcpu->async_pf.done)
...
return 0;
list_add_tail(&apf->link,
&vcpu->async_pf.done);
spin_unlock(&vcpu->async_pf.lock);
waited = true;
schedule();
------------------------------------------------------------------------
The attached patch adds the missing memory barrier.
I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).
Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hardware virtualisation of GICv3 is only supported by 64bit hosts for
the moment. Some VGICv3 bits are missing from the 32bit side, and this
patch allows to still be able to build 32bit hosts when CONFIG_ARM_GIC_V3
is selected.
To this end, we introduce a new option, CONFIG_KVM_ARM_VGIC_V3, that is
only enabled on the 64bit side. The selection is done unconditionally
because CONFIG_ARM_GIC_V3 is always enabled on arm64.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This patch updates the Posted-Interrupts Descriptor when vCPU
is blocked.
pre-block:
- Add the vCPU to the blocked per-CPU list
- Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
post-block:
- Remove the vCPU from the per-CPU list
Signed-off-by: Feng Wu <feng.wu@intel.com>
[Concentrate invocation of pre/post-block hooks to vcpu_block. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds an arch specific hooks 'arch_update' in
'struct kvm_kernel_irqfd'. On Intel side, it is used to
update the IRTE when VT-d posted-interrupts is used.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds the registration/unregistration of an
irq_bypass_consumer on irqfd assignment/deassignment.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch introduces
- kvm_arch_irq_bypass_add_producer
- kvm_arch_irq_bypass_del_producer
- kvm_arch_irq_bypass_stop
- kvm_arch_irq_bypass_start
They make possible to specialize the KVM IRQ bypass consumer in
case CONFIG_KVM_HAVE_IRQ_BYPASS is set.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
[Add weak implementations of the callbacks. - Feng]
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move _irqfd_resampler and _irqfd struct declarations in a new
public header: kvm_irqfd.h. They are respectively renamed into
kvm_kernel_irqfd_resampler and kvm_kernel_irqfd. Those datatypes
will be used by architecture specific code, in the context of
IRQ bypass manager integration.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We need to build files in virt/lib/, which are now used by
KVM and VFIO, so add virt directory to the top Makefile.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Acked-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a physical I/O device is assigned to a virtual machine through
facilities like VFIO and KVM, the interrupt for the device generally
bounces through the host system before being injected into the VM.
However, hardware technologies exist that often allow the host to be
bypassed for some of these scenarios. Intel Posted Interrupts allow
the specified physical edge interrupts to be directly injected into a
guest when delivered to a physical processor while the vCPU is
running. ARM IRQ Forwarding allows forwarded physical interrupts to
be directly deactivated by the guest.
The IRQ bypass manager here is meant to provide the shim to connect
interrupt producers, generally the host physical device driver, with
interrupt consumers, generally the hypervisor, in order to configure
these bypass mechanism. To do this, we base the connection on a
shared, opaque token. For KVM-VFIO this is expected to be an
eventfd_ctx since this is the connection we already use to connect an
eventfd to an irqfd on the in-kernel path. When a producer and
consumer with matching tokens is found, callbacks via both registered
participants allow the bypass facilities to be automatically enabled.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All fields of kvm_io_range were initialized or copied explicitly
afterwards. So switch to use kmalloc().
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to support a userspace IOAPIC interacting with an in kernel
APIC, the EOI exit bitmaps need to be configurable.
If the IOAPIC is in userspace (i.e. the irqchip has been split), the
EOI exit bitmaps will be set whenever the GSI Routes are configured.
In particular, for the low MSI routes are reservable for userspace
IOAPICs. For these MSI routes, the EOI Exit bit corresponding to the
destination vector of the route will be set for the destination VCPU.
The intention is for the userspace IOAPICs to use the reservable MSI
routes to inject interrupts into the guest.
This is a slight abuse of the notion of an MSI Route, given that MSIs
classically bypass the IOAPIC. It might be worthwhile to add an
additional route type to improve clarity.
Compile tested for Intel x86.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We observed some performance degradation on s390x with dynamic
halt polling. Until we can provide a proper fix, let's enable
halt_poll_ns as default only for supported architectures.
Architectures are now free to set their own halt_poll_ns
default value.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Workaround for a Cortex-A57 erratum
- Bug fix for the debugging infrastructure
- Fix for 32bit guests with more than 4GB of address space
on a 32bit host
- A number of fixes for the (unusual) case when we don't use
the in-kernel GIC emulation
- Removal of ThumbEE handling on arm64, since these have been
dropped from the architecture before anyone actually ever
built a CPU
- Remove the KVM_ARM_MAX_VCPUS limitation which has become
fairly pointless
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV+slQAAoJECPQ0LrRPXpD+WUQAMLC3ZUasJX1gsVixd++zAwB
FXu0TFlKCUsLWllXZtyhGI6ya7ljuCzfhRbA/eZFmFVbwDnULt1p5ahw7eHCIZ2a
yY93TS6XN3YHwVpY7f2lDsvLhBLyWeTdWhj5TtLy6mslQyEUqxdmsiC7gl40Fp2S
8tKIxoYYRpmbgKl/Lbi8GxdHH6c0aQ2Nt7Fq4nV9dJqy5tiGdg6OxqgU/rVmkdkv
Rv1jrdtncstNRi9NBbKRRDp5DTqWboF35HJQpdIRpR8jJTLuuzzCimP5Hz9crKuO
uXchIq2GtQB60NklZtPL15zMdmfdq+JHwdC14v05kB5Ai8NThGwKYQ3JF+krO3cG
RKsAlrIq0AwPN8hAboLcKGzjLFFryaHZsa+d7elxaaDQz1FGz4uP56fIUURoGZuX
vWTsKLRKcuPCYtnV6Frg2BCTB6nq1cRgjmMC9TABnraelZ3z0lDl4wFngg4aL2u6
QYOdP8L++/S1HAPOF7VhFYndXkbM3KoVLAepev8jvzRnwg4QVrqsvfgwFSdMNcMz
ga7bJ4pUEP+Qq1i0qc41P9O708bCGm7TIw3CzTdKIZhc/l0t137lw1rhv67JfXZh
cAni4osjhpdZUT0F9lIl/6OQB3Kgk6on3cs909Y/tT1srh9s+iVO1AwpGY1j5T4j
gFRy90o2LBuepoI/8yF3
=NNtz
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-4.3-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
Second set of KVM/ARM changes for 4.3-rc2
- Workaround for a Cortex-A57 erratum
- Bug fix for the debugging infrastructure
- Fix for 32bit guests with more than 4GB of address space
on a 32bit host
- A number of fixes for the (unusual) case when we don't use
the in-kernel GIC emulation
- Removal of ThumbEE handling on arm64, since these have been
dropped from the architecture before anyone actually ever
built a CPU
- Remove the KVM_ARM_MAX_VCPUS limitation which has become
fairly pointless
This patch removes config option of KVM_ARM_MAX_VCPUS,
and like other ARCHs, just choose the maximum allowed
value from hardware, and follows the reasons:
1) from distribution view, the option has to be
defined as the max allowed value because it need to
meet all kinds of virtulization applications and
need to support most of SoCs;
2) using a bigger value doesn't introduce extra memory
consumption, and the help text in Kconfig isn't accurate
because kvm_vpu structure isn't allocated until request
of creating VCPU is sent from QEMU;
3) the main effect is that the field of vcpus[] in 'struct kvm'
becomes a bit bigger(sizeof(void *) per vcpu) and need more cache
lines to hold the structure, but 'struct kvm' is one generic struct,
and it has worked well on other ARCHs already in this way. Also,
the world switch frequecy is often low, for example, it is ~2000
when running kernel building load in VM from APM xgene KVM host,
so the effect is very small, and the difference can't be observed
in my test at all.
Cc: Dann Frazier <dann.frazier@canonical.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This new statistic can help diagnosing VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.
For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
is consuming about 30% more CPU than it would use without
polling. This would show as an abnormally high number of
attempted polling compared to the successful polls.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, if we had a zero length mmio eventfd assigned on
KVM_MMIO_BUS. It will never be found by kvm_io_bus_cmp() since it
always compares the kvm_io_range() with the length that guest
wrote. This will cause e.g for vhost, kick will be trapped by qemu
userspace instead of vhost. Fixing this by using zero length if an
iodevice is zero length.
Cc: stable@vger.kernel.org
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch factors out core eventfd assign/deassign logic and leaves
the argument checking and bus index selection to callers.
Cc: stable@vger.kernel.org
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We only want zero length mmio eventfd to be registered on
KVM_FAST_MMIO_BUS. So check this explicitly when arg->len is zero to
make sure this.
Cc: stable@vger.kernel.org
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After 'commit 0b8ba4a2b6 ("KVM: fix checkpatch.pl errors in
kvm/coalesced_mmio.h")', the declaration of the two function will exceed 80
characters.
This patch reduces the TAPs to make each line in 80 characters.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Fix timer interrupt injection after the rework
that went in during the merge window
- Reset the timer to zero on reboot
- Make sure the TCR_EL2 RES1 bits are really set to 1
- Fix a PSCI affinity bug for non-existing vcpus
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV9VQYAAoJECPQ0LrRPXpDYFsQAIp+nIxv6LijQdq310EF7Z95
1j16vx9NfsAsNH0pwiRmx4gfNOHlPp6f30Kb1HJf2TN08Rc7TS8j4Dr3zYnZEdQj
9eNGTjCjlz93VQwKnTBvBvYkHsCEWC76pOxiYoFmnLsGc4m/OilHhohOzOZTAMFC
Le07VXJwrhGHQ5bjdIMmFj+5rMj4eWfT2bYg8uVl5EUNFNkDgAMtT/Nbml90friC
x+r2I9H+G8hHmemOv6hefW7l2JScLSYpqLeGdxZUtnGy9LNP3+jTD1QyqUftUPuS
HtcB4zCtRk62nMupgFNV514Kf26wcwpS53vhd8aM2kxSkpr5xkUFoL+uJNT8ssIH
Y+FwAOeTD0osWbmxw8Gf9d3qYzPBQSCRP3E4mCbNNNsTlMRU+Hu9au0VNBKggkfF
fZ4UphCVX2C+cWaQcB40950h1F76Vx3boAfcMgiXoO+8LJx9OogDtr8+j5cBBbKU
+tMngLhelsMtSCTam9c0jS6BsQfHDy3W1CQNl30cqd2RFjvygOSAqzdCMlBk0lLG
4Sq0eib1zJM+rE2OQs8SaAftyQ0kbElmM3XZB3ZiitaHgkUKV7kt9dMUG9Vfn2SE
/ycgR9eoBq5P6FKhyROmL+0g824nYJmn4fwpecw1rHoRr8jyZjvpbE3VSL0aI5XW
w4UYsAQSTcqM6bHpNYuu
=G02h
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-4.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM changes for 4.3-rc2
- Fix timer interrupt injection after the rework
that went in during the merge window
- Reset the timer to zero on reboot
- Make sure the TCR_EL2 RES1 bits are really set to 1
- Fix a PSCI affinity bug for non-existing vcpus
If there is already some polling ongoing, it's impossible to disable the
polling, since as soon as somebody sets halt_poll_ns to 0, polling will
never stop, as grow and shrink are only handled if halt_poll_ns is != 0.
This patch fix it by reset vcpu->halt_poll_ns in order to stop polling
when polling is disabled.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge third patch-bomb from Andrew Morton:
- even more of the rest of MM
- lib/ updates
- checkpatch updates
- small changes to a few scruffy filesystems
- kmod fixes/cleanups
- kexec updates
- a dma-mapping cleanup series from hch
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (81 commits)
dma-mapping: consolidate dma_set_mask
dma-mapping: consolidate dma_supported
dma-mapping: cosolidate dma_mapping_error
dma-mapping: consolidate dma_{alloc,free}_noncoherent
dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}
mm: use vma_is_anonymous() in create_huge_pmd() and wp_huge_pmd()
mm: make sure all file VMAs have ->vm_ops set
mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()
mm: mark most vm_operations_struct const
namei: fix warning while make xmldocs caused by namei.c
ipc: convert invalid scenarios to use WARN_ON
zlib_deflate/deftree: remove bi_reverse()
lib/decompress_unlzma: Do a NULL check for pointer
lib/decompressors: use real out buf size for gunzip with kernel
fs/affs: make root lookup from blkdev logical size
sysctl: fix int -> unsigned long assignments in INT_MIN case
kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
kexec: align crash_notes allocation to make it be inside one physical page
kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
kexec: split kexec_load syscall from kexec core code
...
In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not only
in primary, but also in secondary ptes. The latter is required in order
to estimate wss of KVM VMs. At the same time we want to avoid flushing
tlb, because it is quite expensive and it won't really affect the final
result.
Currently, there is no function for clearing pte young bit that would meet
our requirements, so this patch introduces one. To achieve that we have
to add a new mmu-notifier callback, clear_young, since there is no method
for testing-and-clearing a secondary pte w/o flushing tlb. The new method
is not mandatory and currently only implemented by KVM.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We were taking the exit path after checking ue->flags and return value
of setup_routing_entry(), but 'e' was not freed incase of a failure.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tracepoint for dynamic halt_pool_ns, fired on every potential change.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is a downside of always-poll since poll is still happened for idle
vCPUs which can waste cpu usage. This patchset add the ability to adjust
halt_poll_ns dynamically, to grow halt_poll_ns when shot halt is detected,
and to shrink halt_poll_ns when long halt is detected.
There are two new kernel parameters for changing the halt_poll_ns:
halt_poll_ns_grow and halt_poll_ns_shrink.
no-poll always-poll dynamic-poll
-----------------------------------------------------------------------
Idle (nohz) vCPU %c0 0.15% 0.3% 0.2%
Idle (250HZ) vCPU %c0 1.1% 4.6%~14% 1.2%
TCP_RR latency 34us 27us 26.7us
"Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
guest was tickless. (250HZ) means the guest was ticking at 250HZ.
The big win is with ticking operating systems. Running the linux guest
with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
to no-polling overhead levels by using the dynamic-poll. The savings
should be even higher for higher frequency ticks.
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Simplify the patch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change halt_poll_ns into per-VCPU variable, seeded from module parameter,
to allow greater flexibility.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Provide a better quality of implementation and be architecture compliant
on ARMv7 for the architected timer by resetting the CNTV_CTL to 0 on
reset of the timer.
This change alone fixes the UEFI reset issue reported by Laszlo back in
February.
Cc: Laszlo Ersek <lersek@redhat.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Drew Jones <drjones@redhat.com>
Cc: Wei Huang <wei@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
We currently set the physical active state only when we *inject* a new
pending virtual interrupt, but this is actually not correct, because we
could have been preempted and run something else on the system that
resets the active state to clear. This causes us to run the VM with the
timer set to fire, but without setting the physical active state.
The solution is to always check the LR configurations, and we if have a
mapped interrupt in the LR in either the pending or active state
(virtual), then set the physical active state.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Highlights for KVM PPC this time around:
- Book3S: A few bug fixes
- Book3S: Allow micro-threading on POWER8
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJV2D6aAAoJECszeR4D/txggBMP/3nHD3UjEAFUhhA6VjfK2wNw
IW2aXQ5+2T51l1K8iSGMyKpW2w4zG5Bv9LdBP2badhaVpgM4//nVf7kcEBrdhjYq
ns7V3klzTuNY5RBbWZz3Zri0mgCkJVF1XlC3xBzGPSNKpZyrkORhlxfg5GXig8lj
pvUcku7XgkCFabAIIZmf0pg9hpDHpH3k1G9yZxuA8pys951IPRoo1CgsYmWSbmzh
jfA2CxBl10dHZOuk/ENyJveJgtthmBB4ezCbWXy+wcMzBKhMC5R93LUoiKXMLWpM
HkziNGjHA1gFSxDtfUVgkcXfan3a5JmlC+u50dLCTetXOVL7m2beIiXwv3smfjLn
AkpcChceEChxn0MxwKJjNvU+RVh3kmv8rklfPlBXHTtQ5ZSXxlcxYrmgL64stmrt
e27dzvJd9J7KX6wEpNyuZINsmFyn3lM3IoxqmSsVCRd43fzhZt9QGcYEXMIe1+lb
E7QncsYMuuWB/sfSieyPaXtmK5ym2+R220xlKezBZdzWdtisPrpCRyl7BdiqCj6O
1gROi6qEyj3m5Qw/eGbFKBF0d8oVXqo1wBJkbihMl55D+jMeZMk673aeGhno8au1
kH+Im+H5xU3oEzdqvC9y3c9kE2sRkzj43GjepIb86Y463fg6KQ5j2gbZUZolGsGH
AnRSGcbbVer/q+9kymPw
=t+9t
-----END PGP SIGNATURE-----
Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm-queue
Patch queue for ppc - 2015-08-22
Highlights for KVM PPC this time around:
- Book3S: A few bug fixes
- Book3S: Allow micro-threading on POWER8
In order to remove the crude hack where we sneak the masked bit
into the timer's control register, make use of the phys_irq_map
API control the active state of the interrupt.
This causes some limited changes to allow for potential error
propagation.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Virtual interrupts mapped to a HW interrupt should only be triggered
from inside the kernel. Otherwise, you could end up confusing the
kernel (and the GIC's) state machine.
Rearrange the injection path so that kvm_vgic_inject_irq is
used for non-mapped interrupts, and kvm_vgic_inject_mapped_irq is
used for mapped interrupts. The latter should only be called from
inside the kernel (timer, irqfd).
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
In order to control the active state of an interrupt, introduce
a pair of accessors allowing the state to be set/queried.
This only affects the logical state, and the HW state will only be
applied at world-switch time.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
To allow a HW interrupt to be injected into a guest, we lookup the
guest virtual interrupt in the irq_phys_map list, and if we have
a match, encode both interrupts in the LR.
We also mark the interrupt as "active" at the host distributor level.
On guest EOI on the virtual interrupt, the host interrupt will be
deactivated.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
In order to be able to feed physical interrupts to a guest, we need
to be able to establish the virtual-physical mapping between the two
worlds.
The mappings are kept in a set of RCU lists, indexed by virtual interrupts.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
We only set the irq_queued flag for level interrupts, meaning
that "!vgic_irq_is_queued(vcpu, irq)" is a good enough predicate
for all interrupts.
This will allow us to inject edge HW interrupts, for which the
state ACTIVE+PENDING is not allowed.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Now that struct vgic_lr supports the LR_HW bit and carries a hwirq
field, we can encode that information into the list registers.
This patch provides implementations for both GICv2 and GICv3.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
If there are no assigned devices, the guest PAT are not providing
any useful information and can be overridden to writeback; VMX
always does this because it has the "IPAT" bit in its extended
page table entries, but SVM does not have anything similar.
Hook into VFIO and legacy device assignment so that they
provide this information to KVM.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 1cde2930e1 ("sched/preempt: Add static_key() to preempt_notifiers")
had two problems. First, the preempt-notifier API needs to sleep with the
addition of the static_key, we do however need to hold off preemption
while modifying the preempt notifier list, otherwise a preemption could
observe an inconsistent list state. KVM correctly registers and
unregisters preempt notifiers with preemption disabled, so the sleep
caused dmesg splats.
Second, KVM registers and unregisters preemption notifiers very often
(in vcpu_load/vcpu_put). With a single uniprocessor guest the static key
would move between 0 and 1 continuously, hitting the slow path on every
userspace exit.
To fix this, wrap the static_key inc/dec in a new API, and call it from
KVM.
Fixes: 1cde2930e1 ("sched/preempt: Add static_key() to preempt_notifiers")
Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com>
Reported-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- CPU ops and PSCI (Power State Coordination Interface) refactoring
following the merging of the arm64 ACPI support, together with
handling of Trusted (secure) OS instances
- Using fixmap for permanent FDT mapping, removing the initial dtb
placement requirements (within 512MB from the start of the kernel
image). This required moving the FDT self reservation out of the
memreserve processing
- Idmap (1:1 mapping used for MMU on/off) handling clean-up
- Removing flush_cache_all() - not safe on ARM unless the MMU is off.
Last stages of CPU power down/up are handled by firmware already
- "Alternatives" (run-time code patching) refactoring and support for
immediate branch patching, GICv3 CPU interface access
- User faults handling clean-up
And some fixes:
- Fix for VDSO building with broken ELF toolchains
- Fixing another case of init_mm.pgd usage for user mappings (during
ASID roll-over broadcasting)
- Fix for FPSIMD reloading after CPU hotplug
- Fix for missing syscall trace exit
- Workaround for .inst asm bug
- Compat fix for switching the user tls tpidr_el0 register
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVitZgAAoJEGvWsS0AyF7x+ToP/0Yci5bNsYVwVay8N1rK6WHh
SGzDMzyxcSBjQpz2IhhTJ28eTAEH+a+HWQms+0tBehjqxqkvjuzBN0okDkc/z8NB
7Z0BV2aLkQcMwTbjgIh5jm25ZpGmvmvbWPD5oBwgmgQ4v4i1OLRKgx7+YQ+z9rWZ
zC70d0UwyWjs2oxmjd2ZrAkps7v/qozEFhcRHxLzCn8Mbw+3FcTQsqMbfnoWGnH0
YuGfHQQqBY4/HC7uAslMCy7tXeJXqb+NsgrnAovjfEbHGDjsg0KNl06K++LHwE37
A5Noa3M0AQEPYqx/sg0Ec8RNUUEMB4RA2DCaibp7XlVGncXOwFfiyk/M5uVrYXIO
ku5QF0ytUfZKzrMq3yQKbEDuCPOFTqjjdVpkeXKFdW66zYTohKVc3vUBV5xHZ5uO
7Kr8H0ZnhAv3OxPyKdEwAuQ5sJdWwQSvZyGClxMUO4eC/UzD0Mwwf1Y8WYtiAXx+
NSTeBKw/m33W3/KhNuNH1+qGEOKhuXuKX7AcYA84Rab8ytxYWcurHCG2bmhzpEse
2DZtNMybrP/HMQPyJlYgGac8B3QbsAIAkkU1f+dJTAv9otuBDhscaDQyZ9Y6WVht
/k8zJiZeMEuGAmwgTkzLmWs/8pQq42nW4J4eQdXPZAwp4ghCIypPWfaZASAwee6/
p+es3v8P4k9wkv2TFZMh
=YeGl
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Mostly refactoring/clean-up:
- CPU ops and PSCI (Power State Coordination Interface) refactoring
following the merging of the arm64 ACPI support, together with
handling of Trusted (secure) OS instances
- Using fixmap for permanent FDT mapping, removing the initial dtb
placement requirements (within 512MB from the start of the kernel
image). This required moving the FDT self reservation out of the
memreserve processing
- Idmap (1:1 mapping used for MMU on/off) handling clean-up
- Removing flush_cache_all() - not safe on ARM unless the MMU is off.
Last stages of CPU power down/up are handled by firmware already
- "Alternatives" (run-time code patching) refactoring and support for
immediate branch patching, GICv3 CPU interface access
- User faults handling clean-up
And some fixes:
- Fix for VDSO building with broken ELF toolchains
- Fix another case of init_mm.pgd usage for user mappings (during
ASID roll-over broadcasting)
- Fix for FPSIMD reloading after CPU hotplug
- Fix for missing syscall trace exit
- Workaround for .inst asm bug
- Compat fix for switching the user tls tpidr_el0 register"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (42 commits)
arm64: use private ratelimit state along with show_unhandled_signals
arm64: show unhandled SP/PC alignment faults
arm64: vdso: work-around broken ELF toolchains in Makefile
arm64: kernel: rename __cpu_suspend to keep it aligned with arm
arm64: compat: print compat_sp instead of sp
arm64: mm: Fix freeing of the wrong memmap entries with !SPARSEMEM_VMEMMAP
arm64: entry: fix context tracking for el0_sp_pc
arm64: defconfig: enable memtest
arm64: mm: remove reference to tlb.S from comment block
arm64: Do not attempt to use init_mm in reset_context()
arm64: KVM: Switch vgic save/restore to alternative_insn
arm64: alternative: Introduce feature for GICv3 CPU interface
arm64: psci: fix !CONFIG_HOTPLUG_CPU build warning
arm64: fix bug for reloading FPSIMD state after CPU hotplug.
arm64: kernel thread don't need to save fpsimd context.
arm64: fix missing syscall trace exit
arm64: alternative: Work around .inst assembler bugs
arm64: alternative: Merge alternative-asm.h into alternative.h
arm64: alternative: Allow immediate branch as alternative instruction
arm64: Rework alternate sequence for ARM erratum 845719
...
The allocation size of the kvm_irq_routing_table depends on
the number of irq routing entries because they are all
allocated with one kzalloc call.
When the irq routing table gets bigger this requires high
order allocations which fail from time to time:
qemu-kvm: page allocation failure: order:4, mode:0xd0
This patch fixes this issue by breaking up the allocation of
the table and its entries into individual kzalloc calls.
These could all be satisfied with order-0 allocations, which
are less likely to fail.
The downside of this change is the lower performance, because
of more calls to kzalloc. But given how often kvm_set_irq_routing
is called in the lifetime of a guest, it doesn't really
matter much.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
[Avoid sparse warning through rcu_access_pointer. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Back in the days, vgic.c used to have an intimate knowledge of
the actual GICv2. These days, this has been abstracted away into
hardware-specific backends.
Remove the now useless arm-gic.h #include directive, making it
clear that GICv2 specific code doesn't belong here.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Commit fd1d0ddf2a (KVM: arm/arm64: check IRQ number on userland
injection) rightly limited the range of interrupts userspace can
inject in a guest, but failed to consider the (unlikely) case where
a guest is configured with 1024 interrupts.
In this case, interrupts ranging from 1020 to 1023 are unuseable,
as they have a special meaning for the GIC CPU interface.
Make sure that these number cannot be used as an IRQ. Also delete
a redundant (and similarily buggy) check in kvm_set_irq.
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: <stable@vger.kernel.org> # 4.1, 4.0, 3.19, 3.18
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
If a GICv3-enabled guest tries to configure Group0, we print a
warning on the console (because we don't support Group0 interrupts).
This is fairly pointless, and would allow a guest to spam the
console. Let's just drop the warning.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
So far, we configured the world-switch by having a small array
of pointers to the save and restore functions, depending on the
GIC used on the platform.
Loading these values each time is a bit silly (they never change),
and it makes sense to rely on the instruction patching instead.
This leads to a nice cleanup of the code.
Acked-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Commit 47a98b15ba ("arm/arm64: KVM: support for un-queuing active
IRQs") introduced handling of the GICD_I[SC]ACTIVER registers,
but only for the GICv2 emulation. For the sake of completeness and
as this is a pre-requisite for save/restore of the GICv3 distributor
state, we should also emulate their handling in the distributor and
redistributor frames of an emulated GICv3.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Only two ioctls have to be modified; the address space id is
placed in the higher 16 bits of their slot id argument.
As of this patch, no architecture defines more than one
address space; x86 will be the first.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We need to hide SMRAM from guests not running in SMM. Therefore, all
uses of kvm_read_guest* and kvm_write_guest* must be changed to use
different address spaces, depending on whether the VCPU is in system
management mode. We need to introduce a new family of functions for
this purpose.
For now, the VCPU-based functions have the same behavior as the
existing per-VM ones, they just accept a different type for the
first argument. Later however they will be changed to use one of many
"struct kvm_memslots" stored in struct kvm, through an architecture hook.
VM-based functions will unconditionally use the first memslots pointer.
Whenever possible, this patch introduces slot-based functions with an
__ prefix, with two wrappers for generic and vcpu-based actions.
The exceptions are kvm_read_guest and kvm_write_guest, which are copied
into the new functions kvm_vcpu_read_guest and kvm_vcpu_write_guest.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Most of the function that wrap it can be rewritten without it, except
for gfn_to_pfn_prot. Just inline it into gfn_to_pfn_prot, and rewrite
the other function on top of gfn_to_pfn_memslot*.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The memory slot is already available from gfn_to_memslot_dirty_bitmap.
Isn't it a shame to look it up again? Plus, it makes gfn_to_page_many_atomic
agnostic of multiple VCPU address spaces.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot. It will simplify the code when more
than one address space will be supported.
Unfortunately, the "const"ness of the new argument must be casted
away in two places. Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents. Add const to
enforce this.
In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_memslots provides lockdep checking. Use it consistently instead of
explicit dereferencing of kvm->memslots.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_alloc_memslots is extracted out of previously scattered code
that was in kvm_init_memslots_id and kvm_create_vm.
kvm_free_memslot and kvm_free_memslots are new names of
kvm_free_physmem and kvm_free_physmem_slot, but they also take
an explicit pointer to struct kvm_memslots.
This will simplify the transition to multiple address spaces,
each represented by one pointer to struct kvm_memslots.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
gfn_to_pfn_async is used in just one place, and because of x86-specific
treatment that place will need to look at the memory slot. Hence inline
it into try_async_pf and export __gfn_to_pfn_memslot.
The patch also switches the subsequent call to gfn_to_pfn_prot to use
__gfn_to_pfn_memslot. This is a small optimization. Finally, remove
the now-unused async argument of __gfn_to_pfn.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On cpu hotplug only KVM emits an unconditional message that its notifier
has been called. It certainly can be assumed that calling cpu hotplug
notifiers work, therefore there is no added value if KVM prints a message.
If an error happens on cpu online KVM will still emit a warning.
So let's remove this superfluous message.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Caching memslot value and using mark_page_dirty_in_slot() avoids another
O(log N) search when dirtying the page.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1428695247-27603-1-git-send-email-rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rather small this time:
- a fix for a nasty bug with virtual IRQ injection
- a fix for irqfd
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVN7ecAAoJECPQ0LrRPXpDmcEP/jTcP89kPm8r7xk0YggalpQT
4An+Nx5dnumO7ZJMVZM5eqBuXT3C2tvgKdz1VpEBr9J7F31QBEANCXd9P6smEhXI
/0GuJ82tz+VC/XaVFUXsmtSW0Kkt7dIEJ29OCHdSjchxsM3C4X175IMhtVvKDGOc
qVvHJllS4WBddNRTrWuk5su62QmEra6TgQMdXPabtcVcUYgQ3DLHnp/rvlYFsCfD
MvZKCrlqAHNftzxW4fcimPdaToUWVNmKADm0ejujIxDRfNBH/6w7vmcFkNICAyqa
Q/UgVKpVxa/baRr6jKyXh32S0ewQSki2OViIijs45tHXlSQMgUnOQ9ZMKkgR215q
J5QsQFDJ5QjVth42CX3jAN6+iZFO1QX7xMBCutjg/If8rL+Cv5zKEGPzv4mOjt2M
slSvgXo0Rv2c/o+7XOz15rCIW8Io4+MA2y1M6htZ8UCVAmQCGwkvlIo/9ZU2b8Hl
NGnA+C8thXtrsMxjLfNWFQBGlBHKWuptTpB0qypws0C/KSXG3LM/R41q+WXqmJKy
UZJWgC75ITeRle9PAHSoIzdDiH+hNhhUUyRRruLXoses98lIgUyuTvHNFhPHjIfC
V/S6Ai8lBAMk/jBw/iMH1G+ubw2i2ZzFVw132NQrpucPchZvZ1DYMPA1tZEQT7Tl
QkfD5j64y5qiQWbbqfZL
=Cfx3
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-4.1-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM changes for v4.1, take #2:
Rather small this time:
- a fix for a nasty bug with virtual IRQ injection
- a fix for irqfd
When userland injects a SPI via the KVM_IRQ_LINE ioctl we currently
only check it against a fixed limit, which historically is set
to 127. With the new dynamic IRQ allocation the effective limit may
actually be smaller (64).
So when now a malicious or buggy userland injects a SPI in that
range, we spill over on our VGIC bitmaps and bytemaps memory.
I could trigger a host kernel NULL pointer dereference with current
mainline by injecting some bogus IRQ number from a hacked kvmtool:
-----------------
....
DEBUG: kvm_vgic_inject_irq(kvm, cpu=0, irq=114, level=1)
DEBUG: vgic_update_irq_pending(kvm, cpu=0, irq=114, level=1)
DEBUG: IRQ #114 still in the game, writing to bytemap now...
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ffffffc07652e000
[00000000] *pgd=00000000f658b003, *pud=00000000f658b003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 1053 Comm: lkvm-msi-irqinj Not tainted 4.0.0-rc7+ #3027
Hardware name: FVP Base (DT)
task: ffffffc0774e9680 ti: ffffffc0765a8000 task.ti: ffffffc0765a8000
PC is at kvm_vgic_inject_irq+0x234/0x310
LR is at kvm_vgic_inject_irq+0x30c/0x310
pc : [<ffffffc0000ae0a8>] lr : [<ffffffc0000ae180>] pstate: 80000145
.....
So this patch fixes this by checking the SPI number against the
actual limit. Also we remove the former legacy hard limit of
127 in the ioctl code.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
CC: <stable@vger.kernel.org> # 4.0, 3.19, 3.18
[maz: wrap KVM_ARM_IRQ_GIC_MAX with #ifndef __KERNEL__,
as suggested by Christopher Covington]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
irqfd/arm curently does not support routing. kvm_irq_map_gsi is
supposed to return all the routing entries associated with the
provided gsi and return the number of those entries. We should
return 0 at this point.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This creates a debugfs directory for each HV guest (assuming debugfs
is enabled in the kernel config), and within that directory, a file
by which the contents of the guest's HPT (hashed page table) can be
read. The directory is named vmnnnn, where nnnn is the PID of the
process that created the guest. The file is named "htab". This is
intended to help in debugging problems in the host's management
of guest memory.
The contents of the file consist of a series of lines like this:
3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196
The first field is the index of the entry in the HPT, the second and
third are the HPT entry, so the third entry contains the real page
number that is mapped by the entry if the entry's valid bit is set.
The fourth field is the guest's view of the second doubleword of the
entry, so it contains the guest physical address. (The format of the
second through fourth fields are described in the Power ISA and also
in arch/powerpc/include/asm/mmu-hash64.h.)
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
ARM/ARM64: fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390: interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS: FPU and MIPS SIMD Architecture (MSA) support. Includes some patches
from Ralf Baechle's MIPS tree.
x86: bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVJ9vmAAoJEL/70l94x66DoMEH/R3rh8IMf4jTiWRkcqohOMPX
k1+NaSY/lCKayaSgggJ2hcQenMbQoXEOdslvaA/H0oC+VfJGK+lmU6E63eMyyhjQ
Y+Px6L85NENIzDzaVu/TIWWuhil5PvIRr3VO8cvntExRoCjuekTUmNdOgCvN2ObW
wswN2qRdPIeEj2kkulbnye+9IV4G0Ne9bvsmUdOdfSSdi6ZcV43JcvrpOZT++mKj
RrKB+3gTMZYGJXMMLBwMkdl8mK1ozriD+q0mbomT04LUyGlPwYLl4pVRDBqyksD7
KsSSybaK2E4i5R80WEljgDMkNqrCgNfg6VZe4n9Y+CfAAOToNnkMJaFEi+yuqbs=
=yu2b
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.1
The most interesting bit here is irqfd/ioeventfd support for ARM and
ARM64.
Summary:
ARM/ARM64:
fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390:
interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS:
FPU and MIPS SIMD Architecture (MSA) support. Includes some
patches from Ralf Baechle's MIPS tree.
x86:
bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
KVM: use slowpath for cross page cached accesses
kvm: mmu: lazy collapse small sptes into large sptes
KVM: x86: Clear CR2 on VCPU reset
KVM: x86: DR0-DR3 are not clear on reset
KVM: x86: BSP in MSR_IA32_APICBASE is writable
KVM: x86: simplify kvm_apic_map
KVM: x86: avoid logical_map when it is invalid
KVM: x86: fix mixed APIC mode broadcast
KVM: x86: use MDA for interrupt matching
kvm/ppc/mpic: drop unused IRQ_testbit
KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
KVM: vmx: pass error code with internal error #2
x86: vdso: fix pvclock races with task migration
KVM: remove kvm_read_hva and kvm_read_hva_atomic
KVM: x86: optimize delivery of TSC deadline timer interrupt
KVM: x86: extract blocking logic from __vcpu_run
kvm: x86: fix x86 eflags fixed bit
KVM: s390: migrate vcpu interrupt state
...
kvm_write_guest_cached() does not mark all written pages as dirty and
code comments in kvm_gfn_to_hva_cache_init() talk about NULL memslot
with cross page accesses. Fix all the easy way.
The check is '<= 1' to have the same result for 'len = 0' cache anywhere
in the page. (nr_pages_needed is 0 on page boundary.)
Fixes: 8f964525a1 ("KVM: Allow cross page reads and writes from cached translations.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <20150408121648.GA3519@potion.brq.redhat.com>
Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The corresponding write functions just use __copy_to_user. Do the
same on the read side.
This reverts what's left of commit 86ab8cffb4 (KVM: introduce
gfn_to_hva_read/kvm_read_hva/kvm_read_hva_atomic, 2012-08-21)
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1427976500-28533-1-git-send-email-pbonzini@redhat.com>
1. Assorted changes
1.1 allow more feature bits for the guest
1.2 Store breaking event address on program interrupts
2. Interrupt handling rework
2.1 Fix copy_to_user while holding a spinlock (cc stable)
2.2 Rework floating interrupts to follow the priorities
2.3 Allow to inject all local interrupts via new ioctl
2.4 allow to get/set the full local irq state, e.g. for migration
and introspection
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJVGvEUAAoJEBF7vIC1phx82tEP/3KrwsDRs+buiBqyv9k+qCFV
v+R94gReBB5ggfbGfUYgBJMR2/4XQ+0jcZ55jfBCC4osOq6Juw/8HIj2nSgbQHmz
F9Go0n8IqJ3DnqPTc0KYdFZ7kqDvMV5ME3XJrFiAHv1TUL9H/KpZArkcVIwD2NOo
w01AVrCDY4bTajYqKShzGFymQl1K5vTGGvgxhh4kAHct4Nt5N5HFmyROm0RrsFZx
Sycx4t177O7zhCN2tv5Zy8iWaEvzHAESoXkhZ2cJ6t+FXii2Eov5IgyyfYRXBfbm
YACyvlFD087UdFGTt85ggPVS/S/5hn9xXmVHuIimHeyZU7CXCN5vYPcn+ZyksYr5
uA8+/2OPAgcaeDa2f7nCjl8jmcLR3hkQ0n/urA+pPYAZANJoFDfiGOr/kVk6aKff
JTGSFUjNK891/IGEsdrSk2p64U5xMd8LFa3Il++kZT91gc2nrZOHNz5FGlXlkLdJ
sADeNFWhoprEt/2P4aX6W2j26L8G874XkldDSjrS41U8L55+IiEm09r8oAWgfc5A
pryeDaN4nSjFC+HOtlPkcVkAcsswiI6nHIm3+/XFetCq+v4pnVKFMHWsTeEjiQgQ
H5aV9mfEKTJaCPrAJMsj8ZsKq0usG+BeRNqpIvxPAQB8fyl3jw9iu+RHeY1xWsTg
BRHB/+CGYIxDu4XdRexv
=Rrx5
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-20150331' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
Features and fixes for 4.1 (kvm/next)
1. Assorted changes
1.1 allow more feature bits for the guest
1.2 Store breaking event address on program interrupts
2. Interrupt handling rework
2.1 Fix copy_to_user while holding a spinlock (cc stable)
2.2 Rework floating interrupts to follow the priorities
2.3 Allow to inject all local interrupts via new ioctl
2.4 allow to get/set the full local irq state, e.g. for migration
and introspection
Fixes page refcounting issues in our Stage-2 page table management code,
fixes a missing unlock in a gicv3 error path, and fixes a race that can
cause lost interrupts if signals are pending just prior to entering the
guest.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVBs95AAoJEEtpOizt6ddyAfoH/Rwj2T38ZDikImPpgfeFrmJs
ZlWC+Z3akwjVHPv308/MsKdyashtA7OjiMp3DOheMFMYJay/ecgY/92vFCc6uh5S
LDoXCbp+Pneth6C6bbU2Gw+aoCD07ZYCn9PeFq40MfpQUhCEGWhx41OFzHppqOZx
e+jodHRE+sBVTFUtbz+HubAfcM46f/8bP7682CEKsVZPeTSiHyeojdZEglfB37MG
ar/iC1/cyO/097vWaBqv1t1WZoHbWmMrDlzo5X+AtayVXFNdv4Ztw0Rz2kRhnLB8
8GXYawoSQoTN8FX1oyTyr5YWcWD7wDTzhcHsHS1xZHhvrdLCEcFrHeEWkuUlYjU=
=YS6j
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-fixes-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into 'kvm-next'
Fixes for KVM/ARM for 4.0-rc5.
Fixes page refcounting issues in our Stage-2 page table management code,
fixes a missing unlock in a gicv3 error path, and fixes a race that can
cause lost interrupts if signals are pending just prior to entering the
guest.
We have introduced struct kvm_s390_irq a while ago which allows to
inject all kinds of interrupts as defined in the Principles of
Operation.
Add ioctl to inject interrupts with the extended struct kvm_s390_irq
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Currently we have struct kvm_exit_mmio for encapsulating MMIO abort
data to be passed on from syndrome decoding all the way down to the
VGIC register handlers. Now as we switch the MMIO handling to be
routed through the KVM MMIO bus, it does not make sense anymore to
use that structure already from the beginning. So we keep the data in
local variables until we put them into the kvm_io_bus framework.
Then we fill kvm_exit_mmio in the VGIC only, making it a VGIC private
structure. On that way we replace the data buffer in that structure
with a pointer pointing to a single location in a local variable, so
we get rid of some copying on the way.
With all of the virtual GIC emulation code now being registered with
the kvm_io_bus, we can remove all of the old MMIO handling code and
its dispatching functionality.
I didn't bother to rename kvm_exit_mmio (to vgic_mmio or something),
because that touches a lot of code lines without any good reason.
This is based on an original patch by Nikolay.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Cc: Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Using the framework provided by the recent vgic.c changes, we
register a kvm_io_bus device on mapping the virtual GICv3 resources.
The distributor mapping is pretty straight forward, but the
redistributors need some more love, since they need to be tagged with
the respective redistributor (read: VCPU) they are connected with.
We use the kvm_io_bus framework to register one devices per VCPU.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Currently we handle the redistributor registers in two separate MMIO
regions, one for the overall behaviour and SPIs and one for the
SGIs/PPIs. That latter forces the creation of _two_ KVM I/O bus
devices for each redistributor.
Since the spec mandates those two pages to be contigious, we could as
well merge them and save the churn with the second KVM I/O bus device.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Using the framework provided by the recent vgic.c changes we register
a kvm_io_bus device when initializing the virtual GICv2.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Currently we use a lot of VGIC specific code to do the MMIO
dispatching.
Use the previous reworks to add kvm_io_bus style MMIO handlers.
Those are not yet called by the MMIO abort handler, also the actual
VGIC emulator function do not make use of it yet, but will be enabled
with the following patches.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The vgic_find_range() function in vgic.c takes a struct kvm_exit_mmio
argument, but actually only used the length field in there. Since we
need to get rid of that structure in that part of the code anyway,
let's rework the function (and it's callers) to pass the length
argument to the function directly.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The name "kvm_mmio_range" is a bit bold, given that it only covers
the VGIC's MMIO ranges. To avoid confusion with kvm_io_range, rename
it to vgic_io_range.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
iodev.h contains definitions for the kvm_io_bus framework. This is
needed both by the generic KVM code in virt/kvm as well as by
architecture specific code under arch/. Putting the header file in
virt/kvm and using local includes in the architecture part seems at
least dodgy to me, so let's move the file into include/kvm, so that a
more natural "#include <kvm/iodev.h>" can be used by all of the code.
This also solves a problem later when using struct kvm_io_device
in arm_vgic.h.
Fixing up the FSF address in the GPL header and a wrong include path
on the way.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This is needed in e.g. ARM vGIC emulation, where the MMIO handling
depends on the VCPU that does the access.
Signed-off-by: Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
KVM guest can fail to startup with following trace on host:
qemu-system-x86: page allocation failure: order:4, mode:0x40d0
Call Trace:
dump_stack+0x47/0x67
warn_alloc_failed+0xee/0x150
__alloc_pages_direct_compact+0x14a/0x150
__alloc_pages_nodemask+0x776/0xb80
alloc_kmem_pages+0x3a/0x110
kmalloc_order+0x13/0x50
kmemdup+0x1b/0x40
__kvm_set_memory_region+0x24a/0x9f0 [kvm]
kvm_set_ioapic+0x130/0x130 [kvm]
kvm_set_memory_region+0x21/0x40 [kvm]
kvm_vm_ioctl+0x43f/0x750 [kvm]
Failure happens when attempting to allocate pages for
'struct kvm_memslots', however it doesn't have to be
present in physically contiguous (kmalloc-ed) address
space, change allocation to kvm_kvzalloc() so that
it will be vmalloc-ed when its size is more then a page.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When all bits in mask are not set,
kvm_arch_mmu_enable_log_dirty_pt_masked() has nothing to do. But since
it needs to be called from the generic code, it cannot be inlined, and
a few function calls, two when PML is enabled, are wasted.
Since it is common to see many pages remain clean, e.g. framebuffers can
stay calm for a long time, it is worth eliminating this overhead.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Fixes page refcounting issues in our Stage-2 page table management code,
fixes a missing unlock in a gicv3 error path, and fixes a race that can
cause lost interrupts if signals are pending just prior to entering the
guest.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVBs95AAoJEEtpOizt6ddyAfoH/Rwj2T38ZDikImPpgfeFrmJs
ZlWC+Z3akwjVHPv308/MsKdyashtA7OjiMp3DOheMFMYJay/ecgY/92vFCc6uh5S
LDoXCbp+Pneth6C6bbU2Gw+aoCD07ZYCn9PeFq40MfpQUhCEGWhx41OFzHppqOZx
e+jodHRE+sBVTFUtbz+HubAfcM46f/8bP7682CEKsVZPeTSiHyeojdZEglfB37MG
ar/iC1/cyO/097vWaBqv1t1WZoHbWmMrDlzo5X+AtayVXFNdv4Ztw0Rz2kRhnLB8
8GXYawoSQoTN8FX1oyTyr5YWcWD7wDTzhcHsHS1xZHhvrdLCEcFrHeEWkuUlYjU=
=YS6j
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-fixes-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
Fixes for KVM/ARM for 4.0-rc5.
Fixes page refcounting issues in our Stage-2 page table management code,
fixes a missing unlock in a gicv3 error path, and fixes a race that can
cause lost interrupts if signals are pending just prior to entering the
guest.
When a VCPU is no longer running, we currently check to see if it has a
timer scheduled in the future, and if it does, we schedule a host
hrtimer to notify is in case the timer expires while the VCPU is still
not running. When the hrtimer fires, we mask the guest's timer and
inject the timer IRQ (still relying on the guest unmasking the time when
it receives the IRQ).
This is all good and fine, but when migration a VM (checkpoint/restore)
this introduces a race. It is unlikely, but possible, for the following
sequence of events to happen:
1. Userspace stops the VM
2. Hrtimer for VCPU is scheduled
3. Userspace checkpoints the VGIC state (no pending timer interrupts)
4. The hrtimer fires, schedules work in a workqueue
5. Workqueue function runs, masks the timer and injects timer interrupt
6. Userspace checkpoints the timer state (timer masked)
At restore time, you end up with a masked timer without any timer
interrupts and your guest halts never receiving timer interrupts.
Fix this by only kicking the VCPU in the workqueue function, and sample
the expired state of the timer when entering the guest again and inject
the interrupt and mask the timer only then.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Migrating active interrupts causes the active state to be lost
completely. This implements some additional bitmaps to track the active
state on the distributor and export this to user space.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
This helps re-factor away some of the repetitive code and makes the code
flow more nicely.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
There is an interesting bug in the vgic code, which manifests itself
when the KVM run loop has a signal pending or needs a vmid generation
rollover after having disabled interrupts but before actually switching
to the guest.
In this case, we flush the vgic as usual, but we sync back the vgic
state and exit to userspace before entering the guest. The consequence
is that we will be syncing the list registers back to the software model
using the GICH_ELRSR and GICH_EISR from the last execution of the guest,
potentially overwriting a list register containing an interrupt.
This showed up during migration testing where we would capture a state
where the VM has masked the arch timer but there were no interrupts,
resulting in a hung test.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Reported-by: Alex Bennee <alex.bennee@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Add the missing unlock before return from function kvm_vgic_create()
in the error handling case.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
This patch enables irqfd on arm/arm64.
Both irqfd and resamplefd are supported. Injection is implemented
in vgic.c without routing.
This patch enables CONFIG_HAVE_KVM_EVENTFD and CONFIG_HAVE_KVM_IRQFD.
KVM_CAP_IRQFD is now advertised. KVM_CAP_IRQFD_RESAMPLE capability
automatically is advertised as soon as CONFIG_HAVE_KVM_IRQFD is set.
Irqfd injection is restricted to SPI. The rationale behind not
supporting PPI irqfd injection is that any device using a PPI would
be a private-to-the-CPU device (timer for instance), so its state
would have to be context-switched along with the VCPU and would
require in-kernel wiring anyhow. It is not a relevant use case for
irqfds.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
To prepare for irqfd addition, coarse grain locking is removed at
kvm_vgic_sync_hwstate level and finer grain locking is introduced in
vgic_process_maintenance only.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Introduce __KVM_HAVE_ARCH_INTC_INITIALIZED define and
associated kvm_arch_intc_initialized function. This latter
allows to test whether the virtual interrupt controller is initialized
and ready to accept virtual IRQ injection. On some architectures,
the virtual interrupt controller is dynamically instantiated, justifying
that kind of check.
The new function can now be used by irqfd to check whether the
virtual interrupt controller is ready on KVM_IRQFD request. If not,
KVM_IRQFD returns -EAGAIN.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Several dts only list "arm,cortex-a7-gic" or "arm,gic-400" in their GIC
compatible list, and while this is correct (and supported by the GIC
driver), KVM will fail to detect that it can support these cases.
This patch adds the missing strings to the VGIC code. The of_device_id
entries are padded to keep the probe function data aligned.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Michal Simek <monstr@monstr.eu>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
POWER supports irqfds but forgot to advertise them. Some userspace does
not check for the capability, but others check it---thus they work on
x86 and s390 but not POWER.
To avoid that other architectures in the future make the same mistake, let
common code handle KVM_CAP_IRQFD the same way as KVM_CAP_IRQFD_RESAMPLE.
Reported-and-tested-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Fixes: 297e21053a
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
ERROR: code indent should use tabs where possible
+ const struct kvm_io_range *r2)$
WARNING: please, no spaces at the start of a line
+ const struct kvm_io_range *r2)$
This patch fixes this ERROR & WARNING to reduce noise when checking new
patches in kvm_main.c.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
WARNING: please, no space before tabs
+ * ^I^Ikvm->lock --> kvm->slots_lock --> kvm->irq_lock$
WARNING: please, no space before tabs
+^I^I * ^I- gfn_to_hva (kvm_read_guest, gfn_to_pfn)$
WARNING: please, no space before tabs
+^I^I * ^I- kvm_is_visible_gfn (mmu_check_roots)$
This patch fixes these warnings to reduce noise when checking new
patches in kvm_main.c.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There are many Warnings like this:
WARNING: Missing a blank line after declarations
+ struct kvm_coalesced_mmio_zone zone;
+ r = -EFAULT;
This patch fixes these warnings to reduce noise when checking new
patches in kvm_main.c.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
WARNING: EXPORT_SYMBOL(foo); should immediately follow its
function/variable
+EXPORT_SYMBOL_GPL(gfn_to_page);
This patch fixes these warnings to reduce noise when checking new
patches in kvm_main.c.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
ERROR: do not initialise statics to 0 or NULL
+static int kvm_usage_count = 0;
The kvm_usage_count will be placed to .bss segment when linking, so
not need to set it to 0 here obviously.
This patch fixes this ERROR to reduce noise when checking new patches
in kvm_main.c.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
WARNING: labels should not be indented
+ out_free_irq_routing:
This patch fixes this WARNING to reduce noise when checking new patches
in kvm_main.c.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There are many WARNINGs like this:
WARNING: sizeof tr should be sizeof(tr)
+ if (copy_from_user(&tr, argp, sizeof tr))
In kvm_main.c many places are using 'sizeof(X)', and the other places
are using 'sizeof X', while the kernel recommands to use 'sizeof(X)',
so this patch will replace all 'sizeof X' to 'sizeof(X)' to make them
consistent and at the same time to reduce the WARNINGs noise when we
are checking new patches.
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_kvfree() provides exactly the same functionality as the
new common kvfree() function - so let's simply replace the
kvm function with the common function.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
halt_poll_ns is used only locally. Make it static.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Better alignment of loop using tabs rather than spaces, this
makes checkpatch.pl happier.
Signed-off-by: Kevin Mulvey <kmulvey@linux.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Common: Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other architectures).
This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes
or TCP_RR netperf tests). This also has to be enabled manually for now,
but the plan is to auto-tune this in the future.
ARM/ARM64: the highlights are support for GICv3 emulation and dirty page
tracking
s390: several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS: Bugfixes.
x86: Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested virtualization
improvements (nested APICv---a nice optimization), usual round of emulation
fixes. There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
ARM has other conflicts where functions are added in the same place
by 3.19-rc and 3.20 patches. These are not large though, and entirely
within KVM.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJU28rkAAoJEL/70l94x66DXqQH/1TDOfJIjW7P2kb0Sw7Fy1wi
cEX1KO/VFxAqc8R0E/0Wb55CXyPjQJM6xBXuFr5cUDaIjQ8ULSktL4pEwXyyv/s5
DBDkN65mriry2w5VuEaRLVcuX9Wy+tqLQXWNkEySfyb4uhZChWWHvKEcgw5SqCyg
NlpeHurYESIoNyov3jWqvBjr4OmaQENyv7t2c6q5ErIgG02V+iCux5QGbphM2IC9
LFtPKxoqhfeB2xFxTOIt8HJiXrZNwflsTejIlCl/NSEiDVLLxxHCxK2tWK/tUXMn
JfLD9ytXBWtNMwInvtFm4fPmDouv2VDyR0xnK2db+/axsJZnbxqjGu1um4Dqbak=
=7gdx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"Fairly small update, but there are some interesting new features.
Common:
Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other
architectures). This can improve latency up to 50% on some
scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This
also has to be enabled manually for now, but the plan is to
auto-tune this in the future.
ARM/ARM64:
The highlights are support for GICv3 emulation and dirty page
tracking
s390:
Several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS:
Bugfixes.
x86:
Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested
virtualization improvements (nested APICv---a nice optimization),
usual round of emulation fixes.
There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
Powerpc:
Nothing yet.
The KVM/PPC changes will come in through the PPC maintainers,
because I haven't received them yet and I might end up being
offline for some part of next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: ia64: drop kvm.h from installed user headers
KVM: x86: fix build with !CONFIG_SMP
KVM: x86: emulate: correct page fault error code for NoWrite instructions
KVM: Disable compat ioctl for s390
KVM: s390: add cpu model support
KVM: s390: use facilities and cpu_id per KVM
KVM: s390/CPACF: Choose crypto control block format
s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
KVM: s390: reenable LPP facility
KVM: s390: floating irqs: fix user triggerable endless loop
kvm: add halt_poll_ns module parameter
kvm: remove KVM_MMIO_SIZE
KVM: MIPS: Don't leak FPU/DSP to guest
KVM: MIPS: Disable HTW while in guest
KVM: nVMX: Enable nested posted interrupt processing
KVM: nVMX: Enable nested virtual interrupt delivery
KVM: nVMX: Enable nested apic register virtualization
KVM: nVMX: Make nested control MSRs per-cpu
KVM: nVMX: Enable nested virtualize x2apic mode
KVM: nVMX: Prepare for using hardware MSR bitmap
...
Use the more generic get_user_pages_unlocked which has the additional
benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault
(which allows the first page fault in an unmapped area to be always able
to block indefinitely by being allowed to release the mmap_sem).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We never had a 31bit QEMU/kuli running. We would need to review several
ioctls to check if this creates holes, bugs or whatever to make it work.
Lets just disable compat support for KVM on s390.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We don't have to write protect guest memory for dirty logging if architecture
supports hardware dirty logging, such as PML on VMX, so rename it to be more
generic.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Indeed, any invalid memslots should be new->npages = 0,
new->base_gfn = 0 and new->flags = 0 at the same time.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The dirty patch logging series introduced both
HAVE_KVM_ARCH_DIRTY_LOG_PROTECT and KVM_GENERIC_DIRTYLOG_READ_PROTECT
config symbols, but only KVM_GENERIC_DIRTYLOG_READ_PROTECT is used.
Just remove the unused one.
(The config symbol was renamed during the development of the patch
series and the old name just creeped in by accident.()
Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Although the GIC architecture requires us to map the MMIO regions
only at page aligned addresses, we currently do not enforce this from
the kernel side.
Restrict any vGICv2 regions to be 4K aligned and any GICv3 regions
to be 64K aligned. Document this requirement.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
With all of the GICv3 code in place now we allow userland to ask the
kernel for using a virtual GICv3 in the guest.
Also we provide the necessary support for guests setting the memory
addresses for the virtual distributor and redistributors.
This requires some userland code to make use of that feature and
explicitly ask for a virtual GICv3.
Document that KVM_CREATE_IRQCHIP only works for GICv2, but is
considered legacy and using KVM_CREATE_DEVICE is preferred.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
With all the necessary GICv3 emulation code in place, we can now
connect the code to the GICv3 backend in the kernel.
The LR register handling is different depending on the emulated GIC
model, so provide different implementations for each.
Also allow non-v2-compatible GICv3 implementations (which don't
provide MMIO regions for the virtual CPU interface in the DT), but
restrict those hosts to support GICv3 guests only.
If the device tree provides a GICv2 compatible GICV resource entry,
but that one is faulty, just disable the GICv2 emulation and let the
user use at least the GICv3 emulation for guests.
To provide proper support for the legacy KVM_CREATE_IRQCHIP ioctl,
note virtual GICv2 compatibility in struct vgic_params and use it
on creating a VGICv2.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
While the generation of a (virtual) inter-processor interrupt (SGI)
on a GICv2 works by writing to a MMIO register, GICv3 uses the system
register ICC_SGI1R_EL1 to trigger them.
Add a trap handler function that calls the new SGI register handler
in the GICv3 code. As ICC_SRE_EL1.SRE at this point is still always 0,
this will not trap yet, but will only be used later when all the data
structures have been initialized properly.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
With everything separated and prepared, we implement a model of a
GICv3 distributor and redistributors by using the existing framework
to provide handler functions for each register group.
Currently we limit the emulation to a model enforcing a single
security state, with SRE==1 (forcing system register access) and
ARE==1 (allowing more than 8 VCPUs).
We share some of the functions provided for GICv2 emulation, but take
the different ways of addressing (v)CPUs into account.
Save and restore is currently not implemented.
Similar to the split-off of the GICv2 specific code, the new emulation
code goes into a new file (vgic-v3-emul.c).
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
For a GICv2 there is always only one (v)CPU involved: the one that
does the access. On a GICv3 the access to a CPU redistributor is
memory-mapped, but not banked, so the (v)CPU affected is determined by
looking at the MMIO address region being accessed.
To allow passing the affected CPU into the accessors later, extend
struct kvm_exit_mmio to add an opaque private pointer parameter.
The current GICv2 emulation just does not use it.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
vgic.c is currently a mixture of generic vGIC emulation code and
functions specific to emulating a GICv2. To ease the addition of
GICv3, split off strictly v2 specific parts into a new file
vgic-v2-emul.c.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
-------
As the diff isn't always obvious here (and to aid eventual rebases),
here is a list of high-level changes done to the code:
* added new file to respective arm/arm64 Makefiles
* moved GICv2 specific functions to vgic-v2-emul.c:
- handle_mmio_misc()
- handle_mmio_set_enable_reg()
- handle_mmio_clear_enable_reg()
- handle_mmio_set_pending_reg()
- handle_mmio_clear_pending_reg()
- handle_mmio_priority_reg()
- vgic_get_target_reg()
- vgic_set_target_reg()
- handle_mmio_target_reg()
- handle_mmio_cfg_reg()
- handle_mmio_sgi_reg()
- vgic_v2_unqueue_sgi()
- read_set_clear_sgi_pend_reg()
- write_set_clear_sgi_pend_reg()
- handle_mmio_sgi_set()
- handle_mmio_sgi_clear()
- vgic_v2_handle_mmio()
- vgic_get_sgi_sources()
- vgic_dispatch_sgi()
- vgic_v2_queue_sgi()
- vgic_v2_map_resources()
- vgic_v2_init()
- vgic_v2_add_sgi_source()
- vgic_v2_init_model()
- vgic_v2_init_emulation()
- handle_cpu_mmio_misc()
- handle_mmio_abpr()
- handle_cpu_mmio_ident()
- vgic_attr_regs_access()
- vgic_create() (renamed to vgic_v2_create())
- vgic_destroy() (renamed to vgic_v2_destroy())
- vgic_has_attr() (renamed to vgic_v2_has_attr())
- vgic_set_attr() (renamed to vgic_v2_set_attr())
- vgic_get_attr() (renamed to vgic_v2_get_attr())
- struct kvm_mmio_range vgic_dist_ranges[]
- struct kvm_mmio_range vgic_cpu_ranges[]
- struct kvm_device_ops kvm_arm_vgic_v2_ops {}
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
vgic.c is currently a mixture of generic vGIC emulation code and
functions specific to emulating a GICv2. To ease the addition of
GICv3 later, we create new header file vgic.h, which holds constants
and prototypes of commonly used functions.
Rename some identifiers to avoid name space clutter.
I removed the long-standing comment about using the kvm_io_bus API
to tackle the GIC register ranges, as it wouldn't be a win for us
anymore.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
-------
As the diff isn't always obvious here (and to aid eventual rebases),
here is a list of high-level changes done to the code:
* moved definitions and prototypes from vgic.c to vgic.h:
- VGIC_ADDR_UNDEF
- ACCESS_{READ,WRITE}_*
- vgic_init()
- vgic_update_state()
- vgic_kick_vcpus()
- vgic_get_vmcr()
- vgic_set_vmcr()
- struct mmio_range {} (renamed to struct kvm_mmio_range)
* removed static keyword and exported prototype in vgic.h:
- vgic_bitmap_get_reg()
- vgic_bitmap_set_irq_val()
- vgic_bitmap_get_shared_map()
- vgic_bytemap_get_reg()
- vgic_dist_irq_set_pending()
- vgic_dist_irq_clear_pending()
- vgic_cpu_irq_clear()
- vgic_reg_access()
- handle_mmio_raz_wi()
- vgic_handle_enable_reg()
- vgic_handle_set_pending_reg()
- vgic_handle_clear_pending_reg()
- vgic_handle_cfg_reg()
- vgic_unqueue_irqs()
- find_matching_range() (renamed to vgic_find_range)
- vgic_handle_mmio_range()
- vgic_update_state()
- vgic_get_vmcr()
- vgic_set_vmcr()
- vgic_queue_irq()
- vgic_kick_vcpus()
- vgic_init()
- vgic_v2_init_emulation()
- vgic_has_attr_regs()
- vgic_set_common_attr()
- vgic_get_common_attr()
- vgic_destroy()
- vgic_create()
* moved functions to vgic.h (static inline):
- mmio_data_read()
- mmio_data_write()
- is_in_range()
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
vgic_set_attr() and vgic_get_attr() contain both code specific for
the emulated GIC as well as code for the userland facing, generic
part of the GIC.
Split the guest GIC facing code of from the generic part to allow
easier splitting later.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The MMIO accessors for GICD_I[CS]ENABLER, GICD_I[CS]PENDR and
GICD_ICFGR behave very similar for GICv2 and GICv3, although the way
the affected VCPU is determined differs.
Since we need them to access the registers from three different
places in the future, we factor out a generic, backend-facing
implementation and use small wrappers in the current GICv2 emulation.
This will ease adding GICv3 accessors later.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
ICC_SRE_EL1 is a system register allowing msr/mrs accesses to the
GIC CPU interface for EL1 (guests). Currently we force it to 0, but
for proper GICv3 support we have to allow guests to use it (depending
on their selected virtual GIC model).
So add ICC_SRE_EL1 to the list of saved/restored registers on a
world switch, but actually disallow a guest to change it by only
restoring a fixed, once-initialized value.
This value depends on the GIC model userland has chosen for a guest.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently the maximum number of vCPUs supported is a global value
limited by the used GIC model. GICv3 will lift this limit, but we
still need to observe it for guests using GICv2.
So the maximum number of vCPUs is per-VM value, depending on the
GIC model the guest uses.
Store and check the value in struct kvm_arch, but keep it down to
8 for now.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
To check whether the vGIC was already initialized, we currently check
the GICH base address for not being NULL. Since with GICv3 we may
get along without this address, lets use the irqchip_in_kernel()
function to detect an already initialized vGIC.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently we unconditionally register the GICv2 emulation device
during the host's KVM initialization. Since with GICv3 support we
may end up with only v2 or only v3 or both supported, we move the
registration into the GIC probing function, where we will later know
which combination is valid.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently we only have one virtual GIC model supported, so all guests
use the same emulation code. With the addition of another model we
end up with different guests using potentially different vGIC models,
so we have to split up some functions to be per VM.
Introduce a vgic_vm_ops struct to hold function pointers for those
functions that are different and provide the necessary code to
initialize them.
Also split up the vgic_init() function to separate out VGIC model
specific functionality into a separate function, which will later be
different for a GICv3 model.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Some GICv3 registers can and will be accessed as 64 bit registers.
Currently the register handling code can only deal with 32 bit
accesses, so we do two consecutive calls to cover this.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently we only need to deal with one MMIO region for the GIC
emulation (the GICv2 distributor), but we soon need to extend this.
Refactor the existing code to allow easier addition of different
ranges without code duplication.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
With the introduction of a second emulated GIC model we need to let
userspace specify the GIC model to use for each VM. Pass the
userspace provided value down into the vGIC code and store it there
to differentiate later.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
kvm_get_dirty_log() provides generic handling of dirty bitmap, currently reused
by several architectures. Building on that we intrdoduce
kvm_get_dirty_log_protect() adding write protection to mark these pages dirty
for future write access, before next KVM_GET_DIRTY_LOG ioctl call from user
space.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
Allow architectures to override the generic kvm_flush_remote_tlbs()
function via HAVE_KVM_ARCH_TLB_FLUSH_ALL. ARMv7 will need this to
provide its own TLB flush interface.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
Since the advent of VGIC dynamic initialization, this latter is
initialized quite late on the first vcpu run or "on-demand", when
injecting an IRQ or when the guest sets its registers.
This initialization could be initiated explicitly much earlier
by the users-space, as soon as it has provided the requested
dimensioning parameters.
This patch adds a new entry to the VGIC KVM device that allows
the user to manually request the VGIC init:
- a new KVM_DEV_ARM_VGIC_GRP_CTRL group is introduced.
- Its first attribute is KVM_DEV_ARM_VGIC_CTRL_INIT
The rationale behind introducing a group is to be able to add other
controls later on, if needed.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
To be more explicit on vgic initialization failure, -ENODEV is
returned by vgic_init when no online vcpus can be found at init.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Several hypervisors need MSR auto load/restore feature.
We read MSRs from VM-entry MSR load area which specified by L1,
and load them via kvm_set_msr in the nested entry.
When nested exit occurs, we get MSRs via kvm_get_msr, writing
them to L1`s MSR store area. After this, we read MSRs from VM-exit
MSR load area, and load them via kvm_set_msr.
Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current timecounter implementation will drop a variable amount
of resolution, depending on the magnitude of the time delta. In
other words, reading the clock too often or too close to a time
stamp conversion will introduce errors into the time values. This
patch fixes the issue by introducing a fractional nanosecond field
that accumulates the low order bits.
Reported-by: Janusz Użycki <j.uzycki@elproma.com.pl>
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modifying a non-existent slot is not allowed. Also check that the
first loop doesn't move a deleted slot beyond the used part of
the mslots array.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before commit 0e60b0799f (kvm: change memslot sorting rule from size
to GFN, 2014-12-01), the memslots' sorting key was npages, meaning
that a valid memslot couldn't have its sorting key equal to zero.
On the other hand, a valid memslot can have base_gfn == 0, and invalid
memslots are identified by base_gfn == npages == 0.
Because of this, commit 0e60b0799f broke the invariant that invalid
memslots are at the end of the mslots array. When a memslot with
base_gfn == 0 was created, any invalid memslot before it were left
in place.
This can be fixed by changing the insertion to use a ">=" comparison
instead of "<=", but some care is needed to avoid breaking the case
of deleting a memslot; see the comment in update_memslots.
Thanks to Tiejun Chen for posting an initial patch for this bug.
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Tested-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- spring cleaning: removed support for IA64, and for hardware-assisted
virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken because
the (non-KVM) XSAVES patches inadvertently changed the KVM userspace
ABI whenever XSAVES was enabled; hence, this part is going to stable.
Guest support is just a matter of exposing the feature and CPUID leaves
support.
Right now KVM is broken for PPC BookE in your tree (doesn't compile).
I'll reply to the pull request with a patch, please apply it either
before the pull request or in the merge commit, in order to preserve
bisectability somewhat.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJUkpg+AAoJEL/70l94x66DUmoH/jzXYkptSW9NGgm79KqxGJlD
lzLnLBkitVvx++Mz5YBhdJEhKKLUlCtifFT1zPJQ/pthQhIRSaaAwZyNGgUs5w5x
yMGKHiPQFyZRbmQtZhCInW0BftJoYHHciO3nUfHCZnp34My9MP2D55W7/z+fYFfQ
DuqBSE9ThyZJtZ4zh8NRA9fCOeuqwVYRyoBs820Wbsh4cpIBoIK63Dg7k+CLE+ZV
MZa/mRL6bAfsn9W5bnOUAgHJ3SPznnWbO3/g0aV+roL/5pffblprJx9lKNR08xUM
6hDFLop2gDehDJesDkY/o8Ckp1hEouvfsVpSShry4vcgtn0hgh2O5/6Orbmj6vE=
=Zwq1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"3.19 changes for KVM:
- spring cleaning: removed support for IA64, and for hardware-
assisted virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken
because the (non-KVM) XSAVES patches inadvertently changed the KVM
userspace ABI whenever XSAVES was enabled; hence, this part is
going to stable. Guest support is just a matter of exposing the
feature and CPUID leaves support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits)
KVM: move APIC types to arch/x86/
KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
KVM: PPC: Book3S HV: Improve H_CONFER implementation
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
KVM: PPC: Book3S HV: Remove code for PPC970 processors
KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
arch: powerpc: kvm: book3s_pr.c: Remove unused function
arch: powerpc: kvm: book3s.c: Remove some unused functions
arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
KVM: PPC: Book3S HV: ptes are big endian
KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
KVM: PPC: Book3S HV: Fix KSM memory corruption
KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
KVM: PPC: Book3S HV: Fix computation of tlbie operand
KVM: PPC: Book3S HV: Add missing HPTE unlock
KVM: PPC: BookE: Improve irq inject tracepoint
arm/arm64: KVM: Require in-kernel vgic for the arch timers
...
problems, clarifies VCPU init, and fixes a regression concerning the
VGIC init flow.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUjsVhAAoJEEtpOizt6ddy5rIH/1V/YVwhprC55YqdHelU9Qu2
Muzsx+7F71NxC7xgMGFqPD1YrPR+hxvoPhy+ADOBlvcqlolrkDnV9I+8e3geaYNc
nZ/yEnoGTtbAggiS1smx7usBv34Z88Sd5txNjmj1cmHBy+VOWlyidWMkGBTsfBRe
mVc61BDUfyC47udgRHXhwS80sbHLJHElmADisFOVmQNBYwwiHiTdx0hMBMnHcC3Y
/3T0tKxHdeTISnmA+J+n7TcChtTIM4xqC6kwf3rw3b7XX8gdtTKylDHX2GLAg646
RdebAG2twmGpIc6SxXZbo38f3oY9OFo1Le5xZGa6iUjD56VDw/e4wg4iA2juo0Y=
=J2Ut
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-3.19-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
Second round of changes for KVM for arm/arm64 for v3.19; fixes reboot
problems, clarifies VCPU init, and fixes a regression concerning the
VGIC init flow.
Conflicts:
arch/ia64/kvm/kvm-ia64.c [deleted in HEAD and modified in kvmarm]
It is curently possible to run a VM with architected timers support
without creating an in-kernel VGIC, which will result in interrupts from
the virtual timer going nowhere.
To address this issue, move the architected timers initialization to the
time when we run a VCPU for the first time, and then only initialize
(and enable) the architected timers if we have a properly created and
initialized in-kernel VGIC.
When injecting interrupts from the virtual timer to the vgic, the
current setup should ensure that this never calls an on-demand init of
the VGIC, which is the only call path that could return an error from
kvm_vgic_inject_irq(), so capture the return value and raise a warning
if there's an error there.
We also change the kvm_timer_init() function from returning an int to be
a void function, since the function always succeeds.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Userspace assumes that it can wire up IRQ injections after having
created all VCPUs and after having created the VGIC, but potentially
before starting the first VCPU. This can currently lead to lost IRQs
because the state of that IRQ injection is not stored anywhere and we
don't return an error to userspace.
We haven't seen this problem manifest itself yet, presumably because
guests reset the devices on boot, but this could cause issues with
migration and other non-standard startup configurations.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Some code paths will need to check to see if the internal state of the
vgic has been initialized (such as when creating new VCPUs), so
introduce such a macro that checks the nr_cpus field which is set when
the vgic has been initialized.
Also set nr_cpus = 0 in kvm_vgic_destroy, because the error path in
vgic_init() will call this function, and code should never errornously
assume the vgic to be properly initialized after an error.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The vgic_initialized() macro currently returns the state of the
vgic->ready flag, which indicates if the vgic is ready to be used when
running a VM, not specifically if its internal state has been
initialized.
Rename the macro accordingly in preparation for a more nuanced
initialization flow.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
VGIC initialization currently happens in three phases:
(1) kvm_vgic_create() (triggered by userspace GIC creation)
(2) vgic_init_maps() (triggered by userspace GIC register read/write
requests, or from kvm_vgic_init() if not already run)
(3) kvm_vgic_init() (triggered by first VM run)
We were doing initialization of some state to correspond with the
state of a freshly-reset GIC in kvm_vgic_init(); this is too late,
since it will overwrite changes made by userspace using the
register access APIs before the VM is run. Move this initialization
earlier, into the vgic_init_maps() phase.
This fixes a bug where QEMU could successfully restore a saved
VM state snapshot into a VM that had already been run, but could
not restore it "from cold" using the -loadvm command line option
(the symptoms being that the restored VM would run but interrupts
were ignored).
Finally rename vgic_init_maps to vgic_init and renamed kvm_vgic_init to
kvm_vgic_map_resources.
[ This patch is originally written by Peter Maydell, but I have
modified it somewhat heavily, renaming various bits and moving code
around. If something is broken, I am to be blamed. - Christoffer ]
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We currently track the pid of the task that runs the VCPU in vcpu_load.
If a yield to that VCPU is triggered while the PID of the wrong thread
is active, the wrong thread might receive a yield, but this will most
likely not help the executing thread at all. Instead, if we only track
the pid on the KVM_RUN ioctl, there are two possibilities:
1) the thread that did a non-KVM_RUN ioctl is holding a mutex that
the VCPU thread is waiting for. In this case, the VCPU thread is not
runnable, but we also do not do a wrong yield.
2) the thread that did a non-KVM_RUN ioctl is sleeping, or doing
something that does not block the VCPU thread. In this case, the
VCPU thread can receive the directed yield correctly.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
CC: Michael Mueller <mimu@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_enter_guest() has to be called with preemption disabled and will
set PF_VCPU. Current code takes PF_VCPU as a hint that the VCPU thread
is running and therefore needs no yield.
However, the check on PF_VCPU is wrong on s390, where preemption has
to stay enabled in order to correctly process page faults. Thus,
s390 reenables preemption and starts to execute the guest. The thread
might be scheduled out between kvm_enter_guest() and kvm_exit_guest(),
resulting in PF_VCPU being set but not being run. When this happens,
the opportunity for directed yield is missed.
However, this check is done already in kvm_vcpu_on_spin before calling
kvm_vcpu_yield_loop:
if (!ACCESS_ONCE(vcpu->preempted))
continue;
so the check on PF_VCPU is superfluous in general, and this patch
removes it.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current linear search doesn't scale well when
large amount of memslots is used and looked up slot
is not in the beginning memslots array.
Taking in account that memslots don't overlap, it's
possible to switch sorting order of memslots array from
'npages' to 'base_gfn' and use binary search for
memslot lookup by GFN.
As result of switching to binary search lookup times
are reduced with large amount of memslots.
Following is a table of search_memslot() cycles
during WS2008R2 guest boot.
boot, boot + ~10 min
mostly same of using it,
slot lookup randomized lookup
max average average
cycles cycles cycles
13 slots : 1450 28 30
13 slots : 1400 30 40
binary search
117 slots : 13000 30 460
117 slots : 2000 35 180
binary search
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
it will allow to use binary search for GFN -> memslot
lookups, reducing lookup cost with large slots amount.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
UP/DOWN shift loops will shift array in needed
direction and stop at place where new slot should
be placed regardless of old slot size.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
if number of pages haven't changed sorting algorithm
will do nothing, so there is no need to do extra check
to avoid entering sorting logic.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 85c8555ff0 ("KVM: check for !is_zero_pfn() in
kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn.
The problem being addressed by the patch above was that some ARM code
based the memory mapping attributes of a pfn on the return value of
kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should
be mapped as device memory.
However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin,
and the existing non-ARM users were already using it in a way which
suggests that its name should probably have been 'kvm_is_reserved_pfn'
from the beginning, e.g., whether or not to call get_page/put_page on
it etc. This means that returning false for the zero page is a mistake
and the patch above should be reverted.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If we detect another vCPU is running we just exit and return 0 as if we
succesfully created the VGIC, but the VGIC wouldn't actual be created.
This shouldn't break in-kernel behavior because the kernel will not
observe the failed the attempt to create the VGIC, but userspace could
be rightfully confused.
Cc: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When call kvm_vgic_inject_irq to inject interrupt, we can known which
vcpu the interrupt for by the irq_num and the cpuid. So we should just
kick this vcpu to avoid iterating through all.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Shannon Zhao <zhaoshenglong@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When 'injecting' an edge-triggered interrupt with a falling edge we
shouldn't clear the pending state on the distributor. In fact, we
don't, because the check in vgic_validate_injection would prevent us
from ever reaching this bit of code.
Remove the unreachable snippet.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This reverts commit 85c8555ff0 ("KVM: check for !is_zero_pfn() in
kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn.
The problem being addressed by the patch above was that some ARM code
based the memory mapping attributes of a pfn on the return value of
kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should
be mapped as device memory.
However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin,
and the existing non-ARM users were already using it in a way which
suggests that its name should probably have been 'kvm_is_reserved_pfn'
from the beginning, e.g., whether or not to call get_page/put_page on
it etc. This means that returning false for the zero page is a mistake
and the patch above should be reverted.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When vgic_update_irq_pending with level-sensitive false, it is need to
deactivates an interrupt, and, it can go to out directly.
Here return a false value, because it will be not need to kick.
Signed-off-by: wanghaibin <wanghaibin.wang@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Now that ia64 is gone, we can hide deprecated device assignment in x86.
Notable changes:
- kvm_vm_ioctl_assigned_device() was moved to x86/kvm_arch_vm_ioctl()
The easy parts were removed from generic kvm code, remaining
- kvm_iommu_(un)map_pages() would require new code to be moved
- struct kvm_assigned_dev_kernel depends on struct kvm_irq_ack_notifier
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ia64 does not need them anymore. Ack notifiers become x86-specific
too.
Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM for ia64 has been marked as broken not just once, but twice even,
and the last patch from the maintainer is now roughly 5 years old.
Time for it to rest in peace.
Acked-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The update_memslots invocation is only needed in one case. Make
the code clearer by moving it to __kvm_set_memory_region, and
removing the wrapper around insert_memslot.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The two kmemdup invocations can be unified. I find that the new
placement of the comment makes it easier to see what happens.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This completes the optimization from the previous patch, by
removing the KVM_MEM_SLOTS_NUM-iteration loop from insert_memslot.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
memslots is a sorted array. When a slot is changed, heapsort (lib/sort.c)
would take O(n log n) time to update it; an optimized insertion sort will
only cost O(n) on an array with just one item out of order.
Replace sort() with a custom sort that takes advantage of memslots usage
pattern and the known position of the changed slot.
performance change of 128 memslots insertions with gradually increasing
size (the worst case):
heap sort custom sort
max: 249747 2500 cycles
with custom sort alg taking ~98% less then original
update time.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
commit 72dc67a696 ("KVM: remove the usage of the mmap_sem for the protection of the memory slots.")
changed the lock which will be taken. This should be reflected in the function
commentary.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM does not deliver x2APIC broadcast messages with physical mode. Intel SDM
(10.12.9 ICR Operation in x2APIC Mode) states: "A destination ID value of
FFFF_FFFFH is used for broadcast of interrupts in both logical destination and
physical destination modes."
In addition, the local-apic enables cluster mode broadcast. As Intel SDM
10.6.2.2 says: "Broadcast to all local APICs is achieved by setting all
destination bits to one." This patch enables cluster mode broadcast.
The fix tries to combine broadcast in different modes through a unified code.
One rare case occurs when the source of IPI has its APIC disabled. In such
case, the source can still issue IPIs, but since the source is not obliged to
have the same LAPIC mode as the enabled ones, we cannot rely on it.
Since it is a rare case, it is unoptimized and done on the slow-path.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[As per Radim's review, use unsigned int for X2APIC_BROADCAST, return bool from
kvm_apic_broadcast. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The third parameter of kvm_unpin_pages() when called from
kvm_iommu_map_pages() is wrong, it should be the number of pages to un-pin
and not the page size.
This error was facilitated with an inconsistent API: kvm_pin_pages() takes
a size, but kvn_unpin_pages() takes a number of pages, so fix the problem
by matching the two.
This was introduced by commit 350b8bd ("kvm: iommu: fix the third parameter
of kvm_iommu_put_pages (CVE-2014-3601)"), which fixes the lack of
un-pinning for pages intended to be un-pinned (i.e. memory leak) but
unfortunately potentially aggravated the number of pages we un-pin that
should have stayed pinned. As far as I understand though, the same
practical mitigations apply.
This issue was found during review of Red Hat 6.6 patches to prepare
Ksplice rebootless updates.
Thanks to Vegard for his time on a late Friday evening to help me in
understanding this code.
Fixes: 350b8bd ("kvm: iommu: fix the third parameter of... (CVE-2014-3601)")
Cc: stable@vger.kernel.org
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Reviewed-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Support for 48bit IPA and VA (EL2)
- A number of fixes for devices mapped into guests
- Yet another VGIC fix for BE
- A fix for CPU hotplug
- A few compile fixes (disabled VGIC, strict mm checks)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUQkyTAAoJECPQ0LrRPXpDwrUP/1WbELgB74W35CJ1zIc4KuBi
unP1muW3QAr9Vmp/KovRKyLKFiRaTRlQsszaI78f4ZQ++0vzivU8dZwV81Gn1y/v
0qF63OB0UYsOgXMRrh5JTEqzUyNyNBLUH+FAQiEO/srDoH5WLp3Zq7ThjzjwGn7Q
K2ArxFiml+p2BGIGKWe3XIrxNgpW4oWhfe1kW4WU7sshuJlut3Nee+q2lSIg9mZx
2VXYnLNzSsHizgQHuVEyXIqn8HA5FSCvjBYIUcLERlWB0I66WvzOqg9rH/BmNNR2
H+cBDY+9D8KBUBG9zZSG7hZ0mAONKcOnxGZWGzte3Oi7FMZkB3Y/zrIs0na4iB5Y
FxE8j+2qclZk9fkHQ7wn9Ws8hpGR2OrFlc2O5ZoBJJ2KJ4wMRHMeEbYRBCRQbTCN
+81SUW7mh2j/La0JBqZ6DhhTiymUdIB+6v78im9WGlHsFRAIHBt0Q0u/pIyY+GJs
OH7FoswI3vF5iODlHeRO1yjaO3rkj+IJwqTuUuhAIGu9+qnof3ge+eM1cOqrudNa
u2kDz+BC21+Q8dflOF99Ryz7cMWqMiwtR+N+OUYpxc7RL7mCeHVANJxdWIFHKa2Z
XJaHmbKjmw8AoR0fbS6YWOl2xmIhqU+FAngI+mow/Hz4pJDpR2K3w17ASXIVVQVX
go2bvGHONkdn8Ji3Asap
=3fl5
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-3.18-take-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
Pull second batch of changes for KVM/{arm,arm64} from Marc Zyngier:
"The most obvious thing is the sizeable MMU changes to support 48bit
VAs on arm64.
Summary:
- support for 48bit IPA and VA (EL2)
- a number of fixes for devices mapped into guests
- yet another VGIC fix for BE
- a fix for CPU hotplug
- a few compile fixes (disabled VGIC, strict mm checks)"
[ I'm pulling directly from Marc at the request of Paolo Bonzini, whose
backpack was stolen at Düsseldorf airport and will do new keys and
rebuild his web of trust. - Linus ]
* tag 'kvm-arm-for-3.18-take-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm:
arm/arm64: KVM: Fix BE accesses to GICv2 EISR and ELRSR regs
arm: kvm: STRICT_MM_TYPECHECKS fix for user_mem_abort
arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE
arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
arm/arm64: KVM: map MMIO regions at creation time
arm64: kvm: define PAGE_S2_DEVICE as read-only by default
ARM: kvm: define PAGE_S2_DEVICE as read-only by default
arm/arm64: KVM: add 'writable' parameter to kvm_phys_addr_ioremap
arm/arm64: KVM: fix potential NULL dereference in user_mem_abort()
arm/arm64: KVM: use __GFP_ZERO not memset() to get zeroed pages
ARM: KVM: fix vgic-disabled build
arm: kvm: fix CPU hotplug
The EIRSR and ELRSR registers are 32-bit registers on GICv2, and we
store these as an array of two such registers on the vgic vcpu struct.
However, we access them as a single 64-bit value or as a bitmap pointer
in the generic vgic code, which breaks BE support.
Instead, store them as u64 values on the vgic structure and do the
word-swapping in the assembly code, which already handles the byte order
for BE systems.
Tested-by: Victor Kamensky <victor.kamensky@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
This pull-request includes:
* Change in the IOMMU-API to convert the former iommu_domain_capable
function to just iommu_capable
* Various fixes in handling RMRR ranges for the VT-d driver (one fix
requires a device driver core change which was acked
by Greg KH)
* The AMD IOMMU driver now assigns and deassigns complete alias groups
to fix issues with devices using the wrong PCI request-id
* MMU-401 support for the ARM SMMU driver
* Multi-master IOMMU group support for the ARM SMMU driver
* Various other small fixes all over the place
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJUPNxYAAoJECvwRC2XARrjMwMP/RLSr+oA31rGVjLXcmcCHl7Q
Uj7xpcnG19qB0aqNR1JeJuZNkK/tw44pE353MQPbz4N9UVUiogklGIVD1iJvFV53
0qm84bvpDJIof4aP35B3H3Umft2USTn/lmsQg/RklQcNTW8DzNj63b8BTNR7k/GL
G7bLg7F1BUCl0shZCCsFspOIulQPAJYN2OvHlfYBav/bfDvfouQ3lrV+loGrK44r
F2Hmp+imXlIhUCjfbiWz6wKFxvPrxZx482vm2pXBCSnXEdW4/fz6nf9VHUK/Cfsq
JAimY1CfiDo1aqH9/yVHUOw5SD/NYOXq6E5bFPg/WENbipbbae5cK2u6PX5MMBAn
CG4BM8l9xicfGPqgn5YFSRY/6qC6K7NlxMnt9U8l18QIkDVDqEtUgJQISJuce7wx
FWx6eSWaxpIe5yhq19/h2ELalUUyR/fPq+UXXjYDL1kLV/vcvC/lC3mbNAQU93zU
WK0bG2tDg88JHavc25Ewa2aOn4BVM2BpwuLbYlgQReaEmsQRnEPgtmRNyLJHqbFE
wwpCj8pBWdufsJWRyvpnXQ+CfA7oSz4e7hz1G+0/5uiDmagfvg16Ql5JtPmmuLUm
Kc3dVIiG0s1ewohZIIJETGCqprQbCSqs8CCQqB6p2zDBWFKpNT7F38lm/KlehkCz
JpAiI7Y2K9Jejp0VIPrt
=OMOt
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"This pull-request includes:
- change in the IOMMU-API to convert the former iommu_domain_capable
function to just iommu_capable
- various fixes in handling RMRR ranges for the VT-d driver (one fix
requires a device driver core change which was acked by Greg KH)
- the AMD IOMMU driver now assigns and deassigns complete alias
groups to fix issues with devices using the wrong PCI request-id
- MMU-401 support for the ARM SMMU driver
- multi-master IOMMU group support for the ARM SMMU driver
- various other small fixes all over the place"
* tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (41 commits)
iommu/vt-d: Work around broken RMRR firmware entries
iommu/vt-d: Store bus information in RMRR PCI device path
iommu/vt-d: Only remove domain when device is removed
driver core: Add BUS_NOTIFY_REMOVED_DEVICE event
iommu/amd: Fix devid mapping for ivrs_ioapic override
iommu/irq_remapping: Fix the regression of hpet irq remapping
iommu: Fix bus notifier breakage
iommu/amd: Split init_iommu_group() from iommu_init_device()
iommu: Rework iommu_group_get_for_pci_dev()
iommu: Make of_device_id array const
amd_iommu: do not dereference a NULL pointer address.
iommu/omap: Remove omap_iommu unused owner field
iommu: Remove iommu_domain_has_cap() API function
IB/usnic: Convert to use new iommu_capable() API function
vfio: Convert to use new iommu_capable() API function
kvm: iommu: Convert to use new iommu_capable() API function
iommu/tegra: Convert to iommu_capable() API function
iommu/msm: Convert to iommu_capable() API function
iommu/vt-d: Convert to iommu_capable() API function
iommu/fsl: Convert to iommu_capable() API function
...
Add support for read-only MMIO passthrough mappings by adding a
'writable' parameter to kvm_phys_addr_ioremap. For the moment,
mappings will be read-write even if 'writable' is false, but once
the definition of PAGE_S2_DEVICE gets changed, those mappings will
be created read-only.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Apart from the usual cleanups, here is the summary of new features:
- s390 moves closer towards host large page support
- PowerPC has improved support for debugging (both inside the guest and
via gdbstub) and support for e6500 processors
- ARM/ARM64 support read-only memory (which is necessary to put firmware
in emulated NOR flash)
- x86 has the usual emulator fixes and nested virtualization improvements
(including improved Windows support on Intel and Jailhouse hypervisor
support on AMD), adaptive PLE which helps overcommitting of huge guests.
Also included are some patches that make KVM more friendly to memory
hot-unplug, and fixes for rare caching bugs.
Two patches have trivial mm/ parts that were acked by Rik and Andrew.
Note: I will soon switch to a subkey for signing purposes. To verify
future signed pull requests from me, please update my key with
"gpg --recv-keys 9B4D86F2". You should see 3 new subkeys---the
one for signing will be a 2048-bit RSA key, 4E6B09D7.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJUL5sPAAoJEBvWZb6bTYbyfkEP/3MNhSyn6HCjPjtjLNPAl9KL
WpExZSUFL2+4CztpdGIsek1BeJYHmqv3+c5S+WvaWVA1aqh2R7FT1D1ErBLjgLQq
lq23IOr+XxmC3dXQUEEk+TlD+283UzypzEG4l4UD3JYg79fE3UrXAz82SeyewJDY
x7aPYhkZG3RHu+wAyMPasG6E3zS5LySdUtGWbiPwz5BejrhBJoJdeb2WIL/RwnUK
7ppSLB5EoFj/uMkuyeAAdAbdfSrhHA6faDZxNdxS9k9wGutrhhfUoQ49ONrKG4dV
sFo1tSPTVgRs8QFYUZ2fJUPBAmUVddsgqh2K9d0NftGTq7b8YszaCsfFrs2/Y4MU
YxssWEhxsfszerCu12bbAJrv6JBZYQ7TwGvI9L7P0iFU6IVw/djmukU4AkM9/e91
YS/cue/PN+9Pn2ccXzL9J7xRtZb8FsOuRsCXTCmbOwDkLmrKPDBN2t3RUbeF+Eam
ABrpWnLKX13kZSo4LKU+/niarzmPMp7odQfHVdr8ea0fiYLp4iN8puA20WaSPIgd
CLvm+RAvXe5Lm91L4mpFotJ2uFyK6QlIYJV4FsgeWv/0D0qppWQi0Utb/aCNHCgy
z8MyUMD48y7EpoQrFYr/7cddXIu0/NegnM8I1coVjIPEk4NfeebGUlCJ/V3D8wMG
BgEfS2x6jRc5zB3hjwDr
=iEVi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Fixes and features for 3.18.
Apart from the usual cleanups, here is the summary of new features:
- s390 moves closer towards host large page support
- PowerPC has improved support for debugging (both inside the guest
and via gdbstub) and support for e6500 processors
- ARM/ARM64 support read-only memory (which is necessary to put
firmware in emulated NOR flash)
- x86 has the usual emulator fixes and nested virtualization
improvements (including improved Windows support on Intel and
Jailhouse hypervisor support on AMD), adaptive PLE which helps
overcommitting of huge guests. Also included are some patches that
make KVM more friendly to memory hot-unplug, and fixes for rare
caching bugs.
Two patches have trivial mm/ parts that were acked by Rik and Andrew.
Note: I will soon switch to a subkey for signing purposes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (157 commits)
kvm: do not handle APIC access page if in-kernel irqchip is not in use
KVM: s390: count vcpu wakeups in stat.halt_wakeup
KVM: s390/facilities: allow TOD-CLOCK steering facility bit
KVM: PPC: BOOK3S: HV: CMA: Reserve cma region only in hypervisor mode
arm/arm64: KVM: Report correct FSC for unsupported fault types
arm/arm64: KVM: Fix VTTBR_BADDR_MASK and pgd alloc
kvm: Fix kvm_get_page_retry_io __gup retval check
arm/arm64: KVM: Fix set_clear_sgi_pend_reg offset
kvm: x86: Unpin and remove kvm_arch->apic_access_page
kvm: vmx: Implement set_apic_access_page_addr
kvm: x86: Add request bit to reload APIC access page address
kvm: Add arch specific mmu notifier for page invalidation
kvm: Rename make_all_cpus_request() to kvm_make_all_cpus_request() and make it non-static
kvm: Fix page ageing bugs
kvm/x86/mmu: Pass gfn and level to rmapp callback.
x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only
kvm: x86: use macros to compute bank MSRs
KVM: x86: Remove debug assertion of non-PAE reserved bits
kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
kvm: Faults which trigger IO release the mmap_sem
...
This includes a bunch of changes:
- Support read-only memory slots on arm/arm64
- Various changes to fix Sparse warnings
- Correctly detect write vs. read Stage-2 faults
- Various VGIC cleanups and fixes
- Dynamic VGIC data strcuture sizing
- Fix SGI set_clear_pend offset bug
- Fix VTTBR_BADDR Mask
- Correctly report the FSC on Stage-2 faults
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUJWAdAAoJEEtpOizt6ddy9cMH+gIoUPnRJLe+PPcOOyxOx6pr
+CnD/zAd0sLvxZLP/LBOzu99H3YrbO5kwI/172/8G1zUNI2hp6YxEEJaBCTHrz6l
RwgLy7a3EMMY51nJo5w2dkFUo8cUX9MsHqMpl2Xb7Dvo2ZHp+nDqRjwRY6yi+t4V
dWSJTRG6X+DIWyysij6jBtfKU6MpU+4NW3Zdk1fapf8QDkn+cBtV5X2QcmERCaIe
A1j9hiGi43KA3XWeeePU3aVaxC2XUhTayP8VsfVxoNG2manaS6lqjmbif5ghs/0h
rw7R3/Aj0MJny2zT016MkvKJKRukuVRD6e1lcYghqnSJhL2FossowZ9fHRADpqU=
=QgU8
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-next
Changes for KVM for arm/arm64 for 3.18
This includes a bunch of changes:
- Support read-only memory slots on arm/arm64
- Various changes to fix Sparse warnings
- Correctly detect write vs. read Stage-2 faults
- Various VGIC cleanups and fixes
- Dynamic VGIC data strcuture sizing
- Fix SGI set_clear_pend offset bug
- Fix VTTBR_BADDR Mask
- Correctly report the FSC on Stage-2 faults
Conflicts:
virt/kvm/eventfd.c
[duplicate, different patch where the kvm-arm version broke x86.
The kvm tree instead has the right one]
Confusion around -EBUSY and zero (inside a BUG_ON no less).
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The sgi values calculated in read_set_clear_sgi_pend_reg() and
write_set_clear_sgi_pend_reg() were horribly incorrectly multiplied by 4
with catastrophic results in that subfunctions ended up overwriting
memory not allocated for the expected purpose.
This showed up as bugs in kfree() and the kernel complaining a lot of
you turn on memory debugging.
This addresses: http://marc.info/?l=kvm&m=141164910007868&w=2
Reported-by: Shannon Zhao <zhaoshenglong@huawei.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
This will be used to let the guest run while the APIC access page is
not pinned. Because subsequent patches will fill in the function
for x86, place the (still empty) x86 implementation in the x86.c file
instead of adding an inline function in kvm_host.h.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Different architectures need different requests, and in fact we
will use this function in architecture-specific code later. This
will be outside kvm_main.c, so make it non-static and rename it to
kvm_make_all_cpus_request().
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.
2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.
3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vcpu ioctls can hang the calling thread if issued while a vcpu is running.
However, invalid ioctls can happen when userspace tries to probe the kind
of file descriptors (e.g. isatty() calls ioctl(TCGETS)); in that case,
we know the ioctl is going to be rejected as invalid anyway and we can
fail before trying to take the vcpu mutex.
This patch does not change functionality, it just makes invalid ioctls
fail faster.
Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.
If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.
This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
/me got confused between the kernel and QEMU. In the kernel, you can
only have one module_init function, and it will prevent unloading the
module unless you also have the corresponding module_exit function.
So, commit 80ce163972 (KVM: VFIO: register kvm_device_ops dynamically,
2014-09-02) broke unloading of the kvm module, by adding a module_init
function and no module_exit.
Repair it by making kvm_vfio_ops_init weak, and checking it in
kvm_init.
Cc: Will Deacon <will.deacon@arm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Alex Williamson <Alex.Williamson@redhat.com>
Fixes: 80ce163972
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit c77dcac (KVM: Move more code under CONFIG_HAVE_KVM_IRQFD) added
functionality that depends on definitions in ioapic.h when
__KVM_HAVE_IOAPIC is defined.
At the same time, kvm-arm commit 0ba0951 (KVM: EVENTFD: remove inclusion
of irq.h) removed the inclusion of irq.h, an architecture-specific header
that is not present on ARM but which happened to include ioapic.h on x86.
Include ioapic.h directly in eventfd.c if __KVM_HAVE_IOAPIC is defined.
This fixes x86 and lets ARM use eventfd.c.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We were using an atomic bitop on the vgic_v2.vgic_elrsr field which was
not aligned to the natural size on 64-bit platforms. This bug showed up
after QEMU correctly identifies the pl011 line as being level-triggered,
and not edge-triggered.
These data structures are protected by a spinlock so simply use a
non-atomic version of the accessor instead.
Tested-by: Joel Schopp <joel.schopp@amd.com>
Reported-by: Riku Voipio <riku.voipio@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Correct a simple mistake of checking the wrong variable
before a dereference, resulting in the dereference not being
properly protected by rcu_dereference().
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to make the number of interrupts configurable, use the new
fancy device management API to add KVM_DEV_ARM_VGIC_GRP_NR_IRQS as
a VGIC configurable attribute.
Userspace can now specify the exact size of the GIC (by increments
of 32 interrupts).
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
It is now quite easy to delay the allocation of the vgic tables
until we actually require it to be up and running (when the first
vcpu is kicking around, or someones tries to access the GIC registers).
This allow us to allocate memory for the exact number of CPUs we
have. As nobody configures the number of interrupts just yet,
use a fallback to VGIC_NR_IRQS_LEGACY.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Nuke VGIC_NR_IRQS entierly, now that the distributor instance
contains the number of IRQ allocated to this GIC.
Also add VGIC_NR_IRQS_LEGACY to preserve the current API.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Now that we can (almost) dynamically size the number of interrupts,
we're facing an interesting issue:
We have to evaluate at runtime whether or not an access hits a valid
register, based on the sizing of this particular instance of the
distributor. Furthermore, the GIC spec says that accessing a reserved
register is RAZ/WI.
For this, add a new field to our range structure, indicating the number
of bits a single interrupts uses. That allows us to find out whether or
not the access is in range.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
We now have the information about the number of CPU interfaces in
the distributor itself. Let's get rid of VGIC_MAX_CPUS, and just
rely on KVM_MAX_VCPUS where we don't have the choice. Yet.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Having a dynamic number of supported interrupts means that we
cannot relly on VGIC_NR_SHARED_IRQS being fixed anymore.
Instead, make it take the distributor structure as a parameter,
so it can return the right value.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
So far, all the VGIC data structures are statically defined by the
*maximum* number of vcpus and interrupts it supports. It means that
we always have to oversize it to cater for the worse case.
Start by changing the data structures to be dynamically sizeable,
and allocate them at runtime.
The sizes are still very static though.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
As it stands, nothing prevents userspace from injecting an interrupt
before the guest's GIC is actually initialized.
This goes unnoticed so far (as everything is pretty much statically
allocated), but ends up exploding in a spectacular way once we switch
to a more dynamic allocation (the GIC data structure isn't there yet).
The fix is to test for the "ready" flag in the VGIC distributor before
trying to inject the interrupt. Note that in order to avoid breaking
userspace, we have to ignore what is essentially an error.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
The VGIC virtual distributor implementation documentation was written a
very long time ago, before the true nature of the beast had been
partially absorbed into my bloodstream. Clarify the docs.
Plus, it fixes an actual bug. ICFRn, pfff.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Writes to GICD_ISPENDR0 and GICD_ICPENDR0 ignore all settings of the
pending state for SGIs. Make sure the implementation handles this
correctly.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Writes to GICD_ISPENDRn and GICD_ICPENDRn are currently not handled
correctly for level-triggered interrupts. The spec states that for
level-triggered interrupts, writes to the GICD_ISPENDRn activate the
output of a flip-flop which is in turn or'ed with the actual input
interrupt signal. Correspondingly, writes to GICD_ICPENDRn simply
deactivates the output of that flip-flop, but does not (of course) affect
the external input signal. Reads from GICC_IAR will also deactivate the
flip-flop output.
This requires us to track the state of the level-input separately from
the state in the flip-flop. We therefore introduce two new variables on
the distributor struct to track these two states. Astute readers may
notice that this is introducing more state than required (because an OR
of the two states gives you the pending state), but the remaining vgic
code uses the pending bitmap for optimized operations to figure out, at
the end of the day, if an interrupt is pending or not on the distributor
side. Refactoring the code to consider the two state variables all the
places where we currently access the precomputed pending value, did not
look pretty.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
If we unqueue a level-triggered interrupt completely, and the LR does
not stick around in the active state (and will therefore no longer
generate a maintenance interrupt), then we should clear the queued flag
so that the vgic can actually queue this level-triggered interrupt at a
later time and deal with its pending state then.
Note: This should actually be properly fixed to handle the active state
on the distributor.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We have a special bitmap on the distributor struct to keep track of when
level-triggered interrupts are queued on the list registers. This was
named irq_active, which is confusing, because the active state of an
interrupt as per the GIC spec is a different thing, not specifically
related to edge-triggered/level-triggered configurations but rather
indicates an interrupt which has been ack'ed but not yet eoi'ed.
Rename the bitmap and the corresponding accessor functions to irq_queued
to clarify what this is actually used for.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The irq_state field on the distributor struct is ambiguous in its
meaning; the comment says it's the level of the input put, but that
doesn't make much sense for edge-triggered interrupts. The code
actually uses this state variable to check if the interrupt is in the
pending state on the distributor so clarify the comment and rename the
actual variable and accessor methods.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Now that we have a dynamic means to register kvm_device_ops, use that
for the VFIO kvm device, instead of relying on the static table.
This is achieved by a module_init call to register the ops with KVM.
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alex Williamson <Alex.Williamson@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using the new kvm_register_device_ops() interface makes us get rid of
an #ifdef in common code.
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that we have a dynamic means to register kvm_device_ops, use that
for the ARM VGIC, instead of relying on the static table.
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_ioctl_create_device currently has knowledge of all the device types
and their associated ops. This is fairly inflexible when adding support
for new in-kernel device emulations, so move what we currently have out
into a table, which can support dynamic registration of ops by new
drivers for virtual hardware.
Cc: Alex Williamson <Alex.Williamson@redhat.com>
Cc: Alex Graf <agraf@suse.de>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use PCI device flag helper functions when assigning or releasing device.
No functional change.
Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI writing and irq delivery. If we do not emulate this behavior, and
re-inject the interrupt immediately after the guest sends an EOI and re-enables
interrupts, a guest might spend all its time in the ISR if it has a broken
handler for a level-triggered interrupt.
Such livelock actually happens with Windows guests when resuming from
hibernation.
As there's no way to recognize the broken handle from new raised ones, this patch
delays an interrupt if 10.000 consecutive EOIs found that the interrupt was
still high. The guest can then make a little forward progress, until a proper
IRQ handler is set or until some detection routine in the guest (such as
Linux's note_interrupt()) recognizes the situation.
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Zhang Haoyu <zhanghy@sangfor.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Read-only memory ranges may be backed by the zero page, so avoid
misidentifying it a a MMIO pfn.
This fixes another issue I identified when testing QEMU+KVM_UEFI, where
a read to an uninitialized emulated NOR flash brought in the zero page,
but mapped as a read-write device region, because kvm_is_mmio_pfn()
misidentifies it as a MMIO pfn due to its PG_reserved bit being set.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Fixes: b88657674d ("ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No more needed. irq.h would be void on ARM.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
__kvm_set_memory_region sets r to EINVAL very early.
Doing it again is not necessary. The same is true later on, where
r is assigned -ENOMEM twice.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The first statement of kvm_dev_ioctl is
long r = -EINVAL;
No need to reassign the same value.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The expression `vcpu->spin_loop.in_spin_loop' is always true,
because it is evaluated only when the condition
`!vcpu->spin_loop.in_spin_loop' is false.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not aquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.
If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.
We can prevent the following case:
vcpu (CPU 0) | thread (CPU 1)
--------------------------------------------+--------------------------
1 vm exit |
2 srcu_read_unlock(&kvm->srcu) |
3 decide to cache something based on |
old memslots |
4 | change memslots
| (increments generation)
5 | synchronize_srcu(&kvm->srcu);
6 retrieve generation # from new memslots |
7 tag cache with new memslot generation |
8 srcu_read_unlock(&kvm->srcu) |
... |
<action based on cache occurs even |
though the caching decision was based |
on the old memslots> |
... |
<action *continues* to occur until next |
memslot generation change, which may |
be never> |
|
By incrementing the generation after synchronizing with kvm->srcu readers,
we ensure that the generation retrieved in (6) will become invalid soon
after (8).
Keeping the existing increment is not strictly necessary, but we
do keep it and just move it for consistency from update_memslots to
install_new_memslots. It invalidates old cached MMIOs immediately,
instead of having to wait for the end of synchronize_srcu_expedited,
which makes the code more clearly correct in case CPU 1 is preempted
right after synchronize_srcu() returns.
To avoid halving the generation space in SPTEs, always presume that the
low bit of the generation is zero when reconstructing a generation number
out of an SPTE. This effectively disables MMIO caching in SPTEs during
the call to synchronize_srcu_expedited. Using the low bit this way is
somewhat like a seqcount---where the protected thing is a cache, and
instead of retrying we can simply punt if we observe the low bit to be 1.
Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The next patch will give a meaning (a la seqcount) to the low bit of the
generation number. Ensure that it matches between kvm->memslots->generation
and kvm_current_mmio_generation().
Cc: stable@vger.kernel.org
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the beggining was on_each_cpu(), which required an unused argument to
kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten.
Remove unnecessary arguments that stem from this.
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The idea between capabilities and the KVM_CHECK_EXTENSION ioctl is that
userspace can, at run-time, determine if a feature is supported or not.
This allows KVM to being supporting a new feature with a new kernel
version without any need to update user space. Unfortunately, since the
definition of KVM_CAP_READONLY_MEM was guarded by #ifdef
__KVM_HAVE_READONLY_MEM, such discovery still required a user space
update.
Therefore, unconditionally export KVM_CAP_READONLY_MEM and change the
in-kernel conditional to rely on __KVM_HAVE_READONLY_MEM.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We extract the vgic probe function from the of_device_id data pointer,
which is const. Kill the sparse warning by ensuring that the local
function pointer is also marked as const.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
vgic_ioaddr_overlap claims to return a bool, but in reality it returns
an int. Shut sparse up by fixing the type signature.
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
To support read-only memory regions on arm and arm64, we have a need to
resolve a gfn to an hva given a pointer to a memslot to avoid looping
through the memslots twice and to reuse the hva error checking of
gfn_to_hva_prot(), add a new gfn_to_hva_memslot_prot() function and
refactor gfn_to_hva_prot() to use this function.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Introduce preempt notifiers for architecture specific code.
Advantage over creating a new notifier in every arch is slightly simpler
code and guaranteed call order with respect to kvm_sched_in.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We dont have to wait for a grace period if there is no oldpid that
we are going to free. putpid also checks for NULL, so this patch
only fences synchronize_rcu.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As a generic function, deassign_guest_irq() assumes it can be called
even if assign_guest_irq() is not be called successfully (which can be
triggered by ioctl from user mode, indirectly).
So for assign_guest_irq() failure process, need set 'dev->irq_source_id'
to -1 after free 'dev->irq_source_id', or deassign_guest_irq() may free
it again.
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The third parameter of kvm_iommu_put_pages is wrong,
It should be 'gfn - slot->base_gfn'.
By making gfn very large, malicious guest or userspace can cause kvm to
go to this error path, and subsequently to pass a huge value as size.
Alternatively if gfn is small, then pages would be pinned but never
unpinned, causing host memory leak and local DOS.
Passing a reasonable but large value could be the most dangerous case,
because it would unpin a page that should have stayed pinned, and thus
allow the device to DMA into arbitrary memory. However, this cannot
happen because of the condition that can trigger the error:
- out of memory (where you can't allocate even a single page)
should not be possible for the attacker to trigger
- when exceeding the iommu's address space, guest pages after gfn
will also exceed the iommu's address space, and inside
kvm_iommu_put_pages() the iommu_iova_to_phys() will fail. The
page thus would not be unpinned at all.
Reported-by: Jack Morgenstein <jackm@mellanox.com>
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commits e4d57e1ee1 (KVM: Move irq notifier implementation into
eventfd.c, 2014-06-30) included the irq notifier code unconditionally
in eventfd.c, while it was under CONFIG_HAVE_KVM_IRQCHIP before.
Similarly, commit 297e21053a (KVM: Give IRQFD its own separate enabling
Kconfig option, 2014-06-30) moved code from CONFIG_HAVE_IRQ_ROUTING
to CONFIG_HAVE_KVM_IRQFD but forgot to move the pieces that used to be
under CONFIG_HAVE_KVM_IRQCHIP.
Together, this broke compilation without CONFIG_KVM_XICS. Fix by adding
or changing the #ifdefs so that they point at CONFIG_HAVE_KVM_IRQFD.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the IRQFD code is conditional on CONFIG_HAVE_KVM_IRQ_ROUTING.
So that we can have the IRQFD code compiled in without having the
IRQ routing code, this creates a new CONFIG_HAVE_KVM_IRQFD, makes
the IRQFD code conditional on it instead of CONFIG_HAVE_KVM_IRQ_ROUTING,
and makes all the platforms that currently select HAVE_KVM_IRQ_ROUTING
also select HAVE_KVM_IRQFD.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This moves the functions kvm_irq_has_notifier(), kvm_notify_acked_irq(),
kvm_register_irq_ack_notifier() and kvm_unregister_irq_ack_notifier()
from irqchip.c to eventfd.c. The reason for doing this is that those
functions are used in connection with IRQFDs, which are implemented in
eventfd.c. In future we will want to use IRQFDs on platforms that
don't implement the GSI routing implemented in irqchip.c, so we won't
be compiling in irqchip.c, but we still need the irq notifiers. The
implementation is unchanged.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that struct _irqfd does not keep a reference to storage pointed
to by the irq_routing field of struct kvm, we can move the statement
that updates it out from under the irqfds.lock and put it in
kvm_set_irq_routing() instead. That means we then have to take a
srcu_read_lock on kvm->irq_srcu around the irqfd_update call in
kvm_irqfd_assign(), since holding the kvm->irqfds.lock no longer
ensures that that the routing can't change.
Combined with changing kvm_irq_map_gsi() and kvm_irq_map_chip_pin()
to take a struct kvm * argument instead of the pointer to the routing
table, this allows us to to move all references to kvm->irq_routing
into irqchip.c. That in turn allows us to move the definition of the
kvm_irq_routing_table struct into irqchip.c as well.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This provides accessor functions for the KVM interrupt mappings, in
order to reduce the amount of code that accesses the fields of the
kvm_irq_routing_table struct, and restrict that code to one file,
virt/kvm/irqchip.c. The new functions are kvm_irq_map_gsi(), which
maps from a global interrupt number to a set of IRQ routing entries,
and kvm_irq_map_chip_pin, which maps from IRQ chip and pin numbers to
a global interrupt number.
This also moves the update of kvm_irq_routing_table::chip[][]
into irqchip.c, out of the various kvm_set_routing_entry
implementations. That means that none of the kvm_set_routing_entry
implementations need the kvm_irq_routing_table argument anymore,
so this removes it.
This does not change any locking or data lifetime rules.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This makes the irqfd code keep a copy of the irq routing table entry
for each irqfd, rather than a reference to the copy in the actual
irq routing table maintained in kvm/virt/irqchip.c. This will enable
us to change the routing table structure in future, or even not have a
routing table at all on some platforms.
The synchronization that was previously achieved using srcu_dereference
on the read side is now achieved using a seqcount_t structure. That
ensures that we don't get a halfway-updated copy of the structure if
we read it while another thread is updating it.
We still use srcu_read_lock/unlock around the read side so that when
changing the routing table we can be sure that after calling
synchronize_srcu, nothing will be using the old routing.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Highlights in this release include:
- BookE: Rework instruction fetch, not racy anymore now
- BookE HV: Fix ONE_REG accessors for some in-hardware registers
- Book3S: Good number of LE host fixes, enable HV on LE
- Book3S: Some misc bug fixes
- Book3S HV: Add in-guest debug support
- Book3S HV: Preload cache lines on context switch
- Remove 440 support
Alexander Graf (31):
KVM: PPC: Book3s PR: Disable AIL mode with OPAL
KVM: PPC: Book3s HV: Fix tlbie compile error
KVM: PPC: Book3S PR: Handle hyp doorbell exits
KVM: PPC: Book3S PR: Fix ABIv2 on LE
KVM: PPC: Book3S PR: Fix sparse endian checks
PPC: Add asm helpers for BE 32bit load/store
KVM: PPC: Book3S HV: Make HTAB code LE host aware
KVM: PPC: Book3S HV: Access guest VPA in BE
KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
KVM: PPC: Book3S HV: Access XICS in BE
KVM: PPC: Book3S HV: Fix ABIv2 on LE
KVM: PPC: Book3S HV: Enable for little endian hosts
KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct
KVM: PPC: Deflect page write faults properly in kvmppc_st
KVM: PPC: Book3S: Stop PTE lookup on write errors
KVM: PPC: Book3S: Add hack for split real mode
KVM: PPC: Book3S: Make magic page properly 4k mappable
KVM: PPC: Remove 440 support
KVM: Rename and add argument to check_extension
KVM: Allow KVM_CHECK_EXTENSION on the vm fd
KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
KVM: PPC: Implement kvmppc_xlate for all targets
KVM: PPC: Move kvmppc_ld/st to common code
KVM: PPC: Remove kvmppc_bad_hva()
KVM: PPC: Use kvm_read_guest in kvmppc_ld
KVM: PPC: Handle magic page in kvmppc_ld/st
KVM: PPC: Separate loadstore emulation from priv emulation
KVM: PPC: Expose helper functions for data/inst faults
KVM: PPC: Remove DCR handling
KVM: PPC: HV: Remove generic instruction emulation
KVM: PPC: PR: Handle FSCR feature deselects
Alexey Kardashevskiy (1):
KVM: PPC: Book3S: Fix LPCR one_reg interface
Aneesh Kumar K.V (4):
KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
KVM: PPC: BOOK3S: PR: Emulate instruction counter
KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page
Anton Blanchard (2):
KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
Bharat Bhushan (10):
kvm: ppc: bookehv: Added wrapper macros for shadow registers
kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
kvm: ppc: Add SPRN_EPR get helper function
kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
KVM: PPC: Booke-hv: Add one reg interface for SPRG9
KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
Michael Neuling (1):
KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling
Mihai Caraman (8):
KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
KVM: PPC: e500: Fix default tlb for victim hint
KVM: PPC: e500: Emulate power management control SPR
KVM: PPC: e500mc: Revert "add load inst fixup"
KVM: PPC: Book3e: Add TLBSEL/TSIZE defines for MAS0/1
KVM: PPC: Book3s: Remove kvmppc_read_inst() function
KVM: PPC: Allow kvmppc_get_last_inst() to fail
KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
Paul Mackerras (4):
KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication
Stewart Smith (2):
Split out struct kvmppc_vcore creation to separate function
Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJT21skAAoJECszeR4D/txgeFEP/AzJopN7s//W33CfyBqURHXp
XALCyAw+S67gtcaTZbxomcG1xuT8Lj9WEw28iz3rCtAnJwIxsY63xrI1nXMzTaI2
p1rC0ai5Qy+nlEbd6L78spZy/Nzh8DFYGWx78iUSO1mYD8xywJwtoiBA539pwp8j
8N+mgn61Hwhv31bKtsZlmzXymVr/jbTp5LVuxsBLJwD2lgT49g+4uBnX2cG/iXkg
Rzbh7LxoNNXrSPI8sYmTWu/81aeXteeX70ja6DHuV5dWLNTuAXJrh5EUfeAZqBrV
aYcLWUYmIyB87txNmt6ZGVar2p3jr2Xhb9mKx+EN4dbehblanLc1PUqlHd0q3dKc
Nt60ByqpZn+qDAK86dShSZLEe+GT3lovvE76CqVXD4Er+OUEkc9JoxhN1cof/Gb0
o6uwZ2isXHRdGoZx5vb4s3UTOlwZGtoL/CyY/HD/ujYDSURkCGbxLj3kkecSY8ut
QdDAWsC15BwsHtKLr5Zwjp2w+0eGq2QJgfvO0zqWFiz9k33SCBCUpwluFeqh27Hi
aR5Wir3j+MIw9G8XlYlDJWYfi0h/SZ4G7hh7jSu26NBNBzQsDa8ow/cLzdMhdUwH
OYSaeqVk5wiRb9to1uq1NQWPA0uRAx3BSjjvr9MCGRqmvn+FV5nj637YWUT+53Hi
aSvg/U2npghLPPG2cihu
=JuLr
-----END PGP SIGNATURE-----
Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm
Patch queue for ppc - 2014-08-01
Highlights in this release include:
- BookE: Rework instruction fetch, not racy anymore now
- BookE HV: Fix ONE_REG accessors for some in-hardware registers
- Book3S: Good number of LE host fixes, enable HV on LE
- Book3S: Some misc bug fixes
- Book3S HV: Add in-guest debug support
- Book3S HV: Preload cache lines on context switch
- Remove 440 support
Alexander Graf (31):
KVM: PPC: Book3s PR: Disable AIL mode with OPAL
KVM: PPC: Book3s HV: Fix tlbie compile error
KVM: PPC: Book3S PR: Handle hyp doorbell exits
KVM: PPC: Book3S PR: Fix ABIv2 on LE
KVM: PPC: Book3S PR: Fix sparse endian checks
PPC: Add asm helpers for BE 32bit load/store
KVM: PPC: Book3S HV: Make HTAB code LE host aware
KVM: PPC: Book3S HV: Access guest VPA in BE
KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
KVM: PPC: Book3S HV: Access XICS in BE
KVM: PPC: Book3S HV: Fix ABIv2 on LE
KVM: PPC: Book3S HV: Enable for little endian hosts
KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct
KVM: PPC: Deflect page write faults properly in kvmppc_st
KVM: PPC: Book3S: Stop PTE lookup on write errors
KVM: PPC: Book3S: Add hack for split real mode
KVM: PPC: Book3S: Make magic page properly 4k mappable
KVM: PPC: Remove 440 support
KVM: Rename and add argument to check_extension
KVM: Allow KVM_CHECK_EXTENSION on the vm fd
KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
KVM: PPC: Implement kvmppc_xlate for all targets
KVM: PPC: Move kvmppc_ld/st to common code
KVM: PPC: Remove kvmppc_bad_hva()
KVM: PPC: Use kvm_read_guest in kvmppc_ld
KVM: PPC: Handle magic page in kvmppc_ld/st
KVM: PPC: Separate loadstore emulation from priv emulation
KVM: PPC: Expose helper functions for data/inst faults
KVM: PPC: Remove DCR handling
KVM: PPC: HV: Remove generic instruction emulation
KVM: PPC: PR: Handle FSCR feature deselects
Alexey Kardashevskiy (1):
KVM: PPC: Book3S: Fix LPCR one_reg interface
Aneesh Kumar K.V (4):
KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
KVM: PPC: BOOK3S: PR: Emulate instruction counter
KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page
Anton Blanchard (2):
KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
Bharat Bhushan (10):
kvm: ppc: bookehv: Added wrapper macros for shadow registers
kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
kvm: ppc: Add SPRN_EPR get helper function
kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
KVM: PPC: Booke-hv: Add one reg interface for SPRG9
KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
Michael Neuling (1):
KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling
Mihai Caraman (8):
KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
KVM: PPC: e500: Fix default tlb for victim hint
KVM: PPC: e500: Emulate power management control SPR
KVM: PPC: e500mc: Revert "add load inst fixup"
KVM: PPC: Book3e: Add TLBSEL/TSIZE defines for MAS0/1
KVM: PPC: Book3s: Remove kvmppc_read_inst() function
KVM: PPC: Allow kvmppc_get_last_inst() to fail
KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
Paul Mackerras (4):
KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication
Stewart Smith (2):
Split out struct kvmppc_vcore creation to separate function
Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
Conflicts:
Documentation/virtual/kvm/api.txt
- Fixes and code refactoring for stage2 kvm MMU unmap_range
- Support unmapping IPAs on deleting memslots for arm and arm64
- Support MMIO mappings in stage2 faults
- KVM VGIC v2 emulation on GICv3 hardware
- Big-Endian support for arm/arm64 (guest and host)
- Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)
- Detect non page-aligned GICV regions and bail out (plugs guest-can-crash host bug)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT3oZTAAoJEEtpOizt6ddyKQIH/1Bj/cZYSkkSf3IJfQhHRbWN
jS37IBsvcHwjHkRJxCNmQuKP/Ho5XEusluPGrVY25PAgBMl+ouPqAuKzUk+GEab6
snjJjDFqw0zs0x0h3tg6UwfZdF+eyyIkmFGn8/IATD5P3PPd8kWBVtYnSnZmYK+R
KJNVcp6RPDrt9kvUDY8Ln9fW99Jl+7CdgQAnc3QkHcXUlanLyrfq+fE1lSzyrbhZ
ETzyMFAX4kCdc8tflgyyBS4A7+RvfQ6ZIQummxoAMFHIoSk90dtK7ovX68rd9U3e
yL+mpe130+dTIFpUMbxCnIdE7C0eud3vcgXC6MuWtFjUrxQoaEgsVE+ffGC5tX0=
=axkp
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm
KVM/ARM New features for 3.17 include:
- Fixes and code refactoring for stage2 kvm MMU unmap_range
- Support unmapping IPAs on deleting memslots for arm and arm64
- Support MMIO mappings in stage2 faults
- KVM VGIC v2 emulation on GICv3 hardware
- Big-Endian support for arm/arm64 (guest and host)
- Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)
Conflicts:
virt/kvm/arm/vgic.c [last minute cherry-pick from 3.17 to 3.16]
few days.
MIPS and s390 have little going on this release; just bugfixes, some
small, some larger.
The highlights for x86 are nested VMX improvements (Jan Kiszka), optimizations
for old processor (up to Nehalem, by me and Bandan Das), and a lot of x86
emulator bugfixes (Nadav Amit).
Stephen Rothwell reported a trivial conflict with the tracing branch.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJT300XAAoJEBvWZb6bTYby3V8QAJz+XyajnhJ8wH55Vxczz22L
i2gtUGmBLhEXsBcaVKO4BBfek88lLzg0SGLjfW5wCMQmKtxVlrwTCXNkBoPGjapd
NwHtWkMKym44PDhRovn7zkSumkxC43uFIBR/ebrhP6Bvhh9s+MnkQUxfw9ILB+YV
EeKyEG8sSgxFCciuHbp3mIXpDcO6r/ldy6I7009OdyhLoMY+Kvmk7kRe9wtAivdg
CGJi60QvGOn2RGRPOCEtF6UWr8Ae8fe1t84o0hkXPv/j3jtabzAatXKJa4dYNbIs
7Mp4NQpxaGV6rq3WCYVeZRxGs+UReGDAS3Il4Z8C9eTOTooSfxdVr8acpM8PY6I8
UmLT6ECLGycc4ELXrETtR+QLmiXACyJqyVxz4aiLV3kWSWfamKD3hBeQK9NizNcE
VoPDl+PyISvR1tW4KstBuzfUWAEXi+gO78cqqFr/VW6cl7HKpA1DFQaPfGkYKDae
2CPwcLwI5/M6RtSgkyXTkEqNZLc2BjldqSeM1lmWjhZVW56X2iqePUL46Vab3Yvt
U+sELtwEE560NLN3hbaHUsLR1tcUix5w8vTzcXPxgoHQBszHCcAZTWd1XHulr64F
rp/cangqtkPKcu5j1mNhQs38oLjHI1MUsbQrqFoD4tmHjQ75iXHRFzYGoIVKXyHG
AnGbQzJzBcdAANhm3LW0
=UXxV
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Paolo Bonzini:
"These are the x86, MIPS and s390 changes; PPC and ARM will come in a
few days.
MIPS and s390 have little going on this release; just bugfixes, some
small, some larger.
The highlights for x86 are nested VMX improvements (Jan Kiszka),
optimizations for old processor (up to Nehalem, by me and Bandan Das),
and a lot of x86 emulator bugfixes (Nadav Amit).
Stephen Rothwell reported a trivial conflict with the tracing branch"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (104 commits)
x86/kvm: Resolve shadow warnings in macro expansion
KVM: s390: rework broken SIGP STOP interrupt handling
KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
KVM: vmx: remove duplicate vmx_mpx_supported() prototype
KVM: s390: Fix memory leak on busy SIGP stop
x86/kvm: Resolve shadow warning from min macro
kvm: Resolve missing-field-initializers warnings
Replace NR_VMX_MSR with its definition
KVM: x86: Assertions to check no overrun in MSR lists
KVM: x86: set rflags.rf during fault injection
KVM: x86: Setting rflags.rf during rep-string emulation
KVM: x86: DR6/7.RTM cannot be written
KVM: nVMX: clean up nested_release_vmcs12 and code around it
KVM: nVMX: fix lifetime issues for vmcs02
KVM: x86: Defining missing x86 vectors
KVM: x86: emulator injects #DB when RFLAGS.RF is set
KVM: x86: Cleanup of rflags.rf cleaning
KVM: x86: Clear rflags.rf on emulated instructions
KVM: x86: popf emulation should not change RF
KVM: x86: Clearing rflags.rf upon skipped emulated instruction
...
Just like GICv2 was fixed in 63afbe7a0a
(kvm: arm64: vgic: fix hyp panic with 64k pages on juno platform),
mandate the GICV region to be both aligned on a page boundary and
its size to be a multiple of page size.
This prevents a guest from being able to poke at regions where we
have no idea what is sitting there.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently, the EOI exit bitmap (used for APICv) does not include
interrupts that are masked. However, this can cause a bug that manifests
as an interrupt storm inside the guest. Alex Williamson reported the
bug and is the one who really debugged this; I only wrote the patch. :)
The scenario involves a multi-function PCI device with OHCI and EHCI
USB functions and an audio function, all assigned to the guest, where
both USB functions use legacy INTx interrupts.
As soon as the guest boots, interrupts for these devices turn into an
interrupt storm in the guest; the host does not see the interrupt storm.
Basically the EOI path does not work, and the guest continues to see the
interrupt over and over, even after it attempts to mask it at the APIC.
The bug is only visible with older kernels (RHEL6.5, based on 2.6.32
with not many changes in the area of APIC/IOAPIC handling).
Alex then tried forcing bit 59 (corresponding to the USB functions' IRQ)
on in the eoi_exit_bitmap and TMR, and things then work. What happens
is that VFIO asserts IRQ11, then KVM recomputes the EOI exit bitmap.
It does not have set bit 59 because the RTE was masked, so the IOAPIC
never sees the EOI and the interrupt continues to fire in the guest.
My guess was that the guest is masking the interrupt in the redirection
table in the interrupt routine, i.e. while the interrupt is set in a
LAPIC's ISR, The simplest fix is to ignore the masking state, we would
rather have an unnecessary exit rather than a missed IRQ ACK and anyway
IOAPIC interrupts are not as performance-sensitive as for example MSIs.
Alex tested this patch and it fixed his bug.
[Thanks to Alex for his precise description of the problem
and initial debugging effort. A lot of the text above is
based on emails exchanged with him.]
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the physical address of GICV isn't page-aligned, then we end up
creating a stage-2 mapping of the page containing it, which causes us to
map neighbouring memory locations directly into the guest.
As an example, consider a platform with GICV at physical 0x2c02f000
running a 64k-page host kernel. If qemu maps this into the guest at
0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will
map host physical region 0x2c020000 - 0x2c02efff. Accesses to these
physical regions may cause UNPREDICTABLE behaviour, for example, on the
Juno platform this will cause an SError exception to EL3, which brings
down the entire physical CPU resulting in RCU stalls / HYP panics / host
crashing / wasted weeks of debugging.
SBSA recommends that systems alias the 4k GICV across the bounding 64k
region, in which case GICV physical could be described as 0x2c020000 in
the above scenario.
This patch fixes the problem by failing the vgic probe if the physical
base address or the size of GICV aren't page-aligned. Note that this
generated a warning in dmesg about freeing enabled IRQs, so I had to
move the IRQ enabling later in the probe.
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joel Schopp <joel.schopp@amd.com>
Cc: Don Dutile <ddutile@redhat.com>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Acked-by: Joel Schopp <joel.schopp@amd.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The KVM_CHECK_EXTENSION is only available on the kvm fd today. Unfortunately
on PPC some of the capabilities change depending on the way a VM was created.
So instead we need a way to expose capabilities as VM ioctl, so that we can
see which VM type we're using (HV or PR). To enable this, add the
KVM_CHECK_EXTENSION ioctl to our vm ioctl portfolio.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation to make the check_extension function available to VM scope
we add a struct kvm * argument to the function header and rename the function
accordingly. It will still be called from the /dev/kvm fd, but with a NULL
argument for struct kvm *.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Resolve missing-field-initializers warnings seen in W=2 kernel
builds by having macros generate more elaborated initializers.
That is enough to silence the warnings.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix vgic_bitmap_get_reg function to return 'right' word address of
'unsigned long' bitmap value in case of BE 64bit image.
Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
According to recent clarifications of mmio.data array meaning -
the mmio.data array should hold bytes as they would appear in
memory. Vgic is little endian device. And in case of BE image
kernel side that emulates vgic, holds data in BE form. So we
need to byteswap cpu<->le32 vgic registers when we read/write them
from mmio.data[].
Change has no effect in LE case because cpu already runs in le32.
Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Introduce the support code for emulating a GICv2 on top of GICv3
hardware.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Move the GICv2 world switch code into its own file, and add the
necessary indirection to the arm64 switch code.
Also introduce a new type field to the vgic_params structure.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
So far, irqchip_in_kernel() was implemented by testing the value of
vctrl_base, which worked fine with GICv2.
With GICv3, this field is useless, as we're using system registers
instead of a emmory mapped interface. To solve this, add a boolean
flag indicating if the we're using a vgic or not.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Brutally hack the innocent vgic code, and move the GICv2 specific code
to its own file, using vgic_ops and vgic_params as a way to pass
information between the two blocks.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Move all the data specific to a given GIC implementation into its own
little structure.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Move the code dealing with enabling the VGIC on to vgic_ops.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Instead of directly messing with with the GICH_VMCR bits for the CPU
interface save/restore code, add accessors that encode/decode the
entire set of registers exposed by VMCR.
Not the most efficient thing, but given that this code is only used
by the save/restore code, performance is far from being critical.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Move the code dealing with LR underflow handling to its own functions,
and make them accessible through vgic_ops.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Instead of directly dealing with the GICH_MISR bits, move the code to
its own function and use a couple of public flags to represent the
actual state.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Move the GICH_EISR access to its own function.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Move the GICH_ELRSR access to its own functions, and add them to
the vgic_ops structure.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
In order to split the various register manipulation from the main vgic
code, introduce a vgic_ops structure, and start by abstracting the
LR manipulation code with a couple of accessors.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
In order to make way for the GICv3 registers, move the v2-specific
registers to their own structure.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Pull more scheduler updates from Ingo Molnar:
"Second round of scheduler changes:
- try-to-wakeup and IPI reduction speedups, from Andy Lutomirski
- continued power scheduling cleanups and refactorings, from Nicolas
Pitre
- misc fixes and enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Delete extraneous extern for to_ratio()
sched/idle: Optimize try-to-wake-up IPI
sched/idle: Simplify wake_up_idle_cpu()
sched/idle: Clear polling before descheduling the idle thread
sched, trace: Add a tracepoint for IPI-less remote wakeups
cpuidle: Set polling in poll_idle
sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
sched: Rename capacity related flags
sched: Final power vs. capacity cleanups
sched: Remove remaining dubious usage of "power"
sched: Let 'struct sched_group_power' care about CPU capacity
sched/fair: Disambiguate existing/remaining "capacity" usage
sched/fair: Change "has_capacity" to "has_free_capacity"
sched/fair: Remove "power" from 'struct numa_stats'
sched: Fix signedness bug in yield_to()
sched/fair: Use time_after() in record_wakee()
sched/balancing: Reduce the rate of needless idle load balancing
sched/fair: Fix unlocked reads of some cfs_b->quota/period
yield_to() is supposed to return -ESRCH if there is no task to
yield to, but because the type is bool that is the same as returning
true.
The only place I see which cares is kvm_vcpu_on_spin().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration,
GDB support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace interface
and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still,
we have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
bAknaZpPiOrW11Et1htY
=j5Od
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
Pull KVM updates from Paolo Bonzini:
"At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration, GDB
support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by
Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace
interface and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still, we
have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
KVM: add missing cleanup_srcu_struct
KVM: PPC: Book3S PR: Rework SLB switching code
KVM: PPC: Book3S PR: Use SLB entry 0
KVM: PPC: Book3S HV: Fix machine check delivery to guest
KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
KVM: PPC: Book3S HV: Fix dirty map for hugepages
KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
KVM: PPC: Book3S: Add ONE_REG register names that were missed
KVM: PPC: Add CAP to indicate hcall fixes
KVM: PPC: MPIC: Reset IRQ source private members
KVM: PPC: Graciously fail broken LE hypercalls
PPC: ePAPR: Fix hypercall on LE guest
KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
KVM: PPC: BOOK3S: Always use the saved DAR value
PPC: KVM: Make NX bit available with magic page
KVM: PPC: Disable NX for old magic page using guests
KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
...
When starting lots of dataplane devices the bootup takes very long on
Christian's s390 with irqfd patches. With larger setups he is even
able to trigger some timeouts in some components. Turns out that the
KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec)
when having multiple CPUs. This is caused by the synchronize_rcu and
the HZ=100 of s390. By changing the code to use a private srcu we can
speed things up. This patch reduces the boot time till mounting root
from 8 to 2 seconds on my s390 guest with 100 disks.
Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
are fine because they do not have lockdep checks (hlist_for_each_entry_rcu
uses rcu_dereference_raw rather than rcu_dereference, and write-sides
do not do rcu lockdep at all).
Note that we're hardly relying on the "sleepable" part of srcu. We just
want SRCU's faster detection of grace periods.
Testing was done by Andrew Theurer using netperf tests STREAM, MAERTS
and RR. The difference between results "before" and "after" the patch
has mean -0.2% and standard deviation 0.6%. Using a paired t-test on the
data points says that there is a 2.5% probability that the patch is the
cause of the performance difference (rather than a random fluctuation).
(Restricting the t-test to RR, which is the most likely to be affected,
changes the numbers to respectively -0.3% mean, 0.7% stdev, and 8%
probability that the numbers actually say something about the patch.
The probability increases mostly because there are fewer data points).
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Includes vgic fixes, a possible kernel corruption bug due to
misalignment of pages and disabling of KVM in KConfig on big-endian
systems, because the last one breaks the build.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJTYNiqAAoJEEtpOizt6ddy4NwH/3ZVN7sgC5vKiKEf0n5wNdN2
zCMNOnjKfaZN7dUval3eT3qF6h0emDqW5pOFstHwoFvuMAFauLMWPQCbU1m+bl3K
gD745kVniLKGHyE4rEwOiUNEiYGbiP44DeC1oGlirSiGNptMQjeAi3dhEtJpedES
xtn3jY26bWrIdOZ75/pvFix2qE8CXmRJU2oEvsZ0B5gGkqsblrlcY+ascot4Rm8t
M88SAhGs6pzMWpjfOOm55E2BXISQw18KMzETRWZgmmYgYQOaR2sH0USwQuI/Uhvx
1UZBZSYz3KYEx3kxKnXyS7qZyWQOY8p+y487Ty9VTlzuat2gxXH9TMMA39ZIGak=
=gpor
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-3.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
First round of KVM/ARM Fixes for 3.15
Includes vgic fixes, a possible kernel corruption bug due to
misalignment of pages and disabling of KVM in KConfig on big-endian
systems, because the last one breaks the build.
Currently below check in vgic_ioaddr_overlap will always succeed,
because the vgic dist base and vgic cpu base are still kept UNDEF
after initialization. The code as follows will be return forever.
if (IS_VGIC_ADDR_UNDEF(dist) || IS_VGIC_ADDR_UNDEF(cpu))
return 0;
So, before invoking the vgic_ioaddr_overlap, it needs to set the
corresponding base address firstly.
Signed-off-by: Haibin Wang <wanghaibin.wang@huawei.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
async_pf_execute() passes tsk == current to gup(), this is doesn't
hurt but unnecessary and misleading. "tsk" is only used to account
the number of faults and current is the random workqueue thread.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
async_pf_execute() has no reasons to adopt apf->mm, gup(current, mm)
should work just fine even if current has another or NULL ->mm.
Recently kvm_async_page_present_sync() was added insedie the "use_mm"
section, but it seems that it doesn't need current->mm too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since KVM internally represents the ICFGR registers by stuffing two
of them into one word, the offset for accessing the internal
representation and the one for the MMIO based access are different.
So keep the original offset around, but adjust the internal array
offset by one bit.
Reported-by: Haibin Wang <wanghaibin.wang@huawei.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
get_user_pages(mm) is simply wrong if mm->mm_users == 0 and exit_mmap/etc
was already called (or is in progress), mm->mm_count can only pin mm->pgd
and mm_struct itself.
Change kvm_setup_async_pf/async_pf_execute to inc/dec mm->mm_users.
kvm_create_vm/kvm_destroy_vm play with ->mm_count too but this case looks
fine at first glance, it seems that this ->mm is only used to verify that
current->mm == kvm->mm.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When dispatch SGI(mode == 0), that is the vcpu of VM should send
sgi to the cpu which the target_cpus list.
So, there must add the "break" to branch of case 0.
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Haibin Wang <wanghaibin.wang@huawei.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
As result of deprecation of MSI-X/MSI enablement functions
pci_enable_msix() and pci_enable_msi_block() all drivers
using these two interfaces need to be updated to use the
new pci_enable_msi_range() or pci_enable_msi_exact()
and pci_enable_msix_range() or pci_enable_msix_exact()
interfaces.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 5befdc385d.
Since we will allow flush tlb out of mmu-lock in the later
patch
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
-------------------------
Linux does not use the ACC and F bits of the storage key. Newer Linux
versions also do not use the storage keys for dirty and reference
tracking. We can optimize the guest handling for those guests for faults
as well as page-in and page-out by simply not caring about the guest
visible storage key. We trap guest storage key instruction to enable
those keys only on demand.
Migration bitmap
Until now s390 never provided a proper dirty bitmap. Let's provide a
proper migration bitmap for s390. We also change the user dirty tracking
to a fault based mechanism. This makes the host completely independent
from the storage keys. Long term this will allow us to back guest memory
with large pages.
per-VM device attributes
------------------------
To avoid the introduction of new ioctls, let's provide the
attribute semanantic also on the VM-"device".
Userspace controlled CMMA
-------------------------
The CMMA assist is changed from "always on" to "on if requested" via
per-VM device attributes. In addition a callback to reset all usage
states is provided.
Proper guest DAT handling for intercepts
----------------------------------------
While instructions handled by SIE take care of all addressing aspects,
KVM/s390 currently does not care about guest address translation of
intercepts. This worked out fine, because
- the s390 Linux kernel has a 1:1 mapping between kernel virtual<->real
for all pages up to memory size
- intercepts happen only for a small amount of cases
- all of these intercepts happen to be in the kernel text for current
distros
Of course we need to be better for other intercepts, kernel modules etc.
We provide the infrastructure and rework all in-kernel intercepts to work
on logical addresses (paging etc) instead of real ones. The code has
been running internally for several months now, so it is time for going
public.
GDB support
-----------
We provide breakpoints, single stepping and watchpoints.
Fixes/Cleanups
--------------
- Improve program check delivery
- Factor out the handling of transactional memory on program checks
- Use the existing define __LC_PGM_TDB
- Several cleanups in the lowcore structure
- Documentation
NOTES
-----
- All patches touching base s390 are either ACKed or written by the s390
maintainers
- One base KVM patch "KVM: add kvm_is_error_gpa() helper"
- One patch introduces the notion of VM device attributes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJTVlHZAAoJEBF7vIC1phx8REgP/1P0EUzfBpoS53z1v60n2uLT
lW79LY9Op4/ZacEgHtU9LzmGa88X0arDsIpBZQsTNLF77AGFcMCCV3X2il/lQrRG
KSE+ycKLoFjCcES442DwF4gHoGldD+KL/+5LPWSQZtvb9dDpHDft9aeMRBbpUL0Q
M2kKQDlmJ2XqQu3D5PwSHgVRByHiHOzmTe2ejSSbdppkwBpaiqSBBBk0jVYDW9Jh
eqUnBcrrYW2p+QS37ELM6hOkfDXN/vXoHBQeyca19TuZVCPNA7HeJaPc2mJ/GZk9
wrNWEmY3f/lY0lk0zMwBwsDOS5K7jbtvXzcex6m+NsIqQuOvKsmPBy1BWb/axcK5
uZq/JGFC0fxsFU+7ImtvQrJ/DMHnVuvSKF4WUVle2GdMlDIqkguwX27WwHSiH4/r
Au02KlVIMUZdLAEUrw/W/S4MPLeZYoGfetHGCOmSaP2qGc97BVFedZaqekDlUgMw
3gIoQmSIBcfrgF4k9N4nLjdhAX2S4gkviwF3pTlIkecNfa7RcI3Xk7U9mVPmIhL4
IquVqjdXZH4m0e4gViBMtQ0IPwGt1qFlV6Wv3O9MExhfi7VQ8M8TMYNhEvtGpY75
cuZwZYGM4FqszDAy9hbk0avTLqCxqlTiBKi3tHoQMappQmsJPrIdxIpev3MZPHCp
vZMkbzhM9l3eefNJVw66
=jxBp
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-20140422' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into queue
Lazy storage key handling
-------------------------
Linux does not use the ACC and F bits of the storage key. Newer Linux
versions also do not use the storage keys for dirty and reference
tracking. We can optimize the guest handling for those guests for faults
as well as page-in and page-out by simply not caring about the guest
visible storage key. We trap guest storage key instruction to enable
those keys only on demand.
Migration bitmap
Until now s390 never provided a proper dirty bitmap. Let's provide a
proper migration bitmap for s390. We also change the user dirty tracking
to a fault based mechanism. This makes the host completely independent
from the storage keys. Long term this will allow us to back guest memory
with large pages.
per-VM device attributes
------------------------
To avoid the introduction of new ioctls, let's provide the
attribute semanantic also on the VM-"device".
Userspace controlled CMMA
-------------------------
The CMMA assist is changed from "always on" to "on if requested" via
per-VM device attributes. In addition a callback to reset all usage
states is provided.
Proper guest DAT handling for intercepts
----------------------------------------
While instructions handled by SIE take care of all addressing aspects,
KVM/s390 currently does not care about guest address translation of
intercepts. This worked out fine, because
- the s390 Linux kernel has a 1:1 mapping between kernel virtual<->real
for all pages up to memory size
- intercepts happen only for a small amount of cases
- all of these intercepts happen to be in the kernel text for current
distros
Of course we need to be better for other intercepts, kernel modules etc.
We provide the infrastructure and rework all in-kernel intercepts to work
on logical addresses (paging etc) instead of real ones. The code has
been running internally for several months now, so it is time for going
public.
GDB support
-----------
We provide breakpoints, single stepping and watchpoints.
Fixes/Cleanups
--------------
- Improve program check delivery
- Factor out the handling of transactional memory on program checks
- Use the existing define __LC_PGM_TDB
- Several cleanups in the lowcore structure
- Documentation
NOTES
-----
- All patches touching base s390 are either ACKed or written by the s390
maintainers
- One base KVM patch "KVM: add kvm_is_error_gpa() helper"
- One patch introduces the notion of VM device attributes
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Conflicts:
include/uapi/linux/kvm.h
Replace the kvm_s390_sync_dirty_log() stub with code to construct the KVM
dirty_bitmap from S390 memory change bits. Also add code to properly clear
the dirty_bitmap size when clearing the bitmap.
Signed-off-by: Jason J. Herne <jjherne@us.ibm.com>
CC: Dominik Dingel <dingel@linux.vnet.ibm.com>
[Dominik Dingel: use gmap_test_and_clear_dirty, locking fixes]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.
Unfortunately, this only works if userspace does not need to match on
access length and data. The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
- minimize overhead for old userspace that does not use eventfd with lengtth = 0
- minimize disruption in other code (since we don't know the length,
devices on the MMIO bus only get a valid address in write, this
way we don't need to touch all devices to teach them to handle
an invalid length)
At the moment, this optimization only has effect for EPT on x86.
It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.
With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measureable slowdown to non-eventfd MMIO.
Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.
The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
pre-review and suggestions.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It is sometimes benefitial to ignore IO size, and only match on address.
In hindsight this would have been a better default than matching length
when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind
of access can be optimized on VMX: there no need to do page lookups.
This can currently be done with many ioeventfds but in a suboptimal way.
However we can't change kernel/userspace ABI without risk of breaking
some applications.
Use len = 0 to mean "ignore length for matching" in a more optimal way.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Pull KVM fixes from Marcelo Tosatti:
- Fix for guest triggerable BUG_ON (CVE-2014-0155)
- CR4.SMAP support
- Spurious WARN_ON() fix
* git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: remove WARN_ON from get_kernel_ns()
KVM: Rename variable smep to cr4_smep
KVM: expose SMAP feature to guest
KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
KVM: Add SMAP support when setting CR4
KVM: Remove SMAP bit from CR4_RESERVED_BITS
KVM: ioapic: try to recover if pending_eoi goes out of range
KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)
Commit 8146875de7 (arm, kvm: Fix CPU hotplug callback registration)
holds the lock before calling the two functions:
kvm_vgic_hyp_init()
kvm_timer_hyp_init()
and both the two functions are calling register_cpu_notifier()
to register cpu notifier, so cause double lock on cpu_add_remove_lock.
Considered that both two functions are only called inside
kvm_arch_init() with holding cpu_add_remove_lock, so simply use
__register_cpu_notifier() to fix the problem.
Fixes: 8146875de7 (arm, kvm: Fix CPU hotplug callback registration)
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The RTC tracking code tracks the cardinality of rtc_status.dest_map
into rtc_status.pending_eoi. It has some WARN_ONs that trigger if
pending_eoi ever becomes negative; however, these do not do anything
to recover, and it bad things will happen soon after they trigger.
When the next RTC interrupt is triggered, rtc_check_coalesced() will
return false, but ioapic_service will find pending_eoi != 0 and
do a BUG_ON. To avoid this, should pending_eoi ever be nonzero,
call kvm_rtc_eoi_tracking_restore_all to recompute a correct
dest_map and pending_eoi.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
QE reported that they got the BUG_ON in ioapic_service to trigger.
I cannot reproduce it, but there are two reasons why this could happen.
The less likely but also easiest one, is when kvm_irq_delivery_to_apic
does not deliver to any APIC and returns -1.
Because irqe.shorthand == 0, the kvm_for_each_vcpu loop in that
function is never reached. However, you can target the similar loop in
kvm_irq_delivery_to_apic_fast; just program a zero logical destination
address into the IOAPIC, or an out-of-range physical destination address.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Allow the vfio-type1 IOMMU to support multiple domains within a container
- Plumb path to query whether all domains are cache-coherent
- Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
- Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)
The first patch also makes the vfio-type1 IOMMU driver completely independent
of the bus_type of the devices it's handling, which enables it to be used for
both vfio-pci and a future vfio-platform (and hopefully combinations involving
both simultaneously).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTPbikAAoJECObm247sIsiG9cP/jnurW84YuAHmybzy3R4nMaa
8BWcQST+TyQ78GWvubFDcRt+vHmJUI4iFWPBIm1twBSVrMy6F1GcT/spmSne1Dfb
OvcfGEy59tGO+BeklHg5Qq2Hj1UzfeV9uQoy+PrpTF/sYzsyL6g5+O3Zv39SNvr0
zCwnO2JAKXIpQlkK3wVha6V13X07Z/+d2b4JSsvmON2cnlPzOUYNrgBvoSeV2IQe
3kMAU9qfAfQlpNDhRRZL/HshgWazCg4XZUp9UdNjUkOxrUp4vXHrVTUJnUGBDytk
8V9UGRKF3mik9PqpZJk4jLV5urgUVpnUR5747uqs0KF+9GWxClXvh3gp1XX8Zn7f
NCfDSMn/wrDr6fQBglKUadITaDhF49KV+J0cX083q46BbYxhAfgxv1I9UOvLreG6
SGAezlTB0mefa1BmSwJdfKwDBsuDOhbF0TsHC6zT7yFc8pqe3hl/vQzUyu3HtwuM
yPycQiQQmvOaymaiiBRwyLocwJbDNKPigDomv8NZ8Mcd97Fu9YsF4ydaxMJmmeKf
TffYS4z4tUElYiU2vv1ipB8T+ReCmdtj2cIvyfwTL9jycY9ouO4bJ56dOGWd3WNl
m7AVokbrrJQ03itUpFMuYjHAzTNUlXVvXlGr9n6L4VTR0RTL1zwJVrMVDsXdhxm4
XgDbVB5PrxinDAC1vJvI
=mnzO
-----END PGP SIGNATURE-----
Merge tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
"VFIO updates for v3.15 include:
- Allow the vfio-type1 IOMMU to support multiple domains within a
container
- Plumb path to query whether all domains are cache-coherent
- Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
- Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)
The first patch also makes the vfio-type1 IOMMU driver completely
independent of the bus_type of the devices it's handling, which
enables it to be used for both vfio-pci and a future vfio-platform
(and hopefully combinations involving both simultaneously)"
* tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio:
vfio: always select ANON_INODES
kvm/vfio: Support for DMA coherent IOMMUs
vfio: Add external user check extension interface
vfio/type1: Add extension to test DMA cache coherence of IOMMU
vfio/iommu_type1: Multi-IOMMU domain support
Pull kvm updates from Paolo Bonzini:
"PPC and ARM do not have much going on this time. Most of the cool
stuff, instead, is in s390 and (after a few releases) x86.
ARM has some caching fixes and PPC has transactional memory support in
guests. MIPS has some fixes, with more probably coming in 3.16 as
QEMU will soon get support for MIPS KVM.
For x86 there are optimizations for debug registers, which trigger on
some Windows games, and other important fixes for Windows guests. We
now expose to the guest Broadwell instruction set extensions and also
Intel MPX. There's also a fix/workaround for OS X guests, nested
virtualization features (preemption timer), and a couple kvmclock
refinements.
For s390, the main news is asynchronous page faults, together with
improvements to IRQs (floating irqs and adapter irqs) that speed up
virtio devices"
* tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (96 commits)
KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8
KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset
KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
KVM: PPC: Book3S HV: Return ENODEV error rather than EIO
KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code
KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state
KVM: PPC: Book3S HV: Add transactional memory support
KVM: Specify byte order for KVM_EXIT_MMIO
KVM: vmx: fix MPX detection
KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n
KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE
KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write
KVM: s390: clear local interrupts at cpu initial reset
KVM: s390: Fix possible memory leak in SIGP functions
KVM: s390: fix calculation of idle_mask array size
KVM: s390: randomize sca address
KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
KVM: Bump KVM_MAX_IRQ_ROUTES for s390
KVM: s390: irq routing for adapter interrupts.
KVM: s390: adapter interrupt sources
...
Pull x86 LTO changes from Peter Anvin:
"More infrastructure work in preparation for link-time optimization
(LTO). Most of these changes is to make sure symbols accessed from
assembly code are properly marked as visible so the linker doesn't
remove them.
My understanding is that the changes to support LTO are still not
upstream in binutils, but are on the way there. This patchset should
conclude the x86-specific changes, and remaining patches to actually
enable LTO will be fed through the Kbuild tree (other than keeping up
with changes to the x86 code base, of course), although not
necessarily in this merge window"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
Kbuild, lto: Handle basic LTO in modpost
Kbuild, lto: Disable LTO for asm-offsets.c
Kbuild, lto: Add a gcc-ld script to let run gcc as ld
Kbuild, lto: add ld-version and ld-ifversion macros
Kbuild, lto: Drop .number postfixes in modpost
Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
lto: Disable LTO for sys_ni
lto: Handle LTO common symbols in module loader
lto, workaround: Add workaround for initcall reordering
lto: Make asmlinkage __visible
x86, lto: Disable LTO for the x86 VDSO
initconst, x86: Fix initconst mistake in ts5500 code
initconst: Fix initconst mistake in dcdbas
asmlinkage: Make trace_hardirqs_on/off_caller visible
asmlinkage, x86: Fix 32bit memcpy for LTO
asmlinkage Make __stack_chk_failed and memcmp visible
asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
asmlinkage: Make main_extable_sort_needed visible
asmlinkage, mutex: Mark __visible
asmlinkage: Make trace_hardirq visible
...
After the previous patches, an interrupt whose bit is set in the IRR
register will never be in the LAPIC's IRR and has never been injected
on the migration source. So inject it on the destination.
This fixes migration of Windows guests without HPET (they use the RTC
to trigger the scheduler tick, and lose it after migration).
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will reuse it to process a nonzero IRR that is passed to KVM_SET_IRQCHIP.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This ensures that IRR bits are set in the KVM_GET_IRQCHIP result only if
the interrupt is still sitting in the IOAPIC. After the next patches, it
avoids spurious reinjection of the interrupt when KVM_SET_IRQCHIP is
called.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commonize the handling of masking, which was absent for kvm_ioapic_set_irq.
Setting remote_irr does not need a separate function either, and merging
the two functions avoids confusion.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When registering a new irqfd, we call its ->poll method to collect any
event that might have previously been pending so that we can trigger it.
This is done under the kvm->irqfds.lock, which means the eventfd's ctx
lock is taken under it.
However, if we get a POLLHUP in irqfd_wakeup, we will be called with the
ctx lock held before getting the irqfds.lock to deactivate the irqfd,
causing lockdep to complain.
Calling the ->poll method does not really need the irqfds.lock, so let's
just move it after we've given up the irqfds.lock in kvm_irqfd_assign().
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both QEMU and KVM have already accumulated a significant number of
optimizations based on the hard-coded assumption that ioapic polarity
will always use the ActiveHigh convention, where the logical and
physical states of level-triggered irq lines always match (i.e.,
active(asserted) == high == 1, inactive == low == 0). QEMU guests
are expected to follow directions given via ACPI and configure the
ioapic with polarity 0 (ActiveHigh). However, even when misbehaving
guests (e.g. OS X <= 10.9) set the ioapic polarity to 1 (ActiveLow),
QEMU will still use the ActiveHigh signaling convention when
interfacing with KVM.
This patch modifies KVM to completely ignore ioapic polarity as set by
the guest OS, enabling misbehaving guests to work alongside those which
comply with the ActiveHigh polarity specified by QEMU's ACPI tables.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
[Move documentation to KVM_IRQ_LINE, add ia64. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VFIO now has support for using the IOMMU_CACHE flag and a mechanism
for an external user to test the current operating mode of the IOMMU.
Add support for this to the kvm-vfio pseudo device so that we only
register noncoherent DMA when necessary.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Use the arch specific function kvm_arch_vcpu_runnable() to add a further
criterium to identify a suitable vcpu to yield to during undirected yield
processing.
Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When this was introduced, kvm_flush_remote_tlbs() could be called
without holding mmu_lock. It is now acknowledged that the function
must be called before releasing mmu_lock, and all callers have already
been changed to do so.
There is no need to use smp_mb() and cmpxchg() any more.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This fixes the build breakage introduced by
c07a0191ef and adds support for the device
control API and save/restore of the VGIC state for ARMv8.
The defines were simply missing from the arm64 header files and
uaccess.h must be implicitly imported from somewhere else on arm.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_rebooting is referenced from assembler code, thus
needs to be visible.
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-1-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Commit KVM: async_pf: Provide additional direct page notification
missed the call from kvm_check_async_pf_completion to the new introduced function.
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On s390 we are not able to cancel work. Instead we will flush the work and wait for
completion.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
By setting a Kconfig option, the architecture can control when
guest notifications will be presented by the apf backend.
There is the default batch mechanism, working as before, where the vcpu
thread should pull in this information.
Opposite to this, there is now the direct mechanism, that will push the
information to the guest.
This way s390 can use an already existing architecture interface.
Still the vcpu thread should call check_completion to cleanup leftovers.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
If kvm_io_bus_register_dev() fails then it returns success but it should
return an error code.
I also did a little cleanup like removing an impossible NULL test.
Cc: stable@vger.kernel.org
Fixes: 2b3c246a68 ('KVM: Make coalesced mmio use a device per zone')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds a floating irq controller as a kvm_device.
It will be necessary for migration of floating interrupts as well
as for hardening the reset code by allowing user space to explicitly
remove all pending floating interrupts.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Nothing major here, just bugfixes all over the place. The most
interesting part is the ARM guys' virtualized interrupt controller
overhaul, which lets userspace get/set the state and thus enables
migration of ARM VMs.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJS3TVKAAoJEBvWZb6bTYbyIFgP/2cmt4ifCuFMaZv4+G1S8jZU
uC9ZB/+7vzht/p6zAy+4BxurKbHmSBFkC1OKcxYuy7yB4CQkHabzj4V2vRtqFdwH
5lExP9qh3kqaVLuhnvxLTmkktR3EW4PFy6OI53l5kRNktOXSuZ0aN6K3V7tCg/X0
iL7ASo4bJKlxeWcDpmuVrNgAajmZVfXrjKY7robgBQno+yIsgKhRZRBQHjozA6B8
FpCo/k48RZd/EzIbV/PDDRI4hmmry/lgrO9SKjzq56wSqff2bd/k/KYze4dbAPfd
Ps60enPTuHmeEjjb4MMMU4EKHVdTQFUMx/xZCmT4xzoh8s4of6RHphXbfE0SUznQ
dTveyEQAR7E3JNS0k1+3WEX5fWlFesp0hO2NeE0wzUq4TAr9ztgVO9NQ6Si15e7Z
2HysO0T5Ojtt0lY08/PvS6i48eCAuuBomrejJS8hLW4SUZ5adn+yW4Qo7Fp9JeBR
l9a3LsVT8BZMtUWrUuFcVhlM4MbzElUPjDbgWhR8UYU/kpfVZOQu8qWgGKR4UWXy
X7/t9l/tjR99CmfMJBAOzJid+ScSpAfg77BdaKiQrVfVIJmsjEjlO8vUMyj5b1HF
hPX5wNyJjHAOfridLeHSs4Rdm4a8sk8Az5d4h76pLVz8M4jyTi2v0rO3N4/dU/pu
x7N8KR5hAj+mLBoM9/Al
=8sYU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First round of KVM updates for 3.14; PPC parts will come next week.
Nothing major here, just bugfixes all over the place. The most
interesting part is the ARM guys' virtualized interrupt controller
overhaul, which lets userspace get/set the state and thus enables
migration of ARM VMs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
kvm: make KVM_MMU_AUDIT help text more readable
KVM: s390: Fix memory access error detection
KVM: nVMX: Update guest activity state field on L2 exits
KVM: nVMX: Fix nested_run_pending on activity state HLT
KVM: nVMX: Clean up handling of VMX-related MSRs
KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
KVM: nVMX: Leave VMX mode on clearing of feature control MSR
KVM: VMX: Fix DR6 update on #DB exception
KVM: SVM: Fix reading of DR6
KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
add support for Hyper-V reference time counter
KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
KVM: x86: fix tsc catchup issue with tsc scaling
KVM: x86: limit PIT timer frequency
KVM: x86: handle invalid root_hpa everywhere
kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
kvm: vfio: silence GCC warning
KVM: ARM: Remove duplicate include
arm/arm64: KVM: relax the requirements of VMA alignment for THP
...
Commit 7940876e13 ("kvm: make local
functions static") broke KVM PPC builds due to removing (rather than
moving) the stub version of kvm_vcpu_eligible_for_directed_yield().
This patch reintroduces it.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Alexander Graf <agraf@suse.de>
[Move the #ifdef inside the function. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Building vfio.o triggers a GCC warning (when building for 32 bits x86):
arch/x86/kvm/../../../virt/kvm/vfio.c: In function 'kvm_vfio_set_group':
arch/x86/kvm/../../../virt/kvm/vfio.c:104:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
void __user *argp = (void __user *)arg;
^
Silence this warning by casting arg to unsigned long.
argp's current type, "void __user *", is always casted to "int32_t
__user *". So its type might as well be changed to "int32_t __user *".
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function kvm_io_bus_read_cookie is defined but never used
in current in-tree code.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Running 'make namespacecheck' found lots of functions that
should be declared static, since only used in one file.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Implement support for the CPU interface register access driven by MMIO
address offsets from the CPU interface base address. Useful for user
space to support save/restore of the VGIC state.
This commit adds support only for the same logic as the current VGIC
support, and no more. For example, the active priority registers are
handled as RAZ/WI, just like setting priorities on the emulated
distributor.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Handle MMIO accesses to the two registers which should support both the
case where the VMs want to read/write either of these registers and the
case where user space reads/writes these registers to do save/restore of
the VGIC state.
Note that the added complexity compared to simple set/clear enable
registers stems from the bookkeping of source cpu ids. It may be
possible to change the underlying data structure to simplify the
complexity, but since this is not in the critical path at all, this will
do.
Also note that reading this register from a live guest will not be
accurate compared to on hardware, because some state may be living on
the CPU LRs and the only way to give a consistent read would be to force
stop all the VCPUs and request them to unqueu the LR state onto the
distributor. Until we have an actual user of live reading this
register, we can live with the difference.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
To properly access the VGIC state from user space it is very unpractical
to have to loop through all the LRs in all register access functions.
Instead, support moving all pending state from LRs to the distributor,
but leave active state LRs alone.
Note that to accurately present the active and pending state to VCPUs
reading these distributor registers from a live VM, we would have to
stop all other VPUs than the calling VCPU and ask each CPU to unqueue
their LR state onto the distributor and add fields to track active state
on the distributor side as well. We don't have any users of such
functionality yet and there are other inaccuracies of the GIC emulation,
so don't provide accurate synchronized access to this state just yet.
However, when the time comes, having this function should help.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Add infrastructure to handle distributor and cpu interface register
accesses through the KVM_{GET/SET}_DEVICE_ATTR interface by adding the
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS groups
and defining the semantics of the attr field to be the MMIO offset as
specified in the GICv2 specs.
Missing register accesses or other changes in individual register access
functions to support save/restore of the VGIC state is added in
subsequent patches.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Rename the vgic_ranges array to vgic_dist_ranges to be more specific and
to prepare for handling CPU interface register access as well (for
save/restore of VGIC state).
Pass offset from distributor or interface MMIO base to
find_matching_range function instead of the physical address of the
access in the VM memory map. This allows other callers unaware of the
VM specifics, but with generic VGIC knowledge to reuse the function.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Support setting the distributor and cpu interface base addresses in the
VM physical address space through the KVM_{SET,GET}_DEVICE_ATTR API
in addition to the ARM specific API.
This has the added benefit of being able to share more code in user
space and do things in a uniform manner.
Also deprecate the older API at the same time, but backwards
compatibility will be maintained.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Support creating the ARM VGIC device through the KVM_CREATE_DEVICE
ioctl, which can then later be leveraged to use the
KVM_{GET/SET}_DEVICE_ATTR, which is useful both for setting addresses in
a more generic API than the ARM-specific one and is useful for
save/restore of VGIC state.
Adds KVM_CAP_DEVICE_CTRL to ARM capabilities.
Note that we change the check for creating a VGIC from bailing out if
any VCPUs were created, to bailing out if any VCPUs were ever run. This
is an important distinction that shouldn't break anything, but allows
creating the VGIC after the VCPUs have been created.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Rework the VGIC initialization slightly to allow initialization of the
vgic cpu-specific state even if the irqchip (the VGIC) hasn't been
created by user space yet. This is safe, because the vgic data
structures are already allocated when the CPU is allocated if VGIC
support is compiled into the kernel. Further, the init process does not
depend on any other information and the sacrifice is a slight
performance degradation for creating VMs in the no-VGIC case.
The reason is that the new device control API doesn't mandate creating
the VGIC before creating the VCPU and it is unreasonable to require user
space to create the VGIC before creating the VCPUs.
At the same time move the irqchip_in_kernel check out of
kvm_vcpu_first_run_init and into the init function to make the per-vcpu
and global init functions symmetric and add comments on the exported
functions making it a bit easier to understand the init flow by only
looking at vgic.c.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
For migration to work we need to save (and later restore) the state of
each core's virtual generic timer.
Since this is per VCPU, we can use the [gs]et_one_reg ioctl and export
the three needed registers (control, counter, compare value).
Though they live in cp15 space, we don't use the existing list, since
they need special accessor functions and the arch timer is optional.
Acked-by: Marc Zynger <marc.zyngier@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Initialize the cntvoff at kvm_init_vm time, not before running the VCPUs
at the first time because that will overwrite any potentially restored
values from user space.
Cc: Andre Przywara <andre.przywara@linaro.org>
Acked-by: Marc Zynger <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Since the commit 15ad7146 ("KVM: Use the scheduler preemption notifiers
to make kvm preemptible"), the remaining stuff in this function is a
simple cond_resched() call with an extra need_resched() check which was
there to avoid dropping VCPUs unnecessarily. Now it is meaningless.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In multiple functions the vcpu_id is used as an offset into a bitfield. Ag
malicious user could specify a vcpu_id greater than 255 in order to set or
clear bits in kernel memory. This could be used to elevate priveges in the
kernel. This patch verifies that the vcpu_id provided is less than 255.
The api documentation already specifies that the vcpu_id must be less than
max_vcpus, but this is currently not checked.
Reported-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using the address of 'empty_zero_page' as source address in order to
clear a page is wrong. On some architectures empty_zero_page is only the
pointer to the struct page of the empty_zero_page. Therefore the clear
page operation would copy the contents of a couple of struct pages instead
of clearing a page. For kvm only arm/arm64 are affected by this bug.
To fix this use the ZERO_PAGE macro instead which will return the struct
page address of the empty_zero_page on all architectures.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a
few bugfixes. ARM got transparent huge page support, improved
overcommit, and support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes
some nVidia cards on Windows, that fail to start without these
patches and the corresponding userspace changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJShPAhAAoJEBvWZb6bTYbyl48P/297GgmELHAGBgjvb6q7yyGu
L8+eHjKbh4XBAkPwyzbvUjuww5z2hM0N3JQ0BDV9oeXlO+zwwCEns/sg2Q5/NJXq
XxnTeShaKnp9lqVBnE6G9rAOUWKoyLJ2wItlvUL8JlaO9xJ0Vmk0ta4n2Nv5GqDp
db6UD7vju6rHtIAhNpvvAO51kAOwc01xxRixCVb7KUYOnmO9nvpixzoI/S0Rp1gu
w/OWMfCosDzBoT+cOe79Yx1OKcpaVW94X6CH1s+ShCw3wcbCL2f13Ka8/E3FIcuq
vkZaLBxio7vjUAHRjPObw0XBW4InXEbhI1DjzIvm8dmc4VsgmtLQkTCG8fj+jINc
dlHQUq6Do+1F4zy6WMBUj8tNeP1Z9DsABp98rQwR8+BwHoQpGQBpAxW0TE0ZMngC
t1caqyvjZ5pPpFUxSrAV+8Kg4AvobXPYOim0vqV7Qea07KhFcBXLCfF7BWdwq/Jc
0CAOlsLL4mHGIQWZJuVGw0YGP7oATDCyewlBuDObx+szYCoV4fQGZVBEL0KwJx/1
7lrLN7JWzRyw6xTgJ5VVwgYE1tUY4IFQcHu7/5N+dw8/xg9KWA3f4PeMavIKSf+R
qteewbtmQsxUnvuQIBHLs8NRWPnBPy+F3Sc2ckeOLIe4pmfTte6shtTXcLDL+LqH
NTmT/cfmYp2BRkiCfCiS
=rWNf
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Paolo Bonzini:
"Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a few
bugfixes.
ARM got transparent huge page support, improved overcommit, and
support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes some
nVidia cards on Windows, that fail to start without these patches and
the corresponding userspace changes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
kvm, vmx: Fix lazy FPU on nested guest
arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
arm/arm64: KVM: MMIO support for BE guest
kvm, cpuid: Fix sparse warning
kvm: Delete prototype for non-existent function kvm_check_iopl
kvm: Delete prototype for non-existent function complete_pio
hung_task: add method to reset detector
pvclock: detect watchdog reset at pvclock read
kvm: optimize out smp_mb after srcu_read_unlock
srcu: API for barrier after srcu read unlock
KVM: remove vm mmap method
KVM: IOMMU: hva align mapping page size
KVM: x86: trace cpuid emulation when called from emulator
KVM: emulator: cleanup decode_register_operand() a bit
KVM: emulator: check rex prefix inside decode_register()
KVM: x86: fix emulation of "movzbl %bpl, %eax"
kvm_host: typo fix
KVM: x86: emulate SAHF instruction
MAINTAINERS: add tree for kvm.git
Documentation/kvm: add a 00-INDEX file
...
It was used in conjunction with KVM_SET_MEMORY_REGION ioctl which was
removed by b74a07beed in 2010, QEMU stopped using it in 2008, so
it is time to remove the code finally.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
When determining the page size we could use to map with the IOMMU, the
page size should also be aligned with the hva, not just the gfn. The
gfn may not reflect the real alignment within the hugetlbfs file.
Most of the time, this works fine. However, if the hugetlbfs file is
backed by non-contiguous huge pages, a multi-huge page memslot starts at
an unaligned offset within the hugetlbfs file, and the gfn is aligned
with respect to the huge page size, kvm_host_page_size() will return the
huge page size and we will use that to map with the IOMMU.
When we later unpin that same memslot, the IOMMU returns the unmap size
as the huge page size, and we happily unpin that many pfns in
monotonically increasing order, not realizing we are spanning
non-contiguous huge pages and partially unpin the wrong huge page.
Ensure the IOMMU mapping page size is aligned with the hva corresponding
to the gfn, which does reflect the alignment within the hugetlbfs file.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Cc: stable@vger.kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
We currently use some ad-hoc arch variables tied to legacy KVM device
assignment to manage emulation of instructions that depend on whether
non-coherent DMA is present. Create an interface for this, adapting
legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
For now we assume that non-coherent DMA is possible any time we have a
VFIO group. Eventually an interface can be developed as part of the
VFIO external user interface to query the coherency of a group.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Default to operating in coherent mode. This simplifies the logic when
we switch to a model of registering and unregistering noncoherent I/O
with KVM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
So far we've succeeded at making KVM and VFIO mostly unaware of each
other, but areas are cropping up where a connection beyond eventfds
and irqfds needs to be made. This patch introduces a KVM-VFIO device
that is meant to be a gateway for such interaction. The user creates
the device and can add and remove VFIO groups to it via file
descriptors. When a group is added, KVM verifies the group is valid
and gets a reference to it via the VFIO external user interface.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
I don't know if this was due to cut and paste, or somebody was really
using a D20 to pick the error code for kvm_init_debugfs as suggested by
Linus (EFAULT is 14, so the possibility cannot be entirely ruled out).
In any case, this patch fixes it.
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In kvm_iommu_map_pages(), we need to know the page size via call
kvm_host_page_size(). And it will check whether the target slot
is valid before return the right page size.
Currently, we will map the iommu pages when creating a new slot.
But we call kvm_iommu_map_pages() during preparing the new slot.
At that time, the new slot is not visible by domain(still in preparing).
So we cannot get the right page size from kvm_host_page_size() and
this will break the IOMMU super page logic.
The solution is to map the iommu pages after we insert the new slot
into domain.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Tested-by: Patrick Lu <patrick.lu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will use that in the later patch to find the kvm ops handler
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Page pinning is not mandatory in kvm async page fault processing since
after async page fault event is delivered to a guest it accesses page once
again and does its own GUP. Drop the FOLL_GET flag in GUP in async_pf
code, and do some simplifying in check/clear processing.
Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Gu zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
When KVM (de)assigns PCI(e) devices to VMs, a debug message is printed
including the BDF notation of the respective device. Currently, the BDF
notation does not have the commonly used leading zeros. This produces
messages like "assign device 0:1:8.0", which look strange at first sight.
The patch fixes this by exchanging the printk(KERN_DEBUG ...) with dev_info()
and also inserts "kvm" into the debug message, so that it is obvious where
the message comes from. Also reduces LoC.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andre Richter <andre.o.richter@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
gfn_to_memslot() can return NULL or invalid slot. We need to check slot
validity before accessing it.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
In commit e935b8372c ("KVM: Convert kvm_lock to raw_spinlock"),
the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
function tries to grab the (non-raw) mmu_lock within the scope of
the raw locked kvm_lock being held. This leads to the following:
BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]
Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
Call Trace:
[<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
[<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
[<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
[<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
[<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
[<ffffffff8111824a>] balance_pgdat+0x54a/0x730
[<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
[<ffffffff811185bf>] kswapd+0x18f/0x490
[<ffffffff81070961>] ? get_parent_ip+0x11/0x50
[<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
[<ffffffff81118430>] ? balance_pgdat+0x730/0x730
[<ffffffff81060d2b>] kthread+0xdb/0xe0
[<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
[<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
[<ffffffff81060c50>] ? __init_kthread_worker+0x
After the previous patch, kvm_lock need not be a raw spinlock anymore,
so change it back.
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VM list need not be protected by a raw spinlock. Separate the
two so that kvm_lock can be made non-raw.
Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the useless argument, and do not do anything if there are no
VMs running at the time of the hotplug.
Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'.done' is used to mark the completion of 'async_pf_execute()', but
'cancel_work_sync()' returns true when the work was canceled, so we
use it instead.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When we cancel 'async_pf_execute()', we should behave as if the work was
never scheduled in 'kvm_setup_async_pf()'.
Fixes a bug when we can't unload module because the vm wasn't destroyed.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Page tables in a read-only memory slot will currently cause a triple
fault because the page walker uses gfn_to_hva and it fails on such a slot.
OVMF uses such a page table; however, real hardware seems to be fine with
that as long as the accessed/dirty bits are set. Save whether the slot
is readonly, and later check it when updating the accessed and dirty bits.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull vfs pile 1 from Al Viro:
"Unfortunately, this merge window it'll have a be a lot of small piles -
my fault, actually, for not keeping #for-next in anything that would
resemble a sane shape ;-/
This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
components) + several long-standing patches from various folks.
There definitely will be a lot more (starting with Miklos'
check_submount_and_drop() series)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
direct-io: Handle O_(D)SYNC AIO
direct-io: Implement generic deferred AIO completions
add formats for dentry/file pathnames
kvm eventfd: switch to fdget
powerpc kvm: use fdget
switch fchmod() to fdget
switch epoll_ctl() to fdget
switch copy_module_from_fd() to fdget
git simplify nilfs check for busy subtree
ibmasmfs: don't bother passing superblock when not needed
don't pass superblock to hypfs_{mkdir,create*}
don't pass superblock to hypfs_diag_create_files
don't pass superblock to hypfs_vm_create_files()
oprofile: get rid of pointless forward declarations of struct super_block
oprofilefs_create_...() do not need superblock argument
oprofilefs_mkdir() doesn't need superblock argument
don't bother with passing superblock to oprofile_create_stats_files()
oprofile: don't bother with passing superblock to ->create_files()
don't bother passing sb to oprofile_create_files()
coh901318: don't open-code simple_read_from_buffer()
...
For bytemaps each IRQ field is 1 byte wide, so we pack 4 irq fields in
one word and since there are 32 private (per cpu) irqs, we have 8
private u32 fields on the vgic_bytemap struct. We shift the offset from
the base of the register group right by 2, giving us the word index
instead of the field index. But then there are 8 private words, not 4,
which is also why we subtract 8 words from the offset of the shared
words.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
All the code in handle_mmio_cfg_reg() assumes the offset has
been shifted right to accomodate for the 2:1 bit compression,
but this is only done when getting the register address.
Shift the offset early so the code works mostly unchanged.
Reported-by: Zhaobo (Bob, ERC) <zhaobo@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
vgic_get_target_reg is quite complicated, for no good reason.
Actually, it is fairly easy to write it in a much more efficient
way by using the target CPU array instead of the bitmap.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
This is the type-safe comparison function, so the double-underscore is
not related.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The checks on PG_reserved in the page structure on head and tail pages
aren't necessary because split_huge_page wouldn't transfer the
PG_reserved bit from head to tail anyway.
This was a forward-thinking check done in the case PageReserved was
set by a driver-owned page mapped in userland with something like
remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not
possible right now). It was meant to be very safe, but it's overkill
as it's unlikely split_huge_page could ever run without the driver
noticing and tearing down the hugepage itself.
And if a driver in the future will really want to map a reserved
hugepage in userland using an huge pmd it should simply take care of
marking all subpages reserved too to keep KVM safe. This of course
would require such a hypothetical driver to tear down the huge pmd
itself and splitting the hugepage itself, instead of relaying on
split_huge_page, but that sounds very reasonable, especially
considering split_huge_page wouldn't currently transfer the reserved
bit anyway.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
KVM uses anon_inode_get() to allocate file descriptors as part
of some of its ioctls. But those ioctls are lacking a flag argument
allowing userspace to choose options for the newly opened file descriptor.
In such case it's advised to use O_CLOEXEC by default so that
userspace is allowed to choose, without race, if the file descriptor
is going to be inherited across exec().
This patch set O_CLOEXEC flag on all file descriptors created
with anon_inode_getfd() to not leak file descriptors across exec().
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
kvm_io_bus_sort_cmp is used also directly, not just as a callback for
sort and bsearch. In these cases, it is handy to have a type-safe
variant. This patch introduces such a variant, __kvm_io_bus_sort_cmp,
and uses it throughout kvm_main.c.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is called right after the memslots is updated, i.e. when the result
of update_memslots() gets installed in install_new_memslots(). Since
the memslots needs to be updated twice when we delete or move a memslot,
kvm_arch_commit_memory_region() does not correspond to this exactly.
In the following patch, x86 will use this new API to check if the mmio
generation has reached its maximum value, in which case mmio sptes need
to be flushed out.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Acked-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add new functions kvm_io_bus_{read,write}_cookie() that allows users of
the kvm io infrastructure to use a cookie value to speed up lookup of a
device on an io bus.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
On the x86 side, there are some optimizations and documentation updates.
The big ARM/KVM change for 3.11, support for AArch64, will come through
Catalin Marinas's tree. s390 and PPC have misc cleanups and bugfixes.
There is a conflict due to "s390/pgtable: fix ipte notify bit" having
entered 3.10 through Martin Schwidefsky's s390 tree. This pull request
has additional changes on top, so this tree's version is the correct one.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJR0oU6AAoJEBvWZb6bTYbynnsP/RSUrrHrA8Wu1tqVfAKu+1y5
6OIihqZ9x11/YMaNofAfv86jqxFu0/j7CzMGphNdjzujqKI+Q1tGe7oiVCmKzoG+
UvSctWsz0lpllgBtnnrm5tcfmG6rrddhLtpA7m320+xCVx8KV5P4VfyHZEU+Ho8h
ziPmb2mAQ65gBNX6nLHEJ3ITTgad6gt4NNbrKIYpyXuWZQJypzaRqT/vpc4md+Ed
dCebMXsL1xgyb98EcnOdrWH1wV30MfucR7IpObOhXnnMKeeltqAQPvaOlKzZh4dK
+QfxJfdRZVS0cepcxzx1Q2X3dgjoKQsHq1nlIyz3qu1vhtfaqBlixLZk0SguZ/R9
1S1YqucZiLRO57RD4q0Ak5oxwobu18ZoqJZ6nledNdWwDe8bz/W2wGAeVty19ky0
qstBdM9jnwXrc0qrVgZp3+s5dsx3NAm/KKZBoq4sXiDLd/yBzdEdWIVkIrU3X9wU
3X26wOmBxtsB7so/JR7ciTsQHelmLicnVeXohAEP9CjIJffB81xVXnXs0P0SYuiQ
RzbSCwjPzET4JBOaHWT0Dhv0DTS/EaI97KzlN32US3Bn3WiLlS1oDCoPFoaLqd2K
LxQMsXS8anAWxFvexfSuUpbJGPnKSidSQoQmJeMGBa9QhmZCht3IL16/Fb641ToN
xBohzi49L9FDbpOnTYfz
=1zpG
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"On the x86 side, there are some optimizations and documentation
updates. The big ARM/KVM change for 3.11, support for AArch64, will
come through Catalin Marinas's tree. s390 and PPC have misc cleanups
and bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
KVM: PPC: Ignore PIR writes
KVM: PPC: Book3S PR: Invalidate SLB entries properly
KVM: PPC: Book3S PR: Allow guest to use 1TB segments
KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
KVM: PPC: Book3S PR: Fix proto-VSID calculations
KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
KVM: Fix RTC interrupt coalescing tracking
kvm: Add a tracepoint write_tsc_offset
KVM: MMU: Inform users of mmio generation wraparound
KVM: MMU: document fast invalidate all mmio sptes
KVM: MMU: document fast invalidate all pages
KVM: MMU: document fast page fault
KVM: MMU: document mmio page fault
KVM: MMU: document write_flooding_count
KVM: MMU: document clear_spte_count
KVM: MMU: drop kvm_mmu_zap_mmio_sptes
KVM: MMU: init kvm generation close to mmio wrap-around value
KVM: MMU: add tracepoint for check_mmio_spte
KVM: MMU: fast invalidate all mmio sptes
...
KVM/ARM pull request for 3.11 merge window
* tag 'kvm-arm-3.11' of git://git.linaro.org/people/cdall/linux-kvm-arm.git:
ARM: kvm: don't include drivers/virtio/Kconfig
Update MAINTAINERS: KVM/ARM work now funded by Linaro
arm/kvm: Cleanup KVM_ARM_MAX_VCPUS logic
ARM: KVM: clear exclusive monitor on all exception returns
ARM: KVM: add missing dsb before invalidating Stage-2 TLBs
ARM: KVM: perform save/restore of PAR
ARM: KVM: get rid of S2_PGD_SIZE
ARM: KVM: don't special case PC when doing an MMIO
ARM: KVM: use phys_addr_t instead of unsigned long long for HYP PGDs
ARM: KVM: remove dead prototype for __kvm_tlb_flush_vmid
ARM: KVM: Don't handle PSCI calls via SMC
ARM: KVM: Allow host virt timer irq to be different from guest timer virt irq
This reverts most of the f1ed0450a5. After
the commit kvm_apic_set_irq() no longer returns accurate information
about interrupt injection status if injection is done into disabled
APIC. RTC interrupt coalescing tracking relies on the information to be
accurate and cannot recover if it is not.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The arch_timer irq numbers (or PPI numbers) are implementation dependent,
so the host virtual timer irq number can be different from guest virtual
timer irq number.
This patch ensures that host virtual timer irq number is read from DTB and
guest virtual timer irq is determined based on vcpu target type.
Signed-off-by: Anup Patel <anup.patel@linaro.org>
Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Christoffer Dall <cdall@cs.columbia.edu>
We can easily reach the 1000 limit by start VM with a couple
hundred I/O devices (multifunction=on). The hardcode limit
already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).
In userspace, we already have maximum file descriptor to
limit ioeventfd count. But kvm_io_bus devices also are used
for pit, pic, ioapic, coalesced_mmio. They couldn't be limited
by maximum file descriptor.
Currently only ioeventfds take too much kvm_io_bus devices,
so just exclude it from counting kvm_io_range limit.
Also fixed one indent issue in kvm_host.h
Signed-off-by: Amos Kong <akong@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
As KVM/arm64 is looming on the horizon, it makes sense to move some
of the common code to a single location in order to reduce duplication.
The code could live anywhere. Actually, most of KVM is already built
with a bunch of ugly ../../.. hacks in the various Makefiles, so we're
not exactly talking about style here. But maybe it is time to start
moving into a less ugly direction.
The include files must be in a "public" location, as they are accessed
from non-KVM files (arch/arm/kernel/asm-offsets.c).
For this purpose, introduce two new locations:
- virt/kvm/arm/ : x86 and ia64 already share the ioapic code in
virt/kvm, so this could be seen as a (very ugly) precedent.
- include/kvm/ : there is already an include/xen, and while the
intent is slightly different, this seems as good a location as
any
Eventually, we should probably have independant Makefiles at every
levels (just like everywhere else in the kernel), but this is just
the first step.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Since the arrival of posted interrupt support we can no longer guarantee
that coalesced IRQs are always reported to the IRQ source. Moreover,
accumulated APIC timer events could cause a busy loop when a VCPU should
rather be halted. The consensus is to remove coalesced tracking from the
LAPIC.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Add the missing misc_deregister() before return from kvm_init()
in the debugfs init error handling case.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Pull kvm fixes from Gleb Natapov:
"Most of the fixes are in the emulator since now we emulate more than
we did before for correctness sake we see more bugs there, but there
is also an OOPS fixed and corruption of xcr0 register."
* tag 'kvm-3.10-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: emulator: emulate SALC
KVM: emulator: emulate XLAT
KVM: emulator: emulate AAM
KVM: VMX: fix halt emulation while emulating invalid guest sate
KVM: Fix kvm_irqfd_init initialization
KVM: x86: fix maintenance of guest/host xcr0 state
Pull MIPS updates from Ralf Baechle:
- More work on DT support for various platforms
- Various fixes that were to late to make it straight into 3.9
- Improved platform support, in particular the Netlogic XLR and
BCM63xx, and the SEAD3 and Malta eval boards.
- Support for several Ralink SOC families.
- Complete support for the microMIPS ASE which basically reencodes the
existing MIPS32/MIPS64 ISA to use non-constant size instructions.
- Some fallout from LTO work which remove old cruft and will generally
make the MIPS kernel easier to maintain and resistant to compiler
optimization, even in absence of LTO.
- KVM support. While MIPS has announced hardware virtualization
extensions this KVM extension uses trap and emulate mode for
virtualization of MIPS32. More KVM work to add support for VZ
hardware virtualizaiton extensions and MIPS64 will probably already
be merged for 3.11.
Most of this has been sitting in -next for a long time. All defconfigs
have been build or run time tested except three for which fixes are being
sent by other maintainers.
Semantic conflict with kvm updates done as per Ralf
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (118 commits)
MIPS: Add new GIC clockevent driver.
MIPS: Formatting clean-ups for clocksources.
MIPS: Refactor GIC clocksource code.
MIPS: Move 'gic_frequency' to common location.
MIPS: Move 'gic_present' to common location.
MIPS: MIPS16e: Add unaligned access support.
MIPS: MIPS16e: Support handling of delay slots.
MIPS: MIPS16e: Add instruction formats.
MIPS: microMIPS: Optimise 'strnlen' core library function.
MIPS: microMIPS: Optimise 'strlen' core library function.
MIPS: microMIPS: Optimise 'strncpy' core library function.
MIPS: microMIPS: Optimise 'memset' core library function.
MIPS: microMIPS: Add configuration option for microMIPS kernel.
MIPS: microMIPS: Disable LL/SC and fix linker bug.
MIPS: microMIPS: Add vdso support.
MIPS: microMIPS: Add unaligned access support.
MIPS: microMIPS: Support handling of delay slots.
MIPS: microMIPS: Add support for exception handling.
MIPS: microMIPS: Floating point support.
MIPS: microMIPS: Fix macro naming in micro-assembler.
...
Pull kvm updates from Gleb Natapov:
"Highlights of the updates are:
general:
- new emulated device API
- legacy device assignment is now optional
- irqfd interface is more generic and can be shared between arches
x86:
- VMCS shadow support and other nested VMX improvements
- APIC virtualization and Posted Interrupt hardware support
- Optimize mmio spte zapping
ppc:
- BookE: in-kernel MPIC emulation with irqfd support
- Book3S: in-kernel XICS emulation (incomplete)
- Book3S: HV: migration fixes
- BookE: more debug support preparation
- BookE: e6500 support
ARM:
- reworking of Hyp idmaps
s390:
- ioeventfd for virtio-ccw
And many other bug fixes, cleanups and improvements"
* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
kvm: Add compat_ioctl for device control API
KVM: x86: Account for failing enable_irq_window for NMI window request
KVM: PPC: Book3S: Add API for in-kernel XICS emulation
kvm/ppc/mpic: fix missing unlock in set_base_addr()
kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
kvm/ppc/mpic: remove users
kvm/ppc/mpic: fix mmio region lists when multiple guests used
kvm/ppc/mpic: remove default routes from documentation
kvm: KVM_CAP_IOMMU only available with device assignment
ARM: KVM: iterate over all CPUs for CPU compatibility check
KVM: ARM: Fix spelling in error message
ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
KVM: ARM: Fix API documentation for ONE_REG encoding
ARM: KVM: promote vfp_host pointer to generic host cpu context
ARM: KVM: add architecture specific hook for capabilities
ARM: KVM: perform HYP initilization for hotplugged CPUs
ARM: KVM: switch to a dual-step HYP init code
ARM: KVM: rework HYP page table freeing
ARM: KVM: enforce maximum size for identity mapped code
ARM: KVM: move to a KVM provided HYP idmap
...
This API shouldn't have 32/64-bit issues, but VFS assumes it does
unless told otherwise.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
This adds the API for userspace to instantiate an XICS device in a VM
and connect VCPUs to it. The API consists of a new device type for
the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
which is used to assert and deassert interrupt inputs of the XICS.
The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
Each attribute within this group corresponds to the state of one
interrupt source. The attribute number is the same as the interrupt
source number.
This does not support irq routing or irqfd yet.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
The hassle of getting refcounting right was greater than the hassle
of keeping a list of devices to destroy on VM exit.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Hook the MPIC code up to the KVM interfaces, add locking, etc.
Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: add stub function for kvmppc_mpic_set_epr, non-booke, 64bit]
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it. If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.
This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device. Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc. It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.
Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, without the need for
separately managing enumerated capabilities.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Now that we have most irqfd code completely platform agnostic, let's move
irqfd's resample capability return to generic code as well.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Setting up IRQ routes is nothing IOAPIC specific. Extract everything
that really is generic code into irqchip.c and only leave the ioapic
specific bits to irq_comm.c.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
The current irq_comm.c file contains pieces of code that are generic
across different irqchip implementations, as well as code that is
fully IOAPIC specific.
Split the generic bits out into irqchip.c.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
The IRQ routing set ioctl lives in the hacky device assignment code inside
of KVM today. This is definitely the wrong place for it. Move it to the much
more natural kvm_main.c.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Quite a bit of code in KVM has been conditionalized on availability of
IOAPIC emulation. However, most of it is generically applicable to
platforms that don't have an IOPIC, but a different type of irq chip.
Make code that only relies on IRQ routing, not an APIC itself, on
CONFIG_HAVE_KVM_IRQ_ROUTING, so that we can reuse it later.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
The concept of routing interrupt lines to an irqchip is nothing
that is IOAPIC specific. Every irqchip has a maximum number of pins
that can be linked to irq lines.
So let's add a new define that allows us to reuse generic code for
non-IOAPIC platforms.
Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Only deliver the posted interrupt when target vcpu is running
and there is no previous interrupt pending in pir.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We already know the trigger mode of a given interrupt when programming
the ioapice entry. So it's not necessary to set it in each interrupt
delivery.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Both TMR and EOI exit bitmap need to be updated when ioapic changed
or vcpu's id/ldr/dfr changed. So use common function instead eoi exit
bitmap specific function.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Current interrupt coalescing logci which only used by RTC has conflict
with Posted Interrupt.
This patch introduces a new mechinism to use eoi to track interrupt:
When delivering an interrupt to vcpu, the pending_eoi set to number of
vcpu that received the interrupt. And decrease it when each vcpu writing
eoi. No subsequent RTC interrupt can deliver to vcpu until all vcpus
write eoi.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Userspace may deliver RTC interrupt without query the status. So we
want to track RTC EOI for this case.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Need the EOI to track interrupt deliver status, so force vmexit
on EOI for rtc interrupt when enabling virtual interrupt delivery.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Add a new parameter to know vcpus who received the interrupt.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
rtc_status is used to track RTC interrupt delivery status. The pending_eoi
will be increased by vcpu who received RTC interrupt and will be decreased
when EOI to this interrupt.
Also, we use dest_map to record the destination vcpu to avoid the case that
vcpu who didn't get the RTC interupt, but issued EOI with same vector of RTC
and descreased pending_eoi by mistake.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Add vcpu info to ioapic_update_eoi, so we can know which vcpu
issued this EOI.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The routine kvm_spurious_fault() is an x86 specific routine, so
move it from virt/kvm/kvm_main.c to arch/x86/kvm/x86.c.
Fixes this sparse warning when building on arm64:
virt/kvm/kvm_main.c⚠️ symbol 'kvm_spurious_fault' was not declared. Should it be static?
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The routines get_user_page_nowait(), kvm_io_bus_sort_cmp(), kvm_io_bus_insert_dev()
and kvm_io_bus_get_first_dev() are only referenced within kvm_main.c, so give them
static linkage.
Fixes sparse warnings like these:
virt/kvm/kvm_main.c: warning: symbol 'get_user_page_nowait' was not declared. Should it be static?
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
PIO and MMIO are separate address spaces, but
ioeventfd registration code mistakenly detected
two eventfds as duplicate if they use the same address,
even if one is PIO and another one MMIO.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
For a given vcpu, kvm_apic_match_dest() will tell you whether
the vcpu in the destination list quickly. Drop kvm_calculate_eoi_exitmap()
and use kvm_apic_match_dest() instead.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
This patch adds support for kvm_gfn_to_hva_cache_init functions for
reads and writes that will cross a page. If the range falls within
the same memslot, then this will be a fast operation. If the range
is split between two memslots, then the slower kvm_read_guest and
kvm_write_guest are used.
Tested: Test against kvm_clock unit tests.
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Merge reason:
From: Alexander Graf <agraf@suse.de>
"Just recently this really important patch got pulled into Linus' tree for 3.9:
commit 1674400aae
Author: Anton Blanchard <anton <at> samba.org>
Date: Tue Mar 12 01:51:51 2013 +0000
Without that commit, I can not boot my G5, thus I can't run automated tests on it against my queue.
Could you please merge kvm/next against linus/master, so that I can base my trees against that?"
* upstream/master: (653 commits)
PCI: Use ROM images from firmware only if no other ROM source available
sparc: remove unused "config BITS"
sparc: delete "if !ULTRA_HAS_POPULATION_COUNT"
KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798)
KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797)
KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796)
arm64: Kconfig.debug: Remove unused CONFIG_DEBUG_ERRORS
arm64: Do not select GENERIC_HARDIRQS_NO_DEPRECATED
inet: limit length of fragment queue hash table bucket lists
qeth: Fix scatter-gather regression
qeth: Fix invalid router settings handling
qeth: delay feature trace
sgy-cts1000: Remove __dev* attributes
KVM: x86: fix deadlock in clock-in-progress request handling
KVM: allow host header to be included even for !CONFIG_KVM
hwmon: (lm75) Fix tcn75 prefix
hwmon: (lm75.h) Update header inclusion
MAINTAINERS: Remove Mark M. Hoffman
xfs: ensure we capture IO errors correctly
xfs: fix xfs_iomap_eof_prealloc_initial_size type
...
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the guest specifies a IOAPIC_REG_SELECT with an invalid value and follows
that with a read of the IOAPIC_REG_WINDOW KVM does not properly validate
that request. ioapic_read_indirect contains an
ASSERT(redir_index < IOAPIC_NUM_PINS), but the ASSERT has no effect in
non-debug builds. In recent kernels this allows a guest to cause a kernel
oops by reading invalid memory. In older kernels (pre-3.3) this allows a
guest to read from large ranges of host memory.
Tested: tested against apic unit tests.
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This helps in filtering out the eligible candidates further and
thus potentially helps in quickly allowing preempted lockholders to run.
Note that if a vcpu was spinning during preemption we filter them
by checking whether they are preempted due to pause loop exit.
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Note that we mark as preempted only when vcpu's task state was
Running during preemption.
Thanks Jiannan, Avi for preemption notifier ideas. Thanks Gleb, PeterZ
for their precious suggestions. Thanks Srikar for an idea on avoiding
rcu lock while checking task state that improved overcommit numbers.
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Enhance KVM_IOEVENTFD with a new flag that allows to attach to virtio-ccw
devices on s390 via the KVM_VIRTIO_CCW_NOTIFY_BUS.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently, eventfd introduces module_init/module_exit functions
to initialize/cleanup the irqfd workqueue. This only works, however,
if no other module_init/module_exit functions are built into the
same module.
Let's just move the initialization and cleanup to kvm_init and kvm_exit.
This way, it is also clearer where kvm startup may fail.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch makes the parameter old a const pointer to the old memory
slot and adds a new parameter named change to know the change being
requested: the former is for removing extra copying and the latter is
for cleaning up the code.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch drops the parameter old, a copy of the old memory slot, and
adds a new parameter named change to know the change being requested.
This not only cleans up the code but also removes extra copying of the
memory slot structure.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This will be used for cleaning up prepare/commit_memory_region() later.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Except ia64's stale code, KVM_SET_MEMORY_REGION support, this is only
used for sanity checks in __kvm_set_memory_region() which can easily
be changed to use slot id instead.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
X86 does not use this any more. The remaining user, s390's !user_alloc
check, can be simply removed since KVM_SET_MEMORY_REGION ioctl is no
longer supported.
Note: fixed powerpc's indentations with spaces to suppress checkpatch
errors.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This field was needed to differentiate memory slots created by the new
API, KVM_SET_USER_MEMORY_REGION, from those by the old equivalent,
KVM_SET_MEMORY_REGION, whose support was dropped long before:
commit b74a07beed
KVM: Remove kernel-allocated memory regions
Although we also have private memory slots to which KVM allocates
memory with vm_mmap(), !user_alloc slots in other words, the slot id
should be enough for differentiating them.
Note: corresponding function parameters will be removed later.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
As Xiao pointed out, there are a few problems with it:
- kvm_arch_commit_memory_region() write protects the memory slot only
for GET_DIRTY_LOG when modifying the flags.
- FNAME(sync_page) uses the old spte value to set a new one without
checking KVM_MEM_READONLY flag.
Since we flush all shadow pages when creating a new slot, the simplest
fix is to disallow such problematic flag changes: this is safe because
no one is doing such things.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify
what kind of change is being requested by checking the arguments. The
current code does this checking at various points in code and each
condition being used there is not easy to understand at first glance.
This patch consolidates these checks and introduces an enum to name the
possible changes to clean up the code.
Although this does not introduce any functional changes, there is one
change which optimizes the code a bit: if we have nothing to change, the
new code returns 0 immediately.
Note that the return value for this case cannot be changed since QEMU
relies on it: we noticed this when we changed it to -EINVAL and got a
section mismatch error at the final stage of live migration.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
yield_to returns -ESRCH, When source and target of yield_to
run queue length is one. When we see three successive failures of
yield_to we assume we are in potential undercommit case and abort
from PLE handler.
The assumption is backed by low probability of wrong decision
for even worst case scenarios such as average runqueue length
between 1 and 2.
More detail on rationale behind using three tries:
if p is the probability of finding rq length one on a particular cpu,
and if we do n tries, then probability of exiting ple handler is:
p^(n+1) [ because we would have come across one source with rq length
1 and n target cpu rqs with length 1 ]
so
num tries: probability of aborting ple handler (1.5x overcommit)
1 1/4
2 1/8
3 1/16
We can increase this probability with more tries, but the problem is
the overhead.
Also, If we have tried three times that means we would have iterated
over 3 good eligible vcpus along with many non-eligible candidates. In
worst case if we iterate all the vcpus, we reduce 1x performance and
overcommit performance get hit.
note that we do not update last boosted vcpu in failure cases.
Thank Avi for raising question on aborting after first fail from yield_to.
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
manually, which is fully taken care of by the hardware. This needs
some special awareness into existing interrupr injection path:
- for pending interrupt, instead of direct injection, we may need
update architecture specific indicators before resuming to guest.
- A pending interrupt, which is masked by ISR, should be also
considered in above update action, since hardware will decide
when to inject it at right time. Current has_interrupt and
get_interrupt only returns a valid vector from injection p.o.v.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
We've been ignoring read-only mappings and programming everything
into the iommu as read-write. Fix this to only include the write
access flag when read-only is not set.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Memory slot flags can be altered without changing other parameters of
the slot. The read-only attribute is the only one the IOMMU cares
about, so generate an un-map, re-map when this occurs. This also
avoid unnecessarily re-mapping the slot when no IOMMU visible changes
are made.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
One such variable, slot, is enough for holding a pointer temporarily.
We also remove another local variable named slot, which is limited in
a block, since it is confusing to have the same name in this function.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Don't need the check for deleting an existing slot or just modifiying
the flags.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
This makes the separation between the sanity checks and the rest of the
code a bit clearer.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Calling kvm_mmu_slot_remove_write_access() for a deleted slot does
nothing but search for non-existent mmu pages which have mappings to
that deleted memory; this is safe but a waste of time.
Since we want to make the function rmap based in a later patch, in a
manner which makes it unsafe to be called for a deleted slot, we makes
the caller see if the slot is non-zero and being dirty logged.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Move repetitive code sequence to a separate function.
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Previous patch "kvm: Minor memory slot optimization" (b7f69c555c)
overlooked the generation field of the memory slots. Re-using the
original memory slots left us with with two slightly different memory
slots with the same generation. To fix this, make update_memslots()
take a new parameter to specify the last generation. This also makes
generation management more explicit to avoid such problems in the future.
Reported-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
This hack is wrong. The pin number of PIT is connected to
2 not 0. This means this hack never takes effect. So it is ok
to remove it.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
We're currently offering a whopping 32 memory slots to user space, an
int is a bit excessive for storing this. We would like to increase
our memslots, but SHRT_MAX should be more than enough.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM. One is
the user accessible slots and the other is user + private. Make this
more obvious.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If a slot is removed or moved in the guest physical address space, we
first allocate and install a new slot array with the invalidated
entry. The old array is then freed. We then proceed to allocate yet
another slot array to install the permanent replacement. Re-use the
original array when this occurs and avoid the extra kfree/kmalloc.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The iommu integration into memory slots expects memory slots to be
added or removed and doesn't handle the move case. We can unmap
slots from the iommu after we mark them invalid and map them before
installing the final memslot array. Also re-order the kmemdup vs
map so we don't leave iommu mappings if we get ENOMEM.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The API documents that only flags and guest physical memory space can
be modified on an existing slot, but we don't enforce that the
userspace address cannot be modified. Instead we just ignore it.
This means that a user may think they've successfully moved both the
guest and user addresses, when in fact only the guest address changed.
Check and error instead.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The API documentation states:
When changing an existing slot, it may be moved in the guest
physical memory space, or its flags may be modified.
An "existing slot" requires a non-zero npages (memory_size). The only
transition we should therefore allow for a non-existing slot should be
to create the slot, which includes setting a non-zero memory_size. We
currently allow calls to modify non-existing slots, which is pointless,
confusing, and possibly wrong.
With this we know that the invalidation path of __kvm_set_memory_region
is always for a delete or move and never for adding a zero size slot.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Typo for the next pointer means we're walking random data here.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The current eventfd code assumes that when we have eventfd, we also have
irqfd for in-kernel interrupt delivery. This is not necessarily true. On
PPC we don't have an in-kernel irqchip yet, but we can still support easily
support eventfd.
Signed-off-by: Alexander Graf <agraf@suse.de>
We can deliver certain interrupts, notably MSI,
from atomic context. Use kvm_set_irq_inatomic,
to implement an irq handler for msi.
This reduces the pressure on scheduler in case
where host and guest irq share a host cpu.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Add an API to inject IRQ from atomic context.
Return EWOULDBLOCK if impossible (e.g. for multicast).
Only MSI is supported ATM.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Prior to memory slot sorting this loop compared all of the user memory
slots for overlap with new entries. With memory slot sorting, we're
just checking some number of entries in the array that may or may not
be user slots. Instead, walk all the slots with kvm_for_each_memslot,
which has the added benefit of terminating early when we hit the first
empty slot, and skip comparison to private slots.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM added a global variable to guarantee monotonicity in the guest.
One of the reasons for that is that the time between
1. ktime_get_ts(×pec);
2. rdtscll(tsc);
Is variable. That is, given a host with stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.
The time required to execute 2. is not the same on those two instances
executing in different VCPUS (cache misses, interrupts...).
If the TSC value that is used by the host to interpolate when
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.
Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs.
Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
No need to check return value before breaking switch.
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We should avoid kfree()ing error pointer in kvm_vcpu_ioctl() and
kvm_arch_vcpu_ioctl().
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch filters noslot pfn out from error pfns based on Marcelo comment:
noslot pfn is not a error pfn
After this patch,
- is_noslot_pfn indicates that the gfn is not in slot
- is_error_pfn indicates that the gfn is in slot but the error is occurred
when translate the gfn to pfn
- is_error_noslot_pfn indicates that the pfn either it is error pfns or it
is noslot pfn
And is_invalid_pfn can be removed, it makes the code more clean
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Merge reason: development work has dependency on kvm patches merged
upstream.
Conflicts:
arch/powerpc/include/asm/Kbuild
arch/powerpc/include/asm/kvm_para.h
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We can not directly call kvm_release_pfn_clean to release the pfn
since we can meet noslot pfn which is used to cache mmio info into
spte
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Change existing kernel error message to include return value from
iommu_attach_device() when it fails. This will help debug device
assignment failures more effectively.
Signed-off-by: Shuah Khan <shuah.khan@hp.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Now that we have defined generic set_bit_le() we do not need to use
test_and_set_bit_le() for atomically setting a bit.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h
WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+
R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3
eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e
StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z
VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO
VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M
TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE
rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO
B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj
0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7
lb/nbAsXf9UJLgGir4I1
=kN6V
-----END PGP SIGNATURE-----
Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
"Highlights of the changes for this release include support for vfio
level triggered interrupts, improved big real mode support on older
Intels, a streamlines guest page table walker, guest APIC speedups,
PIO optimizations, better overcommit handling, and read-only memory."
* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
KVM: s390: Fix vcpu_load handling in interrupt code
KVM: x86: Fix guest debug across vcpu INIT reset
KVM: Add resampling irqfds for level triggered interrupts
KVM: optimize apic interrupt delivery
KVM: MMU: Eliminate pointless temporary 'ac'
KVM: MMU: Avoid access/dirty update loop if all is well
KVM: MMU: Eliminate eperm temporary
KVM: MMU: Optimize is_last_gpte()
KVM: MMU: Simplify walk_addr_generic() loop
KVM: MMU: Optimize pte permission checks
KVM: MMU: Update accessed and dirty bits after guest pagetable walk
KVM: MMU: Move gpte_access() out of paging_tmpl.h
KVM: MMU: Optimize gpte_access() slightly
KVM: MMU: Push clean gpte write protection out of gpte_access()
KVM: clarify kvmclock documentation
KVM: make processes waiting on vcpu mutex killable
KVM: SVM: Make use of asm.h
KVM: VMX: Make use of asm.h
KVM: VMX: Make lto-friendly
KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
...
Conflicts:
arch/s390/include/asm/processor.h
arch/x86/kvm/i8259.c
Pull workqueue changes from Tejun Heo:
"This is workqueue updates for v3.7-rc1. A lot of activities this
round including considerable API and behavior cleanups.
* delayed_work combines a timer and a work item. The handling of the
timer part has always been a bit clunky leading to confusing
cancelation API with weird corner-case behaviors. delayed_work is
updated to use new IRQ safe timer and cancelation now works as
expected.
* Another deficiency of delayed_work was lack of the counterpart of
mod_timer() which led to cancel+queue combinations or open-coded
timer+work usages. mod_delayed_work[_on]() are added.
These two delayed_work changes make delayed_work provide interface
and behave like timer which is executed with process context.
* A work item could be executed concurrently on multiple CPUs, which
is rather unintuitive and made flush_work() behavior confusing and
half-broken under certain circumstances. This problem doesn't
exist for non-reentrant workqueues. While non-reentrancy check
isn't free, the overhead is incurred only when a work item bounces
across different CPUs and even in simulated pathological scenario
the overhead isn't too high.
All workqueues are made non-reentrant. This removes the
distinction between flush_[delayed_]work() and
flush_[delayed_]_work_sync(). The former is now as strong as the
latter and the specified work item is guaranteed to have finished
execution of any previous queueing on return.
* In addition to the various bug fixes, Lai redid and simplified CPU
hotplug handling significantly.
* Joonsoo introduced system_highpri_wq and used it during CPU
hotplug.
There are two merge commits - one to pull in IRQ safe timer from
tip/timers/core and the other to pull in CPU hotplug fixes from
wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."
Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.
Tejun pointed out a few of them, I fixed a couple more.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
workqueue: remove @delayed from cwq_dec_nr_in_flight()
workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
workqueue: use __cpuinit instead of __devinit for cpu callbacks
workqueue: rename manager_mutex to assoc_mutex
workqueue: WORKER_REBIND is no longer necessary for idle rebinding
workqueue: WORKER_REBIND is no longer necessary for busy rebinding
workqueue: reimplement idle worker rebinding
workqueue: deprecate __cancel_delayed_work()
workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
workqueue: use mod_delayed_work() instead of __cancel + queue
workqueue: use irqsafe timer for delayed_work
workqueue: clean up delayed_work initializers and add missing one
workqueue: make deferrable delayed_work initializer names consistent
workqueue: cosmetic whitespace updates for macro definitions
workqueue: deprecate system_nrt[_freezable]_wq
workqueue: deprecate flush[_delayed]_work_sync()
...
To emulate level triggered interrupts, add a resample option to
KVM_IRQFD. When specified, a new resamplefd is provided that notifies
the user when the irqchip has been resampled by the VM. This may, for
instance, indicate an EOI. Also in this mode, posting of an interrupt
through an irqfd only asserts the interrupt. On resampling, the
interrupt is automatically de-asserted prior to user notification.
This enables level triggered interrupts to be posted and re-enabled
from vfio with no userspace intervention.
All resampling irqfds can make use of a single irq source ID, so we
reserve a new one for this interface.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Most interrupt are delivered to only one vcpu. Use pre-build tables to
find interrupt destination instead of looping through all vcpus. In case
of logical mode loop only through vcpus in a logical cluster irq is sent
to.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
vcpu mutex can be held for unlimited time so
taking it with mutex_lock on an ioctl is wrong:
one process could be passed a vcpu fd and
call this ioctl on the vcpu used by another process,
it will then be unkillable until the owner exits.
Call mutex_lock_killable instead and return status.
Note: mutex_lock_interruptible would be even nicer,
but I am not sure all users are prepared to handle EINTR
from these ioctls. They might misinterpret it as an error.
Cleanup paths expect a vcpu that can't be used by
any userspace so this will always succeed - catch bugs
by calling BUG_ON.
Catch callers that don't check return state by adding
__must_check.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Other arches do not need this.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
v2: fix incorrect deletion of mmio sptes on gpa move (noticed by Takuya)
Signed-off-by: Avi Kivity <avi@redhat.com>
PPC must flush all translations before the new memory slot
is visible.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introducing kvm_arch_flush_shadow_memslot, to invalidate the
translations of a single memory slot.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The build error was caused by that builtin functions are calling
the functions implemented in modules. This error was introduced by
commit 4d8b81abc4 ("KVM: introduce readonly memslot").
The patch fixes the build error by moving function __gfn_to_hva_memslot()
from kvm_main.c to kvm_host.h and making that "inline" so that the
builtin function (kvmppc_h_enter) can use that.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM_SET_SIGNAL_MASK passed a NULL argument leaves the on stack signal
sets uninitialized. It then passes them through to
kvm_vcpu_ioctl_set_sigmask.
We should be passing a NULL in this case not translated garbage.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In current code, if we map a readonly memory space from host to guest
and the page is not currently mapped in the host, we will get a fault
pfn and async is not allowed, then the vm will crash
We introduce readonly memory region to map ROM/ROMD to the guest, read access
is happy for readonly memslot, write access on readonly memslot will cause
KVM_EXIT_MMIO exit
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In current code, we always map writable pfn for the read fault, in order
to support readonly memslot, we map writable pfn only if 'writable'
is not NULL
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We do too many things in hva_to_pfn, this patch reorganize the code,
let it be better readable
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This set of functions is only used to read data from host space, in the
later patch, we will only get a readonly hva in gfn_to_hva_read, and
the function name is a good hint to let gfn_to_hva_read to pair with
kvm_read_hva()/kvm_read_hva_atomic()
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Check flags when memslot is registered from userspace as Avi's suggestion
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
flush[_delayed]_work_sync() are now spurious. Mark them deprecated
and convert all users to flush[_delayed]_work().
If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant and the regular flushes guarantee that the work item is
not pending or running on any CPU on return, so there's no reason to
use the sync flushes at all and they're going away.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mattia Dongili <malattia@linux.it>
Cc: Kent Yoder <key@linux.vnet.ibm.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Bryan Wu <bryan.wu@canonical.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: Sangbeom Kim <sbkim73@samsung.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Avi Kivity <avi@redhat.com>
We validate irq pin number when routing is setup, so
code handling illegal irq # in pic and ioapic on each injection
is never called.
Drop it, replace with BUG_ON to catch out of bounds access bugs.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
After commit a2766325cf, the error page is replaced by the
error code, it need not be released anymore
[ The patch has been compiling tested for powerpc ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
After commit a2766325cf, the error pfn is replaced by the
error code, it need not be released anymore
[ The patch has been compiling tested for powerpc ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It is used to eliminate the overload of function call and cleanup
the code
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
These functions are exported and can not inline, move them
to kvm_host.h to eliminate the overload of function call
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Then, get_hwpoison_pfn and is_hwpoison_pfn can be removed
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
After that, the exported and un-inline function, get_fault_pfn,
can be removed
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
There are two bugs:
- the 'error page' is forgot to be released
[ it is unneeded after commit a2766325cf, for backport, we
still do kvm_release_pfn_clean for the error pfn ]
- guest pages are always released regardless of the unmapped page
(e,g, caused by hwpoison)
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Two reasons:
- x86 can integrate rmap and rmap_pde and remove heuristics in
__gfn_to_rmap().
- Some architectures do not need rmap.
Since rmap is one of the most memory consuming stuff in KVM, ppc'd
better restrict the allocation to Book3S HV.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Handle KVM_IRQ_LINE and KVM_IRQ_LINE_STATUS in the generic
kvm_vm_ioctl() function and call into kvm_vm_ioctl_irq_line().
This is even more relevant when KVM/ARM also uses this ioctl.
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently, kvm allocates some pages and use them as error indicators,
it wastes memory and is not good for scalability
Base on Avi's suggestion, we use the error codes instead of these pages
to indicate the error conditions
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In kvm_async_pf_wakeup_all, it uses bad_page to generate broadcast wakeup,
and uses put_page to release bad_page, the work depends on the fact that
bad_page is the normal page. But we will use the error code instead of
bad_page, so use kvm_release_page_clean to release the page which will
release the error code properly
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Merge patches queued during the run-up to the merge window.
* queue: (25 commits)
KVM: Choose better candidate for directed yield
KVM: Note down when cpu relax intercepted or pause loop exited
KVM: Add config to support ple or cpu relax optimzation
KVM: switch to symbolic name for irq_states size
KVM: x86: Fix typos in pmu.c
KVM: x86: Fix typos in lapic.c
KVM: x86: Fix typos in cpuid.c
KVM: x86: Fix typos in emulate.c
KVM: x86: Fix typos in x86.c
KVM: SVM: Fix typos
KVM: VMX: Fix typos
KVM: remove the unused parameter of gfn_to_pfn_memslot
KVM: remove is_error_hpa
KVM: make bad_pfn static to kvm_main.c
KVM: using get_fault_pfn to get the fault pfn
KVM: MMU: track the refcount when unmap the page
KVM: x86: remove unnecessary mark_page_dirty
KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
KVM: MMU: Add memslot parameter to hva handlers
...
Signed-off-by: Avi Kivity <avi@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQDRDNAAoJEI7yEDeUysxlkl8P/3C2AHx2webOU8sVzhfU6ONZ
ZoGevwBjyZIeJEmiWVpFTTEew1l0PXtpyOocXGNUXIddVnhXTQOKr/Scj4uFbmx8
ROqgK8NSX9+xOGrBPCoN7SlJkmp+m6uYtwYkl2SGnsEVLWMKkc7J7oqmszCcTQvN
UXMf7G47/Ul2NUSBdv4Yvizhl4kpvWxluiweDw3E/hIQKN0uyP7CY58qcAztw8nG
csZBAnnuPFwIAWxHXW3eBBv4UP138HbNDqJ/dujjocM6GnOxmXJmcZ6b57gh+Y64
3+w9IR4qrRWnsErb/I8inKLJ1Jdcf7yV2FmxYqR4pIXay2Yzo1BsvFd6EB+JavUv
pJpixrFiDDFoQyXlh4tGpsjpqdXNMLqyG4YpqzSZ46C8naVv9gKE7SXqlXnjyDlb
Llx3hb9Fop8O5ykYEGHi+gIISAK5eETiQl4yw9RUBDpxydH4qJtqGIbLiDy8y9wi
Xyi8PBlNl+biJFsK805lxURqTp/SJTC3+Zb7A7CzYEQm5xZw3W/CKZx1ZYBfpaa/
pWaP6tB7JwgLIVXi4HQayLWqMVwH0soZIn9yazpOEFv6qO8d5QH5RAxAW2VXE3n5
JDlrajar/lGIdiBVWfwTJLb86gv3QDZtIWoR9mZuLKeKWE/6PRLe7HQpG1pJovsm
2AsN5bS0BWq+aqPpZHa5
=pECD
-----END PGP SIGNATURE-----
Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
"Highlights include
- full big real mode emulation on pre-Westmere Intel hosts (can be
disabled with emulate_invalid_guest_state=0)
- relatively small ppc and s390 updates
- PCID/INVPCID support in guests
- EOI avoidance; 3.6 guests should perform better on 3.6 hosts on
interrupt intensive workloads)
- Lockless write faults during live migration
- EPT accessed/dirty bits support for new Intel processors"
Fix up conflicts in:
- Documentation/virtual/kvm/api.txt:
Stupid subchapter numbering, added next to each other.
- arch/powerpc/kvm/booke_interrupts.S:
PPC asm changes clashing with the KVM fixes
- arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c:
Duplicated commits through the kvm tree and the s390 tree, with
subsequent edits in the KVM tree.
* tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
KVM: fix race with level interrupts
x86, hyper: fix build with !CONFIG_KVM_GUEST
Revert "apic: fix kvm build on UP without IOAPIC"
KVM guest: switch to apic_set_eoi_write, apic_write
apic: add apic_set_eoi_write for PV use
KVM: VMX: Implement PCID/INVPCID for guests with EPT
KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
KVM: PPC: Critical interrupt emulation support
KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
KVM: PPC: bookehv64: Add support for std/ld emulation.
booke: Added crit/mc exception handler for e500v2
booke/bookehv: Add host crit-watchdog exception support
KVM: MMU: document mmu-lock and fast page fault
KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
KVM: MMU: trace fast page fault
KVM: MMU: fast path of handling guest page fault
KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
KVM: MMU: fold tlb flush judgement into mmu_spte_update
...
Currently, on a large vcpu guests, there is a high probability of
yielding to the same vcpu who had recently done a pause-loop exit or
cpu relax intercepted. Such a yield can lead to the vcpu spinning
again and hence degrade the performance.
The patchset keeps track of the pause loop exit/cpu relax interception
and gives chance to a vcpu which:
(a) Has not done pause loop exit or cpu relax intercepted at all
(probably he is preempted lock-holder)
(b) Was skipped in last iteration because it did pause loop exit or
cpu relax intercepted, and probably has become eligible now
(next eligible lock holder)
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x
Signed-off-by: Avi Kivity <avi@redhat.com>
Noting pause loop exited vcpu or cpu relax intercepted helps in
filtering right candidate to yield. Wrong selection of vcpu;
i.e., a vcpu that just did a pl-exit or cpu relax intercepted may
contribute to performance degradation.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x
Signed-off-by: Avi Kivity <avi@redhat.com>
Suggested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x
Signed-off-by: Avi Kivity <avi@redhat.com>
Use PIC_NUM_PINS instead of hard-coded 16 for pic pins.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When more than 1 source id is in use for the same GSI, we have the
following race related to handling irq_states race:
CPU 0 clears bit 0. CPU 0 read irq_state as 0. CPU 1 sets level to 1.
CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
Now ioapic thinks the level is 0 but irq_state is not 0.
Fix by performing all irq_states bitmap handling under pic/ioapic lock.
This also removes the need for atomics with irq_states handling.
Reported-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The parameter, 'kvm', is not used in gfn_to_pfn_memslot, we can happily remove
it
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
bad_pfn is not used out of kvm_main.c, so mark it static, also move it near
hwpoison_pfn and fault_pfn
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Using get_fault_pfn to cleanup the code
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When we tested KVM under memory pressure, with THP enabled on the host,
we noticed that MMU notifier took a long time to invalidate huge pages.
Since the invalidation was done with mmu_lock held, it not only wasted
the CPU but also made the host harder to respond.
This patch mitigates this by using kvm_handle_hva_range().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The kernel no longer allows us to pass NULL for the hard handler
without also specifying IRQF_ONESHOT. IRQF_ONESHOT imposes latency
in the exit path that we don't need for MSI interrupts. Long term
we'd like to inject these interrupts from the hard handler when
possible. In the short term, we can create dummy hard handlers
that return us to the previous behavior. Credit to Michael for
original patch.
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=43328
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If last_boosted_vcpu == 0, then we fall through all test cases and
may end up with all VCPUs pouncing on vcpu 0. With a large enough
guest, this can result in enormous runqueue lock contention, which
can prevent vcpu0 from running, leading to a livelock.
Changing < to <= makes sure we properly handle that case.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Prune this down to just the struct kvm_irqfd so we can avoid
changing function definition for every flag or field we use.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The KVM code sometimes uses CONFIG_HAVE_KVM_IRQCHIP to protect
code that is related to IRQ routing, which not all in-kernel
irqchips may support.
Use KVM_CAP_IRQ_ROUTING instead.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The masking was wrong (must have been 0x7f), and there is no need to
re-read the value as pci_setup_device already does this for us.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=43339
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_set_irq() has an internal buffer of three irq routing entries, allowing
connecting a GSI to three IRQ chips or on MSI. However setup_routing_entry()
does not properly enforce this, allowing three irqchip routes followed by
an MSI route to overflow the buffer.
Fix by ensuring that an MSI entry is added to an empty list.
Signed-off-by: Avi Kivity <avi@redhat.com>
lpage_info is created for each large level even when the memory slot is
not for RAM. This means that when we add one slot for a PCI device, we
end up allocating at least KVM_NR_PAGE_SIZES - 1 pages by vmalloc().
To make things worse, there is an increasing number of devices which
would result in more pages being wasted this way.
This patch mitigates this problem by using kvm_kvzalloc().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Will be used for lpage_info allocation later.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements the directed yield hypercall found on other
System z hypervisors. It delegates execution time to the virtual cpu
specified in the instruction's parameter.
Useful to avoid long spinlock waits in the guest.
Christian Borntraeger: moved common code in virt/kvm/
Signed-off-by: Konstantin Weitz <WEITZKON@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently, MSI messages can only be injected to in-kernel irqchips by
defining a corresponding IRQ route for each message. This is not only
unhandy if the MSI messages are generated "on the fly" by user space,
IRQ routes are a limited resource that user space has to manage
carefully.
By providing a direct injection path, we can both avoid using up limited
resources and simplify the necessary steps for user land.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Merge reason: development work has dependency on kvm patches merged
upstream.
Conflicts:
Documentation/feature-removal-schedule.txt
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
As pointed out by Jason Baron, when assigning a device to a guest
we first set the iommu domain pointer, which enables mapping
and unmapping of memory slots to the iommu. This leaves a window
where this path is enabled, but we haven't synchronized the iommu
mappings to the existing memory slots. Thus a slot being removed
at that point could send us down unexpected code paths removing
non-existent pinnings and iommu mappings. Take the slots_lock
around creating the iommu domain and initial mappings as well as
around iommu teardown to avoid this race.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Intel spec says that TMR needs to be set/cleared
when IRR is set, but kvm also clears it on EOI.
I did some tests on a real (AMD based) system,
and I see same TMR values both before
and after EOI, so I think it's a minor bug in kvm.
This patch fixes TMR to be set/cleared on IRR set
only as per spec.
And now that we don't clear TMR, we can save
an atomic read of TMR on EOI that's not propagated
to ioapic, by checking whether ioapic needs
a specific vector first and calculating
the mode afterwards.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We've been adding new mappings, but not destroying old mappings.
This can lead to a page leak as pages are pinned using
get_user_pages, but only unpinned with put_page if they still
exist in the memslots list on vm shutdown. A memslot that is
destroyed while an iommu domain is enabled for the guest will
therefore result in an elevated page reference count that is
never cleared.
Additionally, without this fix, the iommu is only programmed
with the first translation for a gpa. This can result in
peer-to-peer errors if a mapping is destroyed and replaced by a
new mapping at the same gpa as the iommu will still be pointing
to the original, pinned memory address.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Now that we do neither double buffering nor heuristic selection of the
write protection method these are not needed anymore.
Note: some drivers have their own implementation of set_bit_le() and
making it generic needs a bit of work; so we use test_and_set_bit_le()
and will later replace it with generic set_bit_le().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
S390's kvm_vcpu_stat does not contain halt_wakeup member.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The kvm_vcpu_kick function performs roughly the same funcitonality on
most all architectures, so we shouldn't have separate copies.
PowerPC keeps a pointer to interchanging waitqueues on the vcpu_arch
structure and to accomodate this special need a
__KVM_HAVE_ARCH_VCPU_GET_WQ define and accompanying function
kvm_arch_vcpu_wq have been defined. For all other architectures this
is a generic inline that just returns &vcpu->wq;
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch makes the kvm_io_range array can be resized dynamically.
Signed-off-by: Amos Kong <akong@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
As kvm_notify_acked_irq calls kvm_assigned_dev_ack_irq under
rcu_read_lock, we cannot use a mutex in the latter function. Switch to a
spin lock to address this.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Using 'int' type is not suitable for a 'long' object. So, correct it.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share legacy IRQs of such devices with other host devices
when passing them to a guest.
The new IRQ sharing feature introduced here is optional, user space has
to request it explicitly. Moreover, user space can inform us about its
view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
interrupt and signaling it if the guest masked it via the virtualized
PCI config space.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If some vcpus are created before KVM_CREATE_IRQCHIP, then
irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading
to potential NULL pointer dereferences.
Fix by:
- ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called
- ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP
This is somewhat long winded because vcpu->arch.apic is created without
kvm->lock held.
Based on earlier patch by Michael Ellerman.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
Other threads may process the same page in that small window and skip
TLB flush and then return before these functions do flush.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Some members of kvm_memory_slot are not used by every architecture.
This patch is the first step to make this difference clear by
introducing kvm_memory_slot::arch; lpage_info is moved into it.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Narrow down the controlled text inside the conditional so that it will
include lpage_info and rmap stuff only.
For this we change the way we check whether the slot is being created
from "if (npages && !new.rmap)" to "if (npages && !old.npages)".
We also stop checking if lpage_info is NULL when we create lpage_info
because we do it from inside the slot creation code block.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This makes it easy to make lpage_info architecture specific.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch cleans up the code and removes the "(void)level;" warning
suppressor.
Note that we can also use this for PT_PAGE_TABLE_LEVEL to treat every
level uniformly later.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This moves __gfn_to_memslot() and search_memslots() from kvm_main.c to
kvm_host.h to reduce the code duplication caused by the need for
non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c to call
gfn_to_memslot() in real mode.
Rather than putting gfn_to_memslot() itself in a header, which would
lead to increased code size, this puts __gfn_to_memslot() in a header.
Then, the non-modular uses of gfn_to_memslot() are changed to call
__gfn_to_memslot() instead. This way there is only one place in the
source code that needs to be changed should the gfn_to_memslot()
implementation need to be modified.
On powerpc, the Book3S HV style of KVM has code that is called from
real mode which needs to call gfn_to_memslot() and thus needs this.
(Module code is allocated in the vmalloc region, which can't be
accessed in real mode.)
With this, we can remove builtin_gfn_to_memslot() from book3s_hv_rm_mmu.c.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
find_index_from_host_irq returns 0 on error
but callers assume < 0 on error. This should
not matter much: an out of range irq should never happen since
irq handler was registered with this irq #,
and even if it does we get a spurious msix irq in guest
and typically nothing terrible happens.
Still, better to make it consistent.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This adds an smp_wmb in kvm_mmu_notifier_invalidate_range_end() and an
smp_rmb in mmu_notifier_retry() so that mmu_notifier_retry() will give
the correct answer when called without kvm->mmu_lock being held.
PowerPC Book3S HV KVM wants to use a bitlock per guest page rather than
a single global spinlock in order to improve the scalability of updates
to the guest MMU hashed page table, and so needs this.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch exports the s390 SIE hardware control block to userspace
via the mapping of the vcpu file descriptor. In order to do so,
a new arch callback named kvm_arch_vcpu_fault is introduced for all
architectures. It allows to map architecture specific pages.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch introduces a new config option for user controlled kernel
virtual machines. It introduces a parameter to KVM_CREATE_VM that
allows to set bits that alter the capabilities of the newly created
virtual machine.
The parameter is passed to kvm_arch_init_vm for all architectures.
The only valid modifier bit for now is KVM_VM_S390_UCONTROL.
This requires CAP_SYS_ADMIN privileges and creates a user controlled
virtual machine on s390 architectures.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It is possible that the __set_bit() in mark_page_dirty() is called
simultaneously on the same region of memory, which may result in only
one bit being set, because some callers do not take mmu_lock before
mark_page_dirty().
This problem is hard to produce because when we reach mark_page_dirty()
beginning from, e.g., tdp_page_fault(), mmu_lock is being held during
__direct_map(): making kvm-unit-tests' dirty log api test write to two
pages concurrently was not useful for this reason.
So we have confirmed that there can actually be race condition by
checking if some callers really reach there without holding mmu_lock
using spin_is_locked(): probably they were from kvm_write_guest_page().
To fix this race, this patch changes the bit operation to the atomic
version: note that nr_dirty_pages also suffers from the race but we do
not need exactly correct numbers for now.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.
It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
by checking the return value from kvm_init_debug, we
can ensure that the entries under debugfs for KVM have
been created correctly.
Signed-off-by: Yang Bai <hamo.by@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Drop bsp_vcpu pointer from kvm struct since its only use is incorrect
anyway.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Switch to using memdup_user when possible. This makes code more
smaller and compact, and prevents errors.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Switch to kmemdup() in two places to shorten the code and avoid possible bugs.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This fixes byte accesses to IOAPIC_REG_SELECT as mandated by at least the
ICH10 and Intel Series 5 chipset specs. It also makes ioapic_mmio_write
consistent with ioapic_mmio_read, which also allows byte and word accesses.
Signed-off-by: Julian Stecklina <js@alien8.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
The operation of getting dirty log is frequent when framebuffer-based
displays are used(for example, Xwindow), so, we introduce a mapping table
to speed up id_to_memslot()
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Sort memslots base on its size and use line search to find it, so that the
larger memslots have better fit
The idea is from Avi
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce id_to_memslot to get memslot by slot id
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce kvm_for_each_memslot to walk all valid memslot
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce update_memslots to update slot which will be update to
kvm->memslots
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Needed for the next patch which uses this number to decide how to write
protect a slot.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use kmemdup rather than duplicating its implementation
The semantic patch that makes this change is available
in scripts/coccinelle/api/memdup.cocci.
More information about semantic patching is available at
http://coccinelle.lip6.fr/
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
My testing version of Smatch complains that addr and len come from
the user and they can wrap. The path is:
-> kvm_vm_ioctl()
-> kvm_vm_ioctl_unregister_coalesced_mmio()
-> coalesced_mmio_in_range()
I don't know what the implications are of wrapping here, but we may
as well fix it, if only to silence the warning.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Only allow KVM device assignment to attach to devices which:
- Are not bridges
- Have BAR resources (assume others are special devices)
- The user has permissions to use
Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers. We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain. We determine "permission to use" by testing whether the user
has access to the PCI sysfs resource files. By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.
[Yang Bai: add missing #include]
[avi: fix comment style]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yang Bai <hamo.by@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This option has no users and it exposes a security hole that we
can allow devices to be assigned without iommu protection. Make
KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When mapping a memory region, split it to page sizes as supported
by the iommu hardware. Always prefer bigger pages, when possible,
in order to reduce the TLB pressure.
The logic to do that is now added to the IOMMU core, so neither the iommu
drivers themselves nor users of the IOMMU API have to duplicate it.
This allows a more lenient granularity of mappings; traditionally the
IOMMU API took 'order' (of a page) as a mapping size, and directly let
the low level iommu drivers handle the mapping, but now that the IOMMU
core can split arbitrary memory regions into pages, we can remove this
limitation, so users don't have to split those regions by themselves.
Currently the supported page sizes are advertised once and they then
remain static. That works well for OMAP and MSM but it would probably
not fly well with intel's hardware, where the page size capabilities
seem to have the potential to be different between several DMA
remapping devices.
register_iommu() currently sets a default pgsize behavior, so we can convert
the IOMMU drivers in subsequent patches. After all the drivers
are converted, the temporary default settings will be removed.
Mainline users of the IOMMU API (kvm and omap-iovmm) are adopted
to deal with bytes instead of page order.
Many thanks to Joerg Roedel <Joerg.Roedel@amd.com> for significant review!
Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
Cc: David Brown <davidb@codeaurora.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <Joerg.Roedel@amd.com>
Cc: Stepan Moskovchenko <stepanm@codeaurora.org>
Cc: KyongHo Cho <pullip.cho@samsung.com>
Cc: Hiroshi DOYU <hdoyu@nvidia.com>
Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This file has things like module_param_named() and MODULE_PARM_DESC()
so it needs the full module.h header present. Without it, you'll get:
CC arch/x86/kvm/../../../virt/kvm/iommu.o
virt/kvm/iommu.c:37: error: expected ‘)’ before ‘bool’
virt/kvm/iommu.c:39: error: expected ‘)’ before string constant
make[3]: *** [arch/x86/kvm/../../../virt/kvm/iommu.o] Error 1
make[2]: *** [arch/x86/kvm] Error 2
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This was coming in via an implicit module.h (and its sub-includes)
before, but we'll be cleaning that up shortly. Call out the stat.h
include requirement in advance.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
iommu/core: Remove global iommu_ops and register_iommu
iommu/msm: Use bus_set_iommu instead of register_iommu
iommu/omap: Use bus_set_iommu instead of register_iommu
iommu/vt-d: Use bus_set_iommu instead of register_iommu
iommu/amd: Use bus_set_iommu instead of register_iommu
iommu/core: Use bus->iommu_ops in the iommu-api
iommu/core: Convert iommu_found to iommu_present
iommu/core: Add bus_type parameter to iommu_domain_alloc
Driver core: Add iommu_ops to bus_type
iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
iommu/amd: Fix wrong shift direction
iommu/omap: always provide iommu debug code
iommu/core: let drivers know if an iommu fault handler isn't installed
iommu/core: export iommu_set_fault_handler()
iommu/omap: Fix build error with !IOMMU_SUPPORT
iommu/omap: Migrate to the generic fault report mechanism
iommu/core: Add fault reporting mechanism
iommu/core: Use PAGE_SIZE instead of hard-coded value
iommu/core: use the existing IS_ALIGNED macro
iommu/msm: ->unmap() should return order of unmapped page
...
Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to
dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config
options" just happened to touch lines next to each other.
* 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (75 commits)
KVM: SVM: Keep intercepting task switching with NPT enabled
KVM: s390: implement sigp external call
KVM: s390: fix register setting
KVM: s390: fix return value of kvm_arch_init_vm
KVM: s390: check cpu_id prior to using it
KVM: emulate lapic tsc deadline timer for guest
x86: TSC deadline definitions
KVM: Fix simultaneous NMIs
KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode
KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode
KVM: x86 emulator: streamline decode of segment registers
KVM: x86 emulator: simplify OpMem64 decode
KVM: x86 emulator: switch src decode to decode_operand()
KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack
KVM: x86 emulator: switch OpImmUByte decode to decode_imm()
KVM: x86 emulator: free up some flag bits near src, dst
KVM: x86 emulator: switch src2 to generic decode_operand()
KVM: x86 emulator: expand decode flags to 64 bits
KVM: x86 emulator: split dst decode to a generic decode_operand()
KVM: x86 emulator: move memop, memopp into emulation context
...
With per-bus iommu_ops the iommu_found function needs to
work on a bus_type too. This patch adds a bus_type parameter
to that function and converts all call-places.
The function is also renamed to iommu_present because the
function now checks if an iommu is present for a given bus
and does not check for a global iommu anymore.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This is necessary to store a pointer to the bus-specific
iommu_ops in the iommu-domain structure. It will be used
later to call into bus-specific iommu-ops.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The threaded IRQ handler for MSI-X has almost nothing in common with the
INTx/MSI handler. Move its code into a dedicated handler.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We only perform work in kvm_assigned_dev_ack_irq if the guest IRQ is of
INTx type. This completely avoids the callback invocation in non-INTx
cases by registering the IRQ ack notifier only for INTx.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently the method of dealing with an IO operation on a bus (PIO/MMIO)
is to call the read or write callback for each device registered
on the bus until we find a device which handles it.
Since the number of devices on a bus can be significant due to ioeventfds
and coalesced MMIO zones, this leads to a lot of overhead on each IO
operation.
Instead of registering devices, we now register ranges which points to
a device. Lookup is done using an efficient bsearch instead of a linear
search.
Performance test was conducted by comparing exit count per second with
200 ioeventfds created on one byte and the guest is trying to access a
different byte continuously (triggering usermode exits).
Before the patch the guest has achieved 259k exits per second, after the
patch the guest does 274k exits per second.
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch changes coalesced mmio to create one mmio device per
zone instead of handling all zones in one device.
Doing so enables us to take advantage of existing locking and prevents
a race condition between coalesced mmio registration/unregistration
and lookups.
Suggested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Move the check whether there are available entries to within the spinlock.
This allows working with larger amount of VCPUs and reduces premature
exits when using a large number of VCPUs.
Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Device drivers that create and destroy SR-IOV virtual functions via
calls to pci_enable_sriov() and pci_disable_sriov can cause catastrophic
failures if they attempt to destroy VFs while they are assigned to
guest virtual machines. By adding a flag for use by the KVM module
to indicate that a device is assigned a device driver can check that
flag and avoid destroying VFs while they are assigned and avoid system
failures.
CC: Ian Campbell <ijc@hellion.org.uk>
CC: Konrad Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
IOMMU interrupt remapping support provides a further layer of
isolation for device assignment by preventing arbitrary interrupt
block DMA writes by a malicious guest from reaching the host. By
default, we should require that the platform provides interrupt
remapping support, with an opt-in mechanism for existing behavior.
Both AMD IOMMU and Intel VT-d2 hardware support interrupt
remapping, however we currently only have software support on
the Intel side. Users wishing to re-enable device assignment
when interrupt remapping is not supported on the platform can
use the "allow_unsafe_assigned_interrupts=1" module option.
[avi: break long lines]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The idea is from Avi:
| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)
When the page fault is caused by mmio, we cache the info in the shadow page
table, and also set the reserved bits in the shadow page table, so if the mmio
is caused again, we can quickly identify it and emulate it directly
Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
can be reduced by this feature, and also avoid walking guest page table for
soft mmu.
[jan: fix operator precedence issue]
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the page fault is caused by mmio, the gfn can not be found in memslots, and
'bad_pfn' is returned on gfn_to_hva path, so we can use 'bad_pfn' to identify
the mmio page fault.
And, to clarify the meaning of mmio pfn, we return fault page instead of bad
page when the gfn is not allowd to prefetch
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce kvm_read_guest_cached() function in addition to write one we
already have.
[ by glauber: export function signature in kvm header ]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Eric Munson <emunson@mgebm.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM_MAX_MSIX_PER_DEV implies that up to that many MSI-X entries can be
requested. But the kernel so far rejected already the upper limit.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM has an ioctl to define which signal mask should be used while running
inside VCPU_RUN. At least for big endian systems, this mask is different
on 32-bit and 64-bit systems (though the size is identical).
Add a compat wrapper that converts the mask to whatever the kernel accepts,
allowing 32-bit kvm user space to set signal masks.
This patch fixes qemu with --enable-io-thread on ppc64 hosts when running
32-bit user land.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if
it fails. Move this confusing resonsibility back into the hands of
kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected,
all other archs cannot fail.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Simply use __copy_to_user/__clear_user to write guest page since we have
already verified the user address when the memslot is set
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* 'kvm-updates/3.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Initialize kvm before registering the mmu notifier
KVM: x86: use proper port value when checking io instruction permission
KVM: add missing void __user * cast to access_ok() call
It doesn't make sense to ever see a half-initialized kvm structure on
mmu notifier callbacks. Previously, 85722cda changed the ordering to
ensure that the mmu_lock was initialized before mmu notifier
registration, but there is still a race where the mmu notifier could
come in and try accessing other portions of struct kvm before they are
intialized.
Solve this by moving the mmu notifier registration to occur after the
structure is completely initialized.
Google-Bug-Id: 452199
Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
fa3d315a "KVM: Validate userspace_addr of memslot when registered" introduced
this new warning onn s390:
kvm_main.c: In function '__kvm_set_memory_region':
kvm_main.c:654:7: warning: passing argument 1 of '__access_ok' makes pointer from integer without a cast
arch/s390/include/asm/uaccess.h:53:19: note: expected 'const void *' but argument is of type '__u64'
Add the missing cast to get rid of it again...
Cc: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (27 commits)
PCI: Don't use dmi_name_in_vendors in quirk
PCI: remove unused AER functions
PCI/sysfs: move bus cpuaffinity to class dev_attrs
PCI: add rescan to /sys/.../pci_bus/.../
PCI: update bridge resources to get more big ranges when allocating space (again)
KVM: Use pci_store/load_saved_state() around VM device usage
PCI: Add interfaces to store and load the device saved state
PCI: Track the size of each saved capability data area
PCI/e1000e: Add and use pci_disable_link_state_locked()
x86/PCI: derive pcibios_last_bus from ACPI MCFG
PCI: add latency tolerance reporting enable/disable support
PCI: add OBFF enable/disable support
PCI: add ID-based ordering enable/disable support
PCI hotplug: acpiphp: assume device is in state D0 after powering on a slot.
PCI: Set PCIE maxpayload for card during hotplug insertion
PCI/ACPI: Report _OSC control mask returned on failure to get control
x86/PCI: irq and pci_ids patch for Intel Panther Point DeviceIDs
PCI: handle positive error codes
PCI: check pci_vpd_pci22_wait() return
PCI: Use ICH6_GPIO_EN in ich6_lpc_acpi_gpio
...
Fix up trivial conflicts in include/linux/pci_ids.h: commit a6e5e2be44
moved the intel SMBUS ID definitons to the i2c-i801.c driver.
This way, we can avoid checking the user space address many times when
we read the guest memory.
Although we can do the same for write if we check which slots are
writable, we do not care write now: reading the guest memory happens
more often than writing.
[avi: change VERIFY_READ to VERIFY_WRITE]
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Function ioapic_debug() in the ioapic_deliver() misnames
one filed by reference. This patch correct it.
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Store the device saved state so that we can reload the device back
to the original state when it's unassigned. This has the benefit
that the state survives across pci_reset_function() calls via
the PCI sysfs reset interface while the VM is using the device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
We can get memslot id from memslot->id directly
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If asynchronous hva_to_pfn() is requested call GUP with FOLL_NOWAIT to
avoid sleeping on IO. Check for hwpoison is done at the same time,
otherwise check_user_page_hwpoison() will call GUP again and will put
vcpu to sleep.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
irqfd in kvm used flush_work incorrectly: it assumed that work scheduled
previously can't run after flush_work, but since kvm uses a non-reentrant
workqueue (by means of schedule_work) we need flush_work_sync to get that
guarantee.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Jean-Philippe Menil <jean-philippe.menil@univ-nantes.fr>
Tested-by: Jean-Philippe Menil <jean-philippe.menil@univ-nantes.fr>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'syscore' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
Introduce ARCH_NO_SYSDEV_OPS config option (v2)
cpufreq: Use syscore_ops for boot CPU suspend/resume (v2)
KVM: Use syscore_ops instead of sysdev class and sysdev
PCI / Intel IOMMU: Use syscore_ops instead of sysdev class and sysdev
timekeeping: Use syscore_ops instead of sysdev class and sysdev
x86: Use syscore_ops instead of sysdev classes and sysdevs
As a preparation for removing ext2 non-atomic bit operations from
asm/bitops.h. This converts ext2 non-atomic bit operations to
little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
asm-generic/bitops/le.h is only intended to be included directly from
asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h
which implements generic ext2 or minix bit operations.
This stops including asm-generic/bitops/le.h directly and use ext2
non-atomic bit operations instead.
It seems odd to use ext2_set_bit() on kvm, but it will replaced with
__set_bit_le() after introducing little endian bit operations for all
architectures. This indirect step is necessary to maintain bisectability
for some architectures which have their own little-endian bit operations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KVM uses a sysdev class and a sysdev for executing kvm_suspend()
after interrupts have been turned off on the boot CPU (during system
suspend) and for executing kvm_resume() before turning on interrupts
on the boot CPU (during system resume). However, since both of these
functions ignore their arguments, the entire mechanism may be
replaced with a struct syscore_ops object which is simpler.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Avi Kivity <avi@redhat.com>
The RCU use in kvm_irqfd_deassign is tricky: we have rcu_assign_pointer
but no synchronize_rcu: synchronize_rcu is done by kvm_irq_routing_update
which we share a spinlock with.
Fix up a comment in an attempt to make this clearer.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to get
another VCPU in the same KVM guest to run sooner.
This seems to give a 10-15% speedup in certain workloads.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Keep track of which task is running a KVM vcpu. This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.
Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single "run" of the vcpu.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
is_hwpoison_address only checks whether the page table entry is
hwpoisoned, regardless the memory page mapped. While __get_user_pages
will check both.
QEMU will clear the poisoned page table entry (via unmap/map) to make
it possible to allocate a new memory page for the virtual address
across guest rebooting. But it is also possible that the underlying
memory page is kept poisoned even after the corresponding page table
entry is cleared, that is, a new memory page can not be allocated.
__get_user_pages can catch these situations.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Now, we have 'vcpu->mode' to judge whether need to send ipi to other
cpus, this way is very exact, so checking request bit is needless,
then we can drop the spinlock let it's collateral
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently we keep track of only two states: guest mode and host
mode. This patch adds an "exiting guest mode" state that tells
us that an IPI will happen soon, so unless we need to wait for the
IPI, we can avoid it completely.
Also
1: No need atomically to read/write ->mode in vcpu's thread
2: reorganize struct kvm_vcpu to make ->mode and ->requests
in the same cache line explicitly
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Get rid of this warning:
CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
arch/s390/kvm/../../../virt/kvm/kvm_main.c:596:12: warning: 'kvm_create_dirty_bitmap' defined but not used
The only caller of the function is within a !CONFIG_S390 section, so add the
same ifdef around kvm_create_dirty_bitmap() as well.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Instead, drop large mappings, which were the reason we dropped shadow.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cleanup some code with common compound_trans_head helper.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For GRU and EPT, we need gup-fast to set referenced bit too (this is why
it's correct to return 0 when shadow_access_mask is zero, it requires
gup-fast to set the referenced bit). qemu-kvm access already sets the
young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow
paging EPT minor fault we relay on gup-fast to signal the page is in
use...
We also need to check the young bits on the secondary pagetables for NPT
and not nested shadow mmu as the data may never get accessed again by the
primary pte.
Without this closer accuracy, we'd have to remove the heuristic that
avoids collapsing hugepages in hugepage virtual regions that have not even
a single subpage in use.
->test_young is full backwards compatible with GRU and other usages that
don't have young bits in pagetables set by the hardware and that should
nuke the secondary mmu mappings when ->clear_flush_young runs just like
EPT does.
Removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page completely isn't so bad either probably but
I thought it was worth it and this makes it reliable.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This should work for both hugetlbfs and transparent hugepages.
[akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since vmx blocks INIT signals, we disable virtualization extensions during
reboot. This leads to virtualization instructions faulting; we trap these
faults and spin while the reboot continues.
Unfortunately spinning on a non-preemptible kernel may block a task that
reboot depends on; this causes the reboot to hang.
Fix by skipping over the instruction and hoping for the best.
Signed-off-by: Avi Kivity <avi@redhat.com>
Quote from Avi:
| I don't think we need to flush immediately; set a "tlb dirty" bit somewhere
| that is cleareded when we flush the tlb. kvm_mmu_notifier_invalidate_page()
| can consult the bit and force a flush if set.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Store irq routing table pointer in the irqfd object,
and use that to inject MSI directly without bouncing out to
a kernel thread.
While we touch this structure, rearrange irqfd fields to make fastpath
better packed for better cache utilization.
This also adds some comments about locking rules and rcu usage in code.
Some notes on the design:
- Use pointer into the rt instead of copying an entry,
to make it possible to use rcu, thus side-stepping
locking complexities. We also save some memory this way.
- Old workqueue code is still used for level irqs.
I don't think we DTRT with level anyway, however,
it seems easier to keep the code around as
it has been thought through and debugged, and fix level later than
rip out and re-instate it later.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The naming convension of hardware_[dis|en]able family is little bit confusing
because only hardware_[dis|en]able_all are using _nolock suffix.
Renaming current hardware_[dis|en]able() to *_nolock() and using
hardware_[dis|en]able() as wrapper functions which take kvm_lock for them
reduces extra confusion.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In kvm_cpu_hotplug(), only CPU_STARTING case is protected by kvm_lock.
This patch adds missing protection for CPU_DYING case.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Any arch not supporting device assigment will also not build
assigned-dev.c. So testing for KVM_CAP_DEVICE_DEASSIGNMENT is pointless.
KVM_CAP_ASSIGN_DEV_IRQ is unconditinally set. Moreover, add a default
case for dispatching the ioctl.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The guest may change states that pci_reset_function does not touch. So
we better save/restore the assigned device across guest usage.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cosmetic change, but it helps to correlate IRQs with PCI devices.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This improves the IRQ forwarding for assigned devices: By using the
kernel's threaded IRQ scheme, we can get rid of the latency-prone work
queue and simplify the code in the same run.
Moreover, we no longer have to hold assigned_dev_lock while raising the
guest IRQ, which can be a lenghty operation as we may have to iterate
over all VCPUs. The lock is now only used for synchronizing masking vs.
unmasking of INTx-type IRQs, thus is renames to intx_lock.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When we deassign a guest IRQ, clear the potentially asserted guest line.
There might be no chance for the guest to do this, specifically if we
switch from INTx to MSI mode.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
IA64 support forces us to abstract the allocation of the kvm structure.
But instead of mixing this up with arch-specific initialization and
doing the same on destruction, split both steps. This allows to move
generic destruction calls into generic code.
It also fixes error clean-up on failures of kvm_create_vm for IA64.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu->async_pf.done
without holding vcpu->async_pf.lock, it will break if we are handling apfs
at this time.
Also use 'list_empty_careful()' instead of 'list_empty()'
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If it's no need to inject async #PF to PV guest we can handle
more completed apfs at one time, so we can retry guest #PF
as early as possible
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Fixes this:
CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_clear_guest_page':
arch/s390/kvm/../../../virt/kvm/kvm_main.c:1224:2: warning: passing argument 3 of 'kvm_write_guest_page' makes pointer from integer without a cast
arch/s390/kvm/../../../virt/kvm/kvm_main.c:1185:5: note: expected 'const void *' but argument is of type 'long unsigned int'
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently we are using vmalloc() for all dirty bitmaps even if
they are small enough, say less than K bytes.
We use kmalloc() if dirty bitmap size is less than or equal to
PAGE_SIZE so that we can avoid vmalloc area usage for VGA.
This will also make the logging start/stop faster.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently x86's kvm_vm_ioctl_get_dirty_log() needs to allocate a bitmap by
vmalloc() which will be used in the next logging and this has been causing
bad effect to VGA and live-migration: vmalloc() consumes extra systime,
triggers tlb flush, etc.
This patch resolves this issue by pre-allocating one more bitmap and switching
between two bitmaps during dirty logging.
Performance improvement:
I measured performance for the case of VGA update by trace-cmd.
The result was 1.5 times faster than the original one.
In the case of live migration, the improvement ratio depends on the workload
and the guest memory size. In general, the larger the memory size is the more
benefits we get.
Note:
This does not change other architectures's logic but the allocation size
becomes twice. This will increase the actual memory consumption only when
the new size changes the number of pages allocated by vmalloc().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This makes it easy to change the way of allocating/freeing dirty bitmaps.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
As suggested by Andrea, pass r/w error code to gup(), upgrading read fault
to writable if host pte allows it.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Send async page fault to a PV guest if it accesses swapped out memory.
Guest will choose another task to run upon receiving the fault.
Allow async page fault injection only when guest is in user mode since
otherwise guest may be in non-sleepable context and will not be able
to reschedule.
Vcpu will be halted if guest will fault on the same page again or if
vcpu executes kernel code.
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Keep track of memslots changes by keeping generation number in memslots
structure. Provide kvm_write_guest_cached() function that skips
gfn_to_hva() translation if memslots was not changed since previous
invocation.
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When page is swapped in it is mapped into guest memory only after guest
tries to access it again and generate another fault. To save this fault
we can map it immediately since we know that guest is going to access
the page. Do it only when tdp is enabled for now. Shadow paging case is
more complicated. CR[034] and EFER registers should be switched before
doing mapping and then switched back.
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If a guest accesses swapped out memory do not swap it in from vcpu thread
context. Schedule work to do swapping and put vcpu into halted state
instead.
Interrupts will still be delivered to the guest and if interrupt will
cause reschedule guest will continue to run another task.
[avi: remove call to get_user_pages_noio(), nacked by Linus; this
makes everything synchrnous again]
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (321 commits)
KVM: Drop CONFIG_DMAR dependency around kvm_iommu_map_pages
KVM: Fix signature of kvm_iommu_map_pages stub
KVM: MCE: Send SRAR SIGBUS directly
KVM: MCE: Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED
KVM: fix typo in copyright notice
KVM: Disable interrupts around get_kernel_ns()
KVM: MMU: Avoid sign extension in mmu_alloc_direct_roots() pae root address
KVM: MMU: move access code parsing to FNAME(walk_addr) function
KVM: MMU: audit: check whether have unsync sps after root sync
KVM: MMU: audit: introduce audit_printk to cleanup audit code
KVM: MMU: audit: unregister audit tracepoints before module unloaded
KVM: MMU: audit: fix vcpu's spte walking
KVM: MMU: set access bit for direct mapping
KVM: MMU: cleanup for error mask set while walk guest page table
KVM: MMU: update 'root_hpa' out of loop in PAE shadow path
KVM: x86 emulator: Eliminate compilation warning in x86_decode_insn()
KVM: x86: Fix constant type in kvm_get_time_scale
KVM: VMX: Add AX to list of registers clobbered by guest switch
KVM guest: Move a printk that's using the clock before it's ready
KVM: x86: TSC catchup mode
...
We also have to call kvm_iommu_map_pages for CONFIG_AMD_IOMMU. So drop
the dependency on Intel IOMMU, kvm_iommu_map_pages will be a nop anyway
if CONFIG_IOMMU_API is not defined.
KVM-Stable-Tag.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It doesn't really matter, but if we spin, we should spin in a more relaxed
manner. This way, if something goes wrong at least it won't contribute to
global warming.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There is a bugs in this function, we call gfn_to_pfn() and kvm_mmu_gva_to_gpa_read() in
atomic context(kvm_mmu_audit() is called under the spinlock(mmu_lock)'s protection).
This patch fix it by:
- introduce gfn_to_pfn_atomic instead of gfn_to_pfn
- get the mapping gfn from kvm_mmu_page_get_gfn()
And it adds 'notrap' ptes check in unsync/direct sps
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce this function to get consecutive gfn's pages, it can reduce
gup's overload, used by later patch
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Introduce hva_to_pfn_atomic(), it's the fast path and can used in atomic
context, the later patch will use it
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If there are active VCPUs which are marked as belonging to
a particular hardware CPU, request a clock sync for them when
enabling hardware; the TSC could be desynchronized on a newly
arriving CPU, and we need to recompute guests system time
relative to boot after a suspend event.
This covers both cases.
Note that it is acceptable to take the spinlock, as either
no other tasks will be running and no locks held (BSP after
resume), or other tasks will be guaranteed to drop the lock
relatively quickly (AP on CPU_STARTING).
Noting we now get clock synchronization requests for VCPUs
which are starting up (or restarting), it is tempting to
attempt to remove the arch/x86/kvm/x86.c CPU hot-notifiers
at this time, however it is not correct to do so; they are
required for systems with non-constant TSC as the frequency
may not be known immediately after the processor has started
until the cpufreq driver has had a chance to run and query
the chipset.
Updated: implement better locking semantics for hardware_enable
Removed the hack of dropping and retaking the lock by adding the
semantic that we always hold kvm_lock when hardware_enable is
called. The one place that doesn't need to worry about it is
resume, as resuming a frozen CPU, the spinlock won't be taken.
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
When we reboot, we disable vmx extensions or otherwise INIT gets blocked.
If a task on another cpu hits a vmx instruction, it will fault if vmx is
disabled. We trap that to avoid a nasty oops and spin until the reboot
completes.
Problem is, we sleep with interrupts disabled. This blocks smp_send_stop()
from running, and the reboot process halts.
Fix by enabling interrupts before spinning.
KVM-Stable-Tag.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
I think I see the following (theoretical) race:
During irqfd assign, we drop irqfds lock before we
schedule inject work. Therefore, deassign running
on another CPU could cause shutdown and flush to run
before inject, causing user after free in inject.
A simple fix it to schedule inject under the lock.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The CPU_STARTING callback was added upstream with the intention
of being used for KVM, specifically for the hardware enablement
that must be done before we can run in hardware virt. It had
bugs on the x86_64 architecture at the time, where it was called
after CPU_ONLINE. The arches have since merged and the bug is
gone.
It might be noted other features should probably start making
use of this callback; microcode updates in particular which
might be fixing important erratums would be best applied before
beginning to run user tasks.
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Devices register mask notifier using gsi, but irqchip knows about
irqchip/pin, so conversion from irqchip/pin to gsi should be done before
looking for mask notifier to call.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently if guest access address that belongs to memory slot but is not
backed up by page or page is read only KVM treats it like MMIO access.
Remove that capability. It was never part of the interface and should
not be relied upon.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
For 32bit machines where the physical address width is
larger than the virtual address width the frame number types
in KVM may overflow. Fix this by changing them to u64.
[sfr: fix build on 32-bit ppc]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch converts unnecessary divide and modulo operations
in the KVM large page related code into logical operations.
This allows to convert gfn_t to u64 while not breaking 32
bit builds.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
is_hwpoison_address accesses the page table, so the caller must hold
current->mm->mmap_sem in read mode. So fix its usage in hva_to_pfn of
kvm accordingly.
Comment is_hwpoison_address to remind other users.
Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
As advertised in feature-removal-schedule.txt. Equivalent support is provided
by overlapping memory regions.
Signed-off-by: Avi Kivity <avi@redhat.com>
Otherwise we might try to deliver a timer interrupt to a cpu that
can't possibly handle it.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When the user passed in a NULL mask pass this on from the ioctl
handler.
Found by gcc 4.6's new warnings.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The type of '*new.rmap' is not 'struct page *', fix it
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Now that all arch specific ioctls have centralized locking, it is easy to
move it to the central dispatcher.
Signed-off-by: Avi Kivity <avi@redhat.com>
All vcpu ioctls need to be locked, so instead of locking each one specifically
we lock at the generic dispatcher.
This patch only updates generic ioctls and leaves arch specific ioctls alone.
Signed-off-by: Avi Kivity <avi@redhat.com>
Remove this check in an effort to allow kvm guests to run without
root privileges. This capability check doesn't seem to add any
security since the device needs to have already been added via the
assign device ioctl and the io actually occurs through the pci
sysfs interface.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In common cases, guest SRAO MCE will cause corresponding poisoned page
be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
the MCE to guest OS.
But it is reported that if the poisoned page is accessed in guest
after unmapping and before MCE is relayed to guest OS, userspace will
be killed.
The reason is as follows. Because poisoned page has been un-mapped,
guest access will cause guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault can not get the poisoned page for fault
address, so kernel and user space MMIO processing is tried in turn. In
user MMIO processing, poisoned page is accessed again, then userspace
is killed by force_sig_info.
To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
and do not try kernel and user space MMIO processing for poisoned
page.
[xiao: fix warning introduced by avi]
Reported-by: Max Asbock <masbock@linux.vnet.ibm.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This is obviously a left-over from the the old interface taking the
size. Apparently a mostly harmless issue with the current iommu_unmap
implementation.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (269 commits)
KVM: x86: Add missing locking to arch specific vcpu ioctls
KVM: PPC: Add missing vcpu_load()/vcpu_put() in vcpu ioctls
KVM: MMU: Segregate shadow pages with different cr0.wp
KVM: x86: Check LMA bit before set_efer
KVM: Don't allow lmsw to clear cr0.pe
KVM: Add cpuid.txt file
KVM: x86: Tell the guest we'll warn it about tsc stability
x86, paravirt: don't compute pvclock adjustments if we trust the tsc
x86: KVM guest: Try using new kvm clock msrs
KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
KVM: x86: add new KVMCLOCK cpuid feature
KVM: x86: change msr numbers for kvmclock
x86, paravirt: Add a global synchronization point for pvclock
x86, paravirt: Enable pvclock flags in vcpu_time_info structure
KVM: x86: Inject #GP with the right rip on efer writes
KVM: SVM: Don't allow nested guest to VMMCALL into host
KVM: x86: Fix exception reinjection forced to true
KVM: Fix wallclock version writing race
KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots
KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
...
vmx and svm vcpus have different contents and therefore may have different
alignmment requirements. Let each specify its required alignment.
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/amd-iommu: Add amd_iommu=off command line option
iommu-api: Remove iommu_{un}map_range functions
x86/amd-iommu: Implement ->{un}map callbacks for iommu-api
x86/amd-iommu: Make amd_iommu_iova_to_phys aware of multiple page sizes
x86/amd-iommu: Make iommu_unmap_page and fetch_pte aware of page sizes
x86/amd-iommu: Make iommu_map_page and alloc_pte aware of page sizes
kvm: Change kvm_iommu_map_pages to map large pages
VT-d: Change {un}map_range functions to implement {un}map interface
iommu-api: Add ->{un}map callbacks to iommu_ops
iommu-api: Add iommu_map and iommu_unmap functions
iommu-api: Rename ->{un}map function pointers to ->{un}map_range
As Avi pointed out, testing bit part in mark_page_dirty() was important
in the days of shadow paging, but currently EPT and NPT has already become
common and the chance of faulting a page more that once per iteration is
small. So let's remove the test bit to avoid extra access.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
When CPU_UP_CANCELED, hardware_enable() has not been called at the CPU
which is going up because raw_notifier_call_chain(CPU_ONLINE)
has not been called for this cpu.
Drop the handling for CPU_UP_CANCELED.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch limits the number of pages per memory slot to make
us free from extra care about type issues.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_coalesced_mmio_init() keeps to hold the addresses of a coalesced
mmio ring page and dev even after it has freed them.
Also, if this function fails, though it might be rare, it seems to be
suggesting the system's serious state: so we'd better stop the works
following the kvm_creat_vm().
This patch clears these problems.
We move the coalesced mmio's initialization out of kvm_create_vm().
This seems to be natural because it includes a registration which
can be done only when vm is successfully created.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch change the errno of ioctl KVM_[UN]REGISTER_COALESCED_MMIO
from -EINVAL to -ENXIO if no coalesced mmio dev exists.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch does:
- no need call tracepoint_synchronize_unregister() when kvm module
is unloaded since ftrace can handle it
- cleanup ftrace's macro
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_set_irq is used from non sleepable contexes, so convert ioapic from
mutex to spinlock.
KVM-Stable-Tag.
Tested-by: Ralf Bonenkamp <ralf.bonenkamp@swyx.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Marcelo introduced gfn_to_hva_memslot() when he implemented
gfn_to_pfn_memslot(). Let's use this for gfn_to_hva() too.
Note: also remove parentheses next to return as checkpatch said to do.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Int is not long enough to store the size of a dirty bitmap.
This patch fixes this problem with the introduction of a wrapper
function to calculate the sizes of dirty bitmaps.
Note: in mark_page_dirty(), we have to consider the fact that
__set_bit() takes the offset as int, not long.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
This patch changes the implementation of of
kvm_iommu_map_pages to map the pages with the host page size
into the io virtual address space.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-By: Avi Kivity <avi@redhat.com>
wqh is unused, so we do not need to store it in irqfd anymore
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If we fail to init ioapic device or the fail to setup the default irq
routing, the device register by kvm_create_pic() and kvm_ioapic_init()
remain unregister. This patch fixed to do this.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm->arch.vioapic should be NULL in case of kvm_ioapic_init() failure
due to cannot register io dev.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch introduces a generic function to find out the
host page size for a given gfn. This function is needed by
the kvm iommu code. This patch also simplifies the x86
host_mapping_level function.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The commit 0953ca73 "KVM: Simplify coalesced mmio initialization"
allocate kvm_coalesced_mmio_ring in the kvm_coalesced_mmio_init(), but
didn't discard the original allocation...
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When the guest acknowledges an interrupt, it sends an EOI message to the local
apic, which broadcasts it to the ioapic. To handle the EOI, we need to take
the ioapic mutex.
On large guests, this causes a lot of contention on this mutex. Since large
guests usually don't route interrupts via the ioapic (they use msi instead),
this is completely unnecessary.
Avoid taking the mutex by introducing a handled_vectors bitmap. Before taking
the mutex, check if the ioapic was actually responsible for the acked vector.
If not, we can return early.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Use two steps for memslot deletion: mark the slot invalid (which stops
instantiation of new shadow pages for that slot, but allows destruction),
then instantiate the new empty slot.
Also simplifies kvm_handle_hva locking.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Which takes a memslot pointer instead of using kvm->memslots.
To be used by SRCU convertion later.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Have a pointer to an allocated region inside struct kvm.
[alex: fix ppc book 3s]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm didn't clear irqfd counter on deassign, as a result we could get a
spurious interrupt when irqfd is assigned back. this leads to poor
performance and, in theory, guest crash.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Otherwise memory beyond irq_states[16] might be accessed.
Noticed by Juan Quintela.
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Looks like repeatedly binding same fd to multiple gsi's with irqfd can
use up a ton of kernel memory for irqfd structures.
A simple fix is to allow each fd to only trigger one gsi: triggering a
storm of interrupts in guest is likely useless anyway, and we can do it
by binding a single gsi to many interrupts if we really want to.
Cc: stable@kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Acked-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_create_vm':
arch/s390/kvm/../../../virt/kvm/kvm_main.c:409: warning: label 'out_err' defined but not used
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
One possible order is:
KVM_CREATE_IRQCHIP ioctl(took kvm->lock) -> kvm_iobus_register_dev() ->
down_write(kvm->slots_lock).
The other one is in kvm_vm_ioctl_assign_device(), which take kvm->slots_lock
first, then kvm->lock.
Update the comment of lock order as well.
Observe it due to kernel locking debug warnings.
Cc: stable@kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It seems a couple places such as arch/ia64/kernel/perfmon.c and
drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile()
instead of a private pseudo-fs + alloc_file(), if only there were a way
to get a read-only file. So provide this by having anon_inode_getfile()
create a read-only file if we pass O_RDONLY in flags.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Usually userspace will freeze the guest so we can inspect it, but some
internal state is not available. Add extra data to internal error
reporting so we can expose it to the debugger. Extra data is specific
to the suberror.
Signed-off-by: Avi Kivity <avi@redhat.com>
With big endian userspace, we can't quite figure out if a pointer
is 32 bit (shifted >> 32) or 64 bit when we read a 64 bit pointer.
This is what happens with dirty logging. To get the pointer interpreted
correctly, we thus need Arnd's patch to implement a compat layer for
the ioctl:
A better way to do this is to add a separate compat_ioctl() method that
converts this for you.
Based on initial patch from Arnd Bergmann.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce kvm_vcpu_on_spin, to be used by VMX/SVM to yield processing
once the cpu detects pause-based looping.
Signed-off-by: "Zhai, Edwin" <edwin.zhai@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Stanse found 2 lock imbalances in kvm_request_irq_source_id and
kvm_free_irq_source_id. They omit to unlock kvm->irq_lock on fail paths.
Fix that by adding unlock labels at the end of the functions and jump
there from the fail paths.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
X86 CPUs need to have some magic happening to enable the virtualization
extensions on them. This magic can result in unpleasant results for
users, like blocking other VMMs from working (vmx) or using invalid TLB
entries (svm).
Currently KVM activates virtualization when the respective kernel module
is loaded. This blocks us from autoloading KVM modules without breaking
other VMMs.
To circumvent this problem at least a bit, this patch introduces on
demand activation of virtualization. This means, that instead
virtualization is enabled on creation of the first virtual machine
and disabled on destruction of the last one.
So using this, KVM can be easily autoloaded, while keeping other
hypervisors usable.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The only thing it protects now is interrupt injection into lapic and
this can work lockless. Even now with kvm->irq_lock in place access
to lapic is not entirely serialized since vcpu access doesn't take
kvm->irq_lock.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Maintain back mapping from irqchip/pin to gsi to speedup
interrupt acknowledgment notifications.
[avi: build fix on non-x86/ia64]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use gsi indexed array instead of scanning all entries on each interrupt
injection.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This removes assumptions that max GSIs is smaller than number of pins.
Sharing is tracked on pin level not GSI level.
[avi: no PIC on ia64]
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We currently use host endian long types to store information
in the dirty bitmap.
This works reasonably well on Little Endian targets, because the
u32 after the first contains the next 32 bits. On Big Endian this
breaks completely though, forcing us to be inventive here.
So Ben suggested to always use Little Endian, which looks reasonable.
We only have dirty bitmap implemented in Little Endian targets so far
and since PowerPC would be the first Big Endian platform, we can just
as well switch to Little Endian always with little effort without
breaking existing targets.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I'm seeing an oops condition when kvm-intel and kvm-amd are modprobe'd
during boot (say on an Intel system) and then rmmod'd:
# modprobe kvm-intel
kvm_init()
kvm_init_debug()
kvm_arch_init() <-- stores debugfs dentries internally
(success, etc)
# modprobe kvm-amd
kvm_init()
kvm_init_debug() <-- second initialization clobbers kvm's
internal pointers to dentries
kvm_arch_init()
kvm_exit_debug() <-- and frees them
# rmmod kvm-intel
kvm_exit()
kvm_exit_debug() <-- double free of debugfs files!
*BOOM*
If execution gets to the end of kvm_init(), then the calling module has been
established as the kvm provider. Move the debugfs initialization to the end of
the function, and remove the now-unnecessary call to kvm_exit_debug() from the
error path. That way we avoid trampling on the debugfs entries and freeing
them twice.
Cc: stable@kernel.org
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
this is needed for kvm if it want ksm to directly map pages into its
shadow page tables.
[marcelo: cast pfn assignment to u64]
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove open-coded zalloc_cpumask_var() and zalloc_cpumask_var_node().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This code is not executed before file has been initialized to the result of
calling eventfd_fget. This function returns an ERR_PTR value in an error
case instead of NULL. Thus the test that file is not NULL is always true.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@match exists@
expression x, E;
statement S1, S2;
@@
x = eventfd_fget(...)
... when != x = E
(
* if (x == NULL || ...) S1 else S2
|
* if (x == NULL && ...) S1 else S2
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Avi Kivity <avi@redhat.com>
This bug was introduced by b4a2f5e723.
Cc: stable@kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
spin_lock disables preemption, so we can simply read the current cpu.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Remove kvm_cpu_has_interrupt() and kvm_arch_interrupt_allowed() from
interface between general code and arch code. kvm_arch_vcpu_runnable()
checks for interrupts instead.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asychronously and
return as quickly as possible. All the sychronous infrastructure to ensure
proper data-dependencies are met in the normal IO case are just unecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and perhipheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:
qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt
these are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Today kvm_io_bus_regsiter_dev() returns void and will internally BUG_ON
if it fails. We want to create dynamic MMIO/PIO entries driven from
userspace later in the series, so we need to enhance the code to be more
robust with the following changes:
1) Add a return value to the registration function
2) Fix up all the callsites to check the return code, handle any
failures, and percolate the error up to the caller.
3) Add an unregister function that collapses holes in the array
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Add tracepoint in msi/ioapic/pic set_irq() functions,
in IPI sending and in the point where IRQ is placed into
apic's IRR.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Irqfd sets level for interrupt to 1 and then to 0.
For MSI, check level so that a single message is sent.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
There is a missing unlock on one fail path in ioapic_mmio_write,
fix that.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This changes bus accesses to use high-level kvm_io_bus_read/kvm_io_bus_write
functions. in_range now becomes unused so it is removed from device ops in
favor of read/write callbacks performing range checks internally.
This allows aliasing (mostly for in-kernel virtio), as well as better error
handling by making it possible to pass errors up to userspace.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use slots_lock to protect device list on the bus. slots_lock is already
taken for read everywhere, so we only need to take it for write when
registering devices. This is in preparation to removing in_range and
kvm->lock around it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
switch coalesced mmio slots_lock. slots_lock is already taken for read
everywhere, so we only need to take it for write when changing zones.
This is in preparation to removing in_range and kvm->lock around it.
[avi: fix build]
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
slots_lock is taken everywhere when device ops are called.
Document this as we will use this to rework locking for io.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Correct missing locking in a few places in x86's vm_ioctl handling path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Since
commit 854b5338196b1175706e99d63be43a4f8d8ab607
Author: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
KVM: s390: streamline memslot handling
s390 uses the values of the memslot instead of doing everything in the arch
ioctl handler of the KVM_SET_USER_MEMORY_REGION. Unfortunately we missed to
set the userspace_addr of our memslot due to our s390 ifdef in
__kvm_set_memory_region.
Old s390 userspace launchers did not notice, since they started the guest at
userspace address 0.
Because of CONFIG_DEFAULT_MMAP_MIN_ADDR we now put the guest at 1M userspace,
which does not work. This patch makes sure that new.userspace_addr is set
on s390.
This fix should go in quickly. Nevertheless, looking at the code we should
clean up that ifdef in the long term. Any kernel janitors?
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This allows use of the powerful ftrace infrastructure.
See Documentation/trace/ for usage information.
[avi, stephen: various build fixes]
[sheng: fix control register breakage]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Disable usage of 2M pages if VMX_EPT_2MB_PAGE_BIT (bit 16) is clear
in MSR_IA32_VMX_EPT_VPID_CAP and EPT is enabled.
[avi: s/largepages_disabled/largepages_enabled/ to avoid negative logic]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Archs are free to use vcpu_id as they see fit. For x86 it is used as
vcpu's apic id. New ioctl is added to configure boot vcpu id that was
assumed to be 0 till now.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Protect irq injection/acking data structures with a separate irq_lock
mutex. This fixes the following deadlock:
CPU A CPU B
kvm_vm_ioctl_deassign_dev_irq()
mutex_lock(&kvm->lock); worker_thread()
-> kvm_deassign_irq() -> kvm_assigned_dev_interrupt_work_handler()
-> deassign_host_irq() mutex_lock(&kvm->lock);
-> cancel_work_sync() [blocked]
[gleb: fix ia64 path]
Reported-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce irq_lock, and use to protect ioapic data structures.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Move coalesced_mmio locking to its own device, instead of relying on
kvm->lock.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of checking whether we'll wrap around, calculate how many entries
are available, and check whether we have enough (just one) for the pending
mmio.
By itself, this doesn't change anything, but it paves the way for making
this function lockless.
Signed-off-by: Avi Kivity <avi@redhat.com>
We modernize the io_device code so that we use container_of() instead of
dev->private, and move the vtable to a separate ops structure
(theoretically allows better caching for multiple instances of the same
ops structure)
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
We invoke kfree() on a data member instead of the structure. This works today
because the kvm_io_device is the first element of the private structure, but
this could change in the future, so lets clean this up.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Disable interrupt at interrupt handler and enable it when guest ack is for
the level triggered interrupt, to prevent reinjected interrupt. MSI/MSI-X don't
need it.
One possible problem is multiply same vector interrupt injected between irq
handler and scheduled work handler would be merged as one for MSI/MSI-X.
But AFAIK, the drivers handle it well.
The patch fixed the oplin card performance issue(MSI-X performance is half of
MSI/INTx).
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM provides a complete virtual system environment for guests, including
support for injecting interrupts modeled after the real exception/interrupt
facilities present on the native platform (such as the IDT on x86).
Virtual interrupts can come from a variety of sources (emulated devices,
pass-through devices, etc) but all must be injected to the guest via
the KVM infrastructure. This patch adds a new mechanism to inject a specific
interrupt to a guest using a decoupled eventfd mechnanism: Any legal signal
on the irqfd (using eventfd semantics from either userspace or kernel) will
translate into an injected interrupt in the guest at the next available
interrupt window.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The check for an edge is broken in current ioapic code. ioapic->irr is
cleared on each edge interrupt by ioapic_service() and this makes
old_irr != ioapic->irr condition in kvm_ioapic_set_irq() to be always
true. The patch fixes the code to properly recognise edge.
Some HW emulation calls set_irq() without level change. If each such
call is propagated to an OS it may confuse a device driver. This is the
case with keyboard device emulation and Windows XP x64 installer on SMP VM.
Each keystroke produce two interrupts (down/up) one interrupt is
submitted to CPU0 and another to CPU1. This confuses Windows somehow
and it ignores keystrokes.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_notify_acked_irq does not check irq type, so that it sometimes
interprets msi vector as irq. As a result, ack notifiers are not
called, which typially hangs the guest. The fix is to track and
check irq type.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
make_all_cpus_request contains a race condition which can
trigger false request completed status, as follows:
CPU0 CPU1
if (test_and_set_bit(req,&vcpu->requests))
.... if (test_and_set_bit(req,&vcpu->requests))
.. return
proceed to smp_call_function_many(wait=1)
Use a spinlock to serialize concurrent CPUs.
Cc: stable@kernel.org
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When slot is already allocated and being asked to be tracked we need
to break the large pages.
This code flush the mmu when someone ask a slot to start dirty bit
tracking.
Cc: stable@kernel.org
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If userspace specifies a memory slot that is larger than 8 petabytes, it
could overflow the largepages variable.
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
If a slots guest physical address and host virtual address unequal (mod
large page size), then we would erronously try to back guest large pages
with host large pages. Detect this misalignment and diable large page
support for the trouble slot.
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_handle_hva relies on mmu_lock protection to safely access
the memslot structures.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_assigned_dev_ack_irq is vulnerable to a race condition with the
interrupt handler function. It does:
if (dev->host_irq_disabled) {
enable_irq(dev->host_irq);
dev->host_irq_disabled = false;
}
If an interrupt triggers before the host->dev_irq_disabled assignment,
it will disable the interrupt and set dev->host_irq_disabled to true.
On return to kvm_assigned_dev_ack_irq, dev->host_irq_disabled is set to
false, and the next kvm_assigned_dev_ack_irq call will fail to reenable
it.
Other than that, having the interrupt handler and work handlers run in
parallel sounds like asking for trouble (could not spot any obvious
problem, but better not have to, its fragile).
CC: sheng.yang@intel.com
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Intel TXT(Trusted Execution Technology) required VMX off for all cpu to work
when system shutdown.
CC: Joseph Cihula <joseph.cihula@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Memory aliases with different memory type is a problem for guest. For the guest
without assigned device, the memory type of guest memory would always been the
same as host(WB); but for the assigned device, some part of memory may be used
as DMA and then set to uncacheable memory type(UC/WC), which would be a conflict of
host memory type then be a potential issue.
Snooping control can guarantee the cache correctness of memory go through the
DMA engine of VT-d.
[avi: fix build on ia64]
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_vcpu_block() unhalts vpu on an interrupt/timer without checking
if interrupt window is actually opened.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently timer events are processed before entering guest mode. Move it
to main vcpu event loop since timer events should be processed even while
vcpu is halted. Timer may cause interrupt/nmi to be injected and only then
vcpu will be unhalted.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
free_mmu_pages() should only undo what alloc_mmu_pages() does.
Free mmu pages from the generic VM destruction function, kvm_destroy_vm().
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
After discussion with Marcelo, we decided to rework device assignment framework
together. The old problems are kernel logic is unnecessary complex. So Marcelo
suggest to split it into a more elegant way:
1. Split host IRQ assign and guest IRQ assign. And userspace determine the
combination. Also discard msi2intx parameter, userspace can specific
KVM_DEV_IRQ_HOST_MSI | KVM_DEV_IRQ_GUEST_INTX in assigned_irq->flags to
enable MSI to INTx convertion.
2. Split assign IRQ and deassign IRQ. Import two new ioctls:
KVM_ASSIGN_DEV_IRQ and KVM_DEASSIGN_DEV_IRQ.
This patch also fixed the reversed _IOR vs _IOW in definition(by deprecated the
old interface).
[avi: replace homemade bitcount() by hweight_long()]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The new way does not require additional loop over vcpus to calculate
the one with lowest priority as one is chosen during delivery bitmap
construction.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Use kvm_apic_match_dest() in kvm_get_intr_delivery_bitmask() instead
of duplicating the same code. Use kvm_get_intr_delivery_bitmask() in
apic_send_ipi() to figure out ipi destination instead of reimplementing
the logic.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
ioapic_deliver() and kvm_set_msi() have code duplication. Move
the code into ioapic_deliver_entry() function and call it from
both places.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Get rid of ioapic_inj_irq() and ioapic_inj_nmi() functions.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Gleb fixed bitmap ops usage in kvm_ioapic_get_delivery_bitmask.
Sheng merged two functions, as well as fixed several issues in
kvm_get_intr_delivery_bitmask
1. deliver_bitmask is a bitmap rather than a unsigned long intereger.
2. Lowest priority target bitmap wrong calculated by mistake.
3. Prevent potential NULL reference.
4. Declaration in include/kvm_host.h caused powerpc compilation warning.
5. Add warning for guest broadcast interrupt with lowest priority delivery mode.
6. Removed duplicate bitmap clean up in caller of kvm_get_intr_delivery_bitmask.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch finally enable MSI-X.
What we need for MSI-X:
1. Intercept one page in MMIO region of device. So that we can get guest desired
MSI-X table and set up the real one. Now this have been done by guest, and
transfer to kernel using ioctl KVM_SET_MSIX_NR and KVM_SET_MSIX_ENTRY.
2. Information for incoming interrupt. Now one device can have more than one
interrupt, and they are all handled by one workqueue structure. So we need to
identify them. The previous patch enable gsi_msg_pending_bitmap get this done.
3. Mapping from host IRQ to guest gsi as well as guest gsi to real MSI/MSI-X
message address/data. We used same entry number for the host and guest here, so
that it's easy to find the correlated guest gsi.
What we lack for now:
1. The PCI spec said nothing can existed with MSI-X table in the same page of
MMIO region, except pending bits. The patch ignore pending bits as the first
step (so they are always 0 - no pending).
2. The PCI spec allowed to change MSI-X table dynamically. That means, the OS
can enable MSI-X, then mask one MSI-X entry, modify it, and unmask it. The patch
didn't support this, and Linux also don't work in this way.
3. The patch didn't implement MSI-X mask all and mask single entry. I would
implement the former in driver/pci/msi.c later. And for single entry, userspace
should have reposibility to handle it.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We have to handle more than one interrupt with one handler for MSI-X. Avi
suggested to use a flag to indicate the pending. So here is it.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce KVM_SET_MSIX_NR and KVM_SET_MSIX_ENTRY two ioctls.
This two ioctls are used by userspace to specific guest device MSI-X entry
number and correlate MSI-X entry with GSI during the initialization stage.
MSI-X should be well initialzed before enabling.
Don't support change MSI-X entry number for now.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It's also convenient when we extend KVM supported vcpu number in the future.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Would be used with bit ops, and would be easily extended if KVM_MAX_VCPUS is
increased.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Under CONFIG_MAXSMP, cpus_hardware_enabled is allocated from the heap and
not statically initialized. This causes a crash on reboot when kvm thinks
vmx is enabled on random nonexistent cpus and accesses nonexistent percpu
lists.
Fix by explicitly clearing the variable.
Cc: stable@kernel.org
Reported-and-tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
When checking for overlapping slots on registration of a new one, kvm
currently also considers zero-length (ie. deleted) slots and rejects
requests incorrectly. This finally denies user space from joining slots.
Fix the check by skipping deleted slots and advertise this via a
KVM_CAP_JOIN_MEMORY_REGIONS_WORKS.
Cc: stable@kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The large page initialization code concludes there are two large pages spanned
by a slot covering 1 (small) page starting at gfn 1. This is incorrect, and
also results in incorrect write_count initialization in some cases (base = 1,
npages = 513 for example).
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Impact: Make symbols static.
Fix this sparse warnings:
arch/x86/kvm/mmu.c:992:5: warning: symbol 'mmu_pages_add' was not declared. Should it be static?
arch/x86/kvm/mmu.c:1124:5: warning: symbol 'mmu_pages_next' was not declared. Should it be static?
arch/x86/kvm/mmu.c:1144:6: warning: symbol 'mmu_pages_clear_parents' was not declared. Should it be static?
arch/x86/kvm/x86.c:2037:5: warning: symbol 'kvm_read_guest_virt' was not declared. Should it be static?
arch/x86/kvm/x86.c:2067:5: warning: symbol 'kvm_write_guest_virt' was not declared. Should it be static?
virt/kvm/irq_comm.c:220:5: warning: symbol 'setup_routing_entry' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
Include the newly introduced msidef.h to solve the build issues.
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
only need to set assigned_dev_id for deassignment, use
match->flags to judge and deassign it.
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Return number of CPUs interrupt was successfully injected into or -1 if
none.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
IRQ injection status is either -1 (if there was no CPU found
that should except the interrupt because IRQ was masked or
ioapic was misconfigured or ...) or >= 0 in that case the
number indicates to how many CPUs interrupt was injected.
If the value is 0 it means that the interrupt was coalesced
and probably should be reinjected.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The function kvm_is_mmio_pfn is called before put_page is called on a
page by KVM. This is a problem when when this function is called on some
struct page which is part of a compund page. It does not test the
reserved flag of the compound page but of the struct page within the
compount page. This is a problem when KVM works with hugepages allocated
at boot time. These pages have the reserved bit set in all tail pages.
Only the flag in the compount head is cleared. KVM would not put such a
page which results in a memory leak.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Merge MSI userspace interface with IRQ routing table. Notice the API have been
changed, and using IRQ routing table would be the only interface kvm-userspace
supported.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
IRQ ack notifications assume an identity mapping between pin->gsi,
which might not be the case with, for example, HPET.
Translate before acking.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Currently KVM has a static routing from GSI numbers to interrupts (namely,
0-15 are mapped 1:1 to both PIC and IOAPIC, and 16:23 are mapped 1:1 to
the IOAPIC). This is insufficient for several reasons:
- HPET requires non 1:1 mapping for the timer interrupt
- MSIs need a new method to assign interrupt numbers and dispatch them
- ACPI APIC mode needs to be able to reassign the PCI LINK interrupts to the
ioapics
This patch implements an interrupt routing table (as a linked list, but this
can be easily changed) and a userspace interface to replace the table. The
routing table is initialized according to the current hardwired mapping.
Signed-off-by: Avi Kivity <avi@redhat.com>
Allow clients to request notifications when the guest masks or unmasks a
particular irq line. This complements irq ack notifications, as the guest
will not ack an irq line that is masked.
Currently implemented for the ioapic only.
Signed-off-by: Avi Kivity <avi@redhat.com>
MSI is always enabled by default for msi2intx=1. But if msi2intx=0, we
have to disable MSI if guest require to do so.
The patch also discard unnecessary msi2intx judgment if guest want to update
MSI state.
Notice KVM_DEV_IRQ_ASSIGN_MSI_ACTION is a mask which should cover all MSI
related operations, though we only got one for now.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This rips out the support for KVM_DEBUG_GUEST and introduces a new IOCTL
instead: KVM_SET_GUEST_DEBUG. The IOCTL payload consists of a generic
part, controlling the "main switch" and the single-step feature. The
arch specific part adds an x86 interface for intercepting both types of
debug exceptions separately and re-injecting them when the host was not
interested. Moveover, the foundation for guest debugging via debug
registers is layed.
To signal breakpoint events properly back to userland, an arch-specific
data block is now returned along KVM_EXIT_DEBUG. For x86, the arch block
contains the PC, the debug exception, and relevant debug registers to
tell debug events properly apart.
The availability of this new interface is signaled by
KVM_CAP_SET_GUEST_DEBUG. Empty stubs for not yet supported archs are
provided.
Note that both SVM and VTX are supported, but only the latter was tested
yet. Based on the experience with all those VTX corner case, I would be
fairly surprised if SVM will work out of the box.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm->slots_lock is outer to kvm->lock, so take slots_lock
in kvm_vm_ioctl_assign_device() before taking kvm->lock,
rather than taking it in kvm_iommu_map_memslots().
Cc: stable@kernel.org
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In the past, kvm_get_kvm() and kvm_put_kvm() was called in assigned device irq
handler and interrupt_work, in order to prevent cancel_work_sync() in
kvm_free_assigned_irq got a illegal state when waiting for interrupt_work done.
But it's tricky and still got two problems:
1. A bug ignored two conditions that cancel_work_sync() would return true result
in a additional kvm_put_kvm().
2. If interrupt type is MSI, we would got a window between cancel_work_sync()
and free_irq(), which interrupt would be injected again...
This patch discard the reference count used for irq handler and interrupt_work,
and ensure the legal state by moving the free function at the very beginning of
kvm_destroy_vm(). And the patch fix the second bug by disable irq before
cancel_work_sync(), which may result in nested disable of irq but OK for we are
going to free it.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_arch_sync_events is introduced to quiet down all other events may happen
contemporary with VM destroy process, like IRQ handler and work struct for
assigned device.
For kvm_arch_sync_events is called at the very beginning of kvm_destroy_vm(), so
the state of KVM here is legal and can provide a environment to quiet down other
events.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The destructor for huge pages uses the backing inode for adjusting
hugetlbfs accounting.
Hugepage mappings are destroyed by exit_mmap, after
mmu_notifier_release, so there are no notifications through
unmap_hugepage_range at this point.
The hugetlbfs inode can be freed with pages backed by it referenced
by the shadow. When the shadow releases its reference, the huge page
destructor will access a now freed inode.
Implement the release operation for kvm mmu notifiers to release page
refs before the hugetlbfs inode is gone.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Impact: file renamed
The code in the vtd.c file can be reused for other IOMMUs as well. So
rename it to make it clear that it handle more than VT-d.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In kvm_iommu_unmap_memslots(), assigned_dev_head is already empty.
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Support device deassignment, it can be used in device hotplug.
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
intel iommu APIs are updated, use the new APIs.
In addition, change kvm_iommu_map_guest() to just create the domain, let kvm_iommu_assign_device() assign device.
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
If an assigned device shares a guest irq with an emulated
device then we currently interpret an ack generated by the
emulated device as originating from the assigned device
leading to e.g. "Unbalanced enable for IRQ 4347" from the
enable_irq() in kvm_assigned_dev_ack_irq().
The fix is fairly simple - don't enable the physical device
irq unless it was previously disabled.
Of course, this can still lead to a situation where a
non-assigned device ACK can cause the physical device irq to
be reenabled before the device was serviced. However, being
level sensitive, the interrupt will merely be regenerated.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This changes cpus_hardware_enabled from a cpumask_t to a cpumask_var_t:
equivalent for CONFIG_CPUMASKS_OFFSTACK=n, otherwise dynamically allocated.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
We're getting rid on on-stack cpumasks for large NR_CPUS.
1) Use cpumask_var_t/alloc_cpumask_var.
2) smp_call_function_mask -> smp_call_function_many
3) cpus_clear, cpus_empty, cpu_set -> cpumask_clear, cpumask_empty,
cpumask_set_cpu.
This actually generates slightly smaller code than the old one with
CONFIG_CPUMASKS_OFFSTACK=n. (gcc knows that cpus cannot be NULL in
that case, where cpumask_var_t is cpumask_t[1]).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
Avi said:
> Wow, code duplication from Rusty. Things must be bad.
Something about glass houses comes to mind. But instead, a patch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
There is a race between a "close of the file descriptors" and module
unload in the kvm module.
You can easily trigger this problem by applying this debug patch:
>--- kvm.orig/virt/kvm/kvm_main.c
>+++ kvm/virt/kvm/kvm_main.c
>@@ -648,10 +648,14 @@ void kvm_free_physmem(struct kvm *kvm)
> kvm_free_physmem_slot(&kvm->memslots[i], NULL);
> }
>
>+#include <linux/delay.h>
> static void kvm_destroy_vm(struct kvm *kvm)
> {
> struct mm_struct *mm = kvm->mm;
>
>+ printk("off1\n");
>+ msleep(5000);
>+ printk("off2\n");
> spin_lock(&kvm_lock);
> list_del(&kvm->vm_list);
> spin_unlock(&kvm_lock);
and killing the userspace, followed by an rmmod.
The problem is that kvm_destroy_vm can run while the module count
is 0. That means, you can remove the module while kvm_destroy_vm
is running. But kvm_destroy_vm is part of the module text. This
causes a kerneloops. The race exists without the msleep but is much
harder to trigger.
This patch requires the fix for anon_inodes (anon_inodes: use fops->owner
for module refcount).
With this patch, we can set the owner of all anonymous KVM inodes file
operations. The VFS will then control the KVM module refcount as long as there
is an open file. kvm_destroy_vm will be called by the release function of the
last closed file - before the VFS drops the module refcount.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Right now, KVM does not remove a slot when we do a
register ioctl for size 0 (would be the expected behaviour).
Instead, we only mark it as empty, but keep all bitmaps
and allocated data structures present. It completely
nullifies our chances of reusing that same slot again
for mapping a different piece of memory.
In this patch, we destroy rmaps, and vfree() the
pointers that used to hold the dirty bitmap, rmap
and lpage_info structures.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Split out the logic corresponding to undoing assign_irq() and
clean it up a bit.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Make sure kvm_request_irq_source_id() never returns
KVM_USERSPACE_IRQ_SOURCE_ID.
Likewise, check that kvm_free_irq_source_id() never accepts
KVM_USERSPACE_IRQ_SOURCE_ID.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Set assigned_dev->irq_source_id to -1 so that we can avoid freeing
a source ID which we never allocated.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We never pass a NULL notifier pointer here, but we may well
pass a notifier struct which hasn't previously been
registered.
Guard against this by using hlist_del_init() which will
not do anything if the node hasn't been added to the list
and, when removing the node, will ensure that a subsequent
call to hlist_del_init() will be fine too.
Fixes an oops seen when an assigned device is freed before
and IRQ is assigned to it.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We will obviously never pass a NULL struct kvm_irq_ack_notifier* to
this functions. They are always embedded in the assigned device
structure, so the assertion add nothing.
The irqchip_in_kernel() assertion is very out of place - clearly
this little abstraction needs to know nothing about the upper
layer details.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Impact: make global function static
virt/kvm/kvm_main.c:85:6: warning: symbol 'kvm_rebooting' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
Add marker_synchronize_unregister() before module unloading.
This prevents possible trace calls into unloaded module text.
Signed-off-by: Wu Fengguang <wfg@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Now we use MSI as default one, and translate MSI to INTx when guest need
INTx rather than MSI. For legacy device, we provide support for non-sharing
host IRQ.
Provide a parameter msi2intx for this method. The value is true by default in
x86 architecture.
We can't guarantee this mode can work on every device, but for most of us
tested, it works. If your device encounter some trouble with this mode, you can
try set msi2intx modules parameter to 0. If the device is OK with msi2intx=0,
then please report it to KVM mailing list or me. We may prepare a blacklist for
the device that can't work in this mode.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We enable guest MSI and host MSI support in this patch. The userspace want to
enable MSI should set KVM_DEV_IRQ_ASSIGN_ENABLE_MSI in the assigned_irq's flag.
Function would return -ENOTTY if can't enable MSI, userspace shouldn't set MSI
Enable bit when KVM_ASSIGN_IRQ return -ENOTTY with
KVM_DEV_IRQ_ASSIGN_ENABLE_MSI.
Userspace can tell the support of MSI device from #ifdef KVM_CAP_DEVICE_MSI.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The function is used to dispatch MSI to lapic according to MSI message
address and message data.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It would be used for MSI in device assignment, for MSI dispatch.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>