IOMMU interrupt remapping support provides a further layer of
isolation for device assignment by preventing arbitrary interrupt
block DMA writes by a malicious guest from reaching the host. By
default, we should require that the platform provides interrupt
remapping support, with an opt-in mechanism for existing behavior.
Both AMD IOMMU and Intel VT-d2 hardware support interrupt
remapping, however we currently only have software support on
the Intel side. Users wishing to re-enable device assignment
when interrupt remapping is not supported on the platform can
use the "allow_unsafe_assigned_interrupts=1" module option.
[avi: break long lines]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The idea is from Avi:
| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)
When the page fault is caused by mmio, we cache the info in the shadow page
table, and also set the reserved bits in the shadow page table, so if the mmio
is caused again, we can quickly identify it and emulate it directly
Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
can be reduced by this feature, and also avoid walking guest page table for
soft mmu.
[jan: fix operator precedence issue]
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use rcu to protect shadow pages table to be freed, so we can safely walk it,
it should run fastly and is needed by mmio page fault
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Now, the spte is just from nonprsent to present or present to nonprsent, so
we can use some trick to set/clear spte non-atomicly as linux kernel does
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce some interfaces to modify spte as linux kernel does:
- mmu_spte_clear_track_bits, it set the spte from present to nonpresent, and
track the stat bits(accessed/dirty) of spte
- mmu_spte_clear_no_track, the same as mmu_spte_clear_track_bits except
tracking the stat bits
- mmu_spte_set, set spte from nonpresent to present
- mmu_spte_update, only update the stat bits
Now, it does not allowed to set spte from present to present, later, we can
drop the atomicly opration for X86_32 host, and it is the preparing work to
get spte on X86_32 host out of the mmu lock
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce handle_abnormal_pfn to handle fault pfn on page fault path,
introduce mmu_invalid_pfn to handle fault pfn on prefetch path
It is the preparing work for mmio page fault support
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the page fault is caused by mmio, the gfn can not be found in memslots, and
'bad_pfn' is returned on gfn_to_hva path, so we can use 'bad_pfn' to identify
the mmio page fault.
And, to clarify the meaning of mmio pfn, we return fault page instead of bad
page when the gfn is not allowd to prefetch
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The idea is from Avi:
| Maybe it's time to kill off bypass_guest_pf=1. It's not as effective as
| it used to be, since unsync pages always use shadow_trap_nonpresent_pte,
| and since we convert between the two nonpresent_ptes during sync and unsync.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Split kvm_mmu_free_page to kvm_mmu_isolate_page and
kvm_mmu_free_page
One is used to remove the page from cache under mmu lock and the other is
used to free page table out of mmu lock
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Move counting used shadow pages from commiting path to preparing path to
reduce tlb flush on some paths
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If 'pt_write' is true, we need to emulate the fault. And in later patch, we
need to emulate the fault even though it is not a pt_write event, so rename
it to better fit the meaning
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
gw->pte_access is the final access permission, since it is unified with
gw->pt_access when we walked guest page table:
FNAME(walk_addr_generic):
pte_access = pt_access & FNAME(gpte_access)(vcpu, pte, true);
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If dirty bit is not set, we can make the pte access read-only to avoid handing
dirty bit everywhere
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the page fault is caused by mmio, we can cache the mmio info, later, we do
not need to walk guest page table and quickly know it is a mmio fault while we
emulate the mmio instruction
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce vcpu_mmio_gva_to_gpa to translate the gva to gpa, we can use it
to cleanup the code between read emulation and write emulation
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Properly check the last mapping, and do not walk to the next level if last spte
is met
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements the kvm bits of the steal time infrastructure.
The most important part of it, is the steal time clock. It is an
continuous clock that shows the accumulated amount of steal time
since vcpu creation. It is supposed to survive cpu offlining/onlining.
[marcelo: fix build with CONFIG_KVM_GUEST=n]
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Eric B Munson <emunson@mgebm.net>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
CC: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regulator: Convert tps65023 to use regmap API
regmap: Add SPI bus support
regmap: Add I2C bus support
regmap: Add generic non-memory mapped register access API
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (77 commits)
[SCSI] fix crash in scsi_dispatch_cmd()
[SCSI] sr: check_events() ignore GET_EVENT when TUR says otherwise
[SCSI] bnx2i: Fixed kernel panic due to illegal usage of sc->request->cpu
[SCSI] bfa: Update the driver version to 3.0.2.1
[SCSI] bfa: Driver and BSG enhancements.
[SCSI] bfa: Added support to query PHY.
[SCSI] bfa: Added HBA diagnostics support.
[SCSI] bfa: Added support for flash configuration
[SCSI] bfa: Added support to obtain SFP info.
[SCSI] bfa: Added support for CEE info and stats query.
[SCSI] bfa: Extend BSG interface.
[SCSI] bfa: FCS bug fixes.
[SCSI] bfa: DMA memory allocation enhancement.
[SCSI] bfa: Brocade-1860 Fabric Adapter vHBA support.
[SCSI] bfa: Brocade-1860 Fabric Adapter PLL init fixes.
[SCSI] bfa: Added Fabric Assigned Address(FAA) support
[SCSI] bfa: IOC bug fixes.
[SCSI] bfa: Enable ASIC block configuration and query.
[SCSI] bnx2i: Updated copyright and bump version
[SCSI] bnx2i: Modified to skip CNIC registration if iSCSI is not supported
...
Fix up some trivial conflicts in:
- drivers/scsi/bnx2fc/{bnx2fc.h,bnx2fc_fcoe.c}:
Crazy broadcom version number conflicts
- drivers/target/tcm_fc/tfc_cmd.c
Just trivial cleanups done on adjacent lines
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6: (297 commits)
ALSA: asihpi - Replace with snd_ctl_boolean_mono_info()
ALSA: asihpi - HPI version 4.08
ALSA: asihpi - Add volume mute controls
ALSA: asihpi - Control name updates
ALSA: asihpi - Use size_t for sizeof result
ALSA: asihpi - Explicitly include mutex.h
ALSA: asihpi - Add new node and message defines
ALSA: asihpi - Make local function static
ALSA: asihpi - Fix minor typos and spelling
ALSA: asihpi - Remove unused structures, macros and functions
ALSA: asihpi - Remove spurious adapter index check
ALSA: asihpi - Revise snd_pcm_debug_name, get rid of DEBUG_NAME macro
ALSA: asihpi - DSP code loader API now independent of OS
ALSA: asihpi - Remove controlex structs and associated special data transfer code
ALSA: asihpi - Increase request and response buffer sizes
ALSA: asihpi - Give more meaningful name to hpi request message type
ALSA: usb-audio - Add quirk for Roland / BOSS BR-800
ALSA: hda - Remove a superfluous argument of via_auto_init_output()
ALSA: hda - Fix indep-HP path (de-)activation for VT1708* codecs
ALSA: hda - Add documentation for codec-specific mixer controls
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
net/9p: Fix the msize calculation.
fs/9p: add 9P2000.L unlinkat operation
fs/9p: add 9P2000.L renameat operation
fs/9p: Always ask new inode in create
fs/9p: Clean-up get_protocol_version() to use strcmp
fs/9p: Fix invalid mount options/args
fs/9p: When doing inode lookup compare qid details and inode mode bits.
fs/9p: Fid is not valid after a failed clunk.
net/9p: Remove structure not used in the code
VirtIO can transfer VIRTQUEUE_NUM of pages.
Fix the size of receive buffer packing onto VirtIO ring.
9p: clean up packet dump code
fs/9p: remove rename work around in 9p
net/9p: fix client code to fail more gracefully on protocol error
Refresh sysctl/kernel.txt. More specifically,
- drop stale index entries
- sync and sort index and entries
- reflow sticking out paragraphs to colwidth 72
- correct typos
- cleanup whitespace
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only the root cpuset has cpuset.memory_pressure_enabled flag,
but not the only one.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Must echo a task id to the cgroups' tasks file, but not to a directory.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-detect-hyper-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, hyper: Change hypervisor detection order
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32, fpu: Fix DNA exception during check_fpu()
* 'x86-kexec-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
kexec, x86: Fix incorrect jump back address if not preserving context
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, config: Introduce an INTEL_MID configuration
* 'x86-quirks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, quirks: Use pci_dev->revision
* 'x86-tsc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: tsc: Remove unneeded DMI-based blacklisting
* 'x86-smpboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, boot: Wait for boot cpu to show up if nr_cpus limit is about to hit
* 'timers-clocksource-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clocksource: apb: Share APB timer code with other platforms
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
um: Make rwsem.S depend on CONFIG_RWSEM_XCHGADD_ALGORITHM
* 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
debug: Make CONFIG_EXPERT select CONFIG_DEBUG_KERNEL to unhide debug options
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Remove unused CHECK_IRQ_PER_CPU()
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf tools, x86: Fix 32-bit compile on 64-bit system
msize represents the maximum PDU size that includes P9_IOHDRSZ.
Signed-off-by: Venkateswararao Jujjuri "<jvrao@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
unlinkat - Remove a directory entry
size[4] Tunlinkat tag[2] dirfid[4] name[s] flag[4]
size[4] Runlinkat tag[2]
older Tremove have the below request format
size[4] Tremove tag[2] fid[4]
The remove message is used to remove a directory entry either file or directory
The remove opreation is actually a directory opertation and should ideally have
dirfid, if not we cannot represent the fid on server with anything other than
name. We will have to derive the directory name from fid in the Tremove request.
NOTE: The operation doesn't clunk the unlink fid.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
renameat - change name of file or directory
size[4] Trenameat tag[2] olddirfid[4] oldname[s] newdirfid[4] newname[s]
size[4] Rrenameat tag[2]
older Trename have the below request format
size[4] Trename tag[2] fid[4] newdirfid[4] name[s]
The rename message is used to change the name of a file, possibly moving it
to a new directory. The rename opreation is actually a directory opertation
and should ideally have olddirfid, if not we cannot represent the fid on server
with anything other than name. We will have to derive the old directory name
from fid in the Trename request.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This make sure we don't end up reusing the unlinked inode object.
The ideal way is to use inode i_generation. But i_generation is
not available in userspace always.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Without this fix, if any invalid mount options/args are passed while mouting
the 9p fs, no error (-EINVAL) is returned and default arg value is assigned.
This fix returns -EINVAL when an invalid arguement is found while parsing
mount options.
Signed-off-by: Prem Karat <prem.karat@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This make sure we don't use wrong inode from the inode hash. The inode number
of the file deleted is reused by the next file system object created
and if we only use inode number for inode hash lookup we could end up
with wrong struct inode.
Also compare inode generation number. Not all Linux file system provide
st_gen in userspace. So it could be 0;
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
free the fid even in case of failed clunk.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Switch to generic kernel hexdump library and cleanup macros to
be more consistent with the way we do normal debug prints.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>