If #PF happens during delivery of an exception into L2 and L1 also do
not have the page mapped in its shadow page table then L0 needs to
generate vmexit to L2 with original event in IDT_VECTORING_INFO, but
current code combines both exception and generates #DF instead. Fix that
by providing nVMX specific function to handle page faults during page
table walk that handles this case correctly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All exceptions should be checked for intercept during delivery to L2,
but we check only #PF currently. Drop nested_run_pending while we are
at it since exception cannot be injected during vmentry anyway.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
[Renamed the nested_vmx_check_exception function. - Paolo]
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If an exception causes vmexit directly it should not be reported in
IDT_VECTORING_INFO during the exit. For that we need to be able to
distinguish between exception that is injected into nested VM and one that
is reinjected because its delivery failed. Fortunately we already have
mechanism to do so for nested SVM, so here we just use correct function
to requeue exceptions and make sure that reinjected exception is not
moved to IDT_VECTORING_INFO during vmexit emulation and not re-checked
for interception during delivery.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
EXIT_REASON_VMLAUNCH/EXIT_REASON_VMRESUME exit does not mean that nested
VM will actually run during next entry. Move setting nested_run_pending
closer to vmentry emulation code and move its clearing close to vmexit to
minimize amount of code that will erroneously run with nested_run_pending
set.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Interception of the SET CLOCK instruction is mandatory, so this patch
provides a simple handler for this instruction (by setting up the
"epoch" field in the sie_block).
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch provides a simple version for the mandatory TEST BLOCK
instruction interception, so that guests that use this instruction
do not crash anymore.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Added a separate helper function that translates guest real addresses
to guest absolute addresses by applying the prefix of the guest CPU.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We're not always interested in both registers that are specified
for an RRE instruction. So allow NULL as parameter, too, to indicate
that we do not need the corresponding value.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm->srcu lock has to be held while accessing the memory of
guests and during certain other actions. This patch now adds
the locks to the __vcpu_run function so that all affected code
is protected now (and additionally to the KVM_S390_STORE_STATUS
ioctl, which can be called out-of-band and needs a separate lock).
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Moved the do-while loop from kvm_arch_vcpu_ioctl_run into __vcpu_run
and the calling of kvm_handle_sie_intercept() into vcpu_post_run()
(so we can add the srcu locks in a proper way in the next patch).
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation for the following patch (which will change the indentation
of __vcpu_run quite a bit), this patch puts most of the code from __vcpu_run
into separate functions. The first function handles the code that runs
before the SIE instruction and the other one handles the code that runs
afterwards.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The need for SIE_INTERCEPT_RERUNVCPU has been removed long ago already,
with the following commit:
f7850c9288
[S390] remove kvm mmu reload on s390
Since the remainders are dead code, they are now removed by this patch.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Thanks Michael S Tsirkin for rewriting the description and suggestions.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Now that we provide EPT support, there is no reason to torture our
guests by hiding the relieving unrestricted guest mode feature. We just
need to relax CR0 checks for always-on bits as PE and PG can now be
switched off.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement and advertise VM_EXIT_SAVE_IA32_EFER. L0 traps EFER writes
unconditionally, so we always find the current L2 value in the
architectural state.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fiddling with CR3 for L2 is L1's job. It may set its own, different
identity map or simple leave it alone if unrestricted guest mode is
enabled. This also fixes reading back the current CR3 on L2 exits for
reporting it to L1.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_set_cr0 performs checks on the state transition that may prevent
loading L1's cr0. For now we rely on the hardware to catch invalid
states loaded by L1 into its VMCS. Still, consistency checks on the host
state part of the VMCS on guest entry will have to be improved later on.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'.done' is used to mark the completion of 'async_pf_execute()', but
'cancel_work_sync()' returns true when the work was canceled, so we
use it instead.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch documents the kvm->srcu lock (using the information from
a mail which has been posted by Marcelo Tosatti to the kvm mailing
list some months ago, see the following URL for details:
http://www.mail-archive.com/kvm@vger.kernel.org/msg90040.html )
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Here are a number of small staging tree and iio driver fixes. Nothing major,
just lots of little things.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlJAdRIACgkQMUfUDdst+ymS6QCfYTqZCbIvEbdSmkbaVFkMDsoE
J4gAoKeQAL9ltn1fh65XDKlQSwLJaML3
=USmd
-----END PGP SIGNATURE-----
Merge tag 'staging-3.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging fixes from Greg KH:
"Here are a number of small staging tree and iio driver fixes. Nothing
major, just lots of little things"
* tag 'staging-3.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (34 commits)
iio:buffer_cb: Add missing iio_buffer_init()
iio: Prevent race between IIO chardev opening and IIO device free
iio: fix: Keep a reference to the IIO device for open file descriptors
iio: Stop sampling when the device is removed
iio: Fix crash when scan_bytes is computed with active_scan_mask == NULL
iio: Fix mcp4725 dev-to-indio_dev conversion in suspend/resume
iio: Fix bma180 dev-to-indio_dev conversion in suspend/resume
iio: Fix tmp006 dev-to-indio_dev conversion in suspend/resume
iio: iio_device_add_event_sysfs() bugfix
staging: iio: ade7854-spi: Fix return value
staging:iio:hmc5843: Fix measurement conversion
iio: isl29018: Fix uninitialized value
staging:iio:dummy fix kfifo_buf kconfig dependency issue if kfifo modular and buffer enabled for built in dummy driver.
iio: at91: fix adc_clk overflow
staging: line6: add bounds check in snd_toneport_source_put()
Staging: comedi: Fix dependencies for drivers misclassified as PCI
staging: r8188eu: Adjust RX gain
staging: r8188eu: Fix smatch warning in core/rtw_ieee80211.
staging: r8188eu: Fix smatch error in core/rtw_mlme_ext.c
staging: r8188eu: Fix Smatch off-by-one warning in hal/rtl8188e_hal_init.c
...
Here are a number of small USB fixes for 3.12-rc2.
One is a revert of a EHCI change that isn't quite ready for 3.12. Others are
minor things, gadget fixes, Kconfig fixes, and some quirks and documentation
updates.
All have been in linux-next for a bit.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlJAapIACgkQMUfUDdst+yksHQCfeYJVtsWa5aG1OgLGmVC7HzGW
SNMAniUFk9Cg9AazfBNfURsfuucEmv6w
=kf93
-----END PGP SIGNATURE-----
Merge tag 'usb-3.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of small USB fixes for 3.12-rc2.
One is a revert of a EHCI change that isn't quite ready for 3.12.
Others are minor things, gadget fixes, Kconfig fixes, and some quirks
and documentation updates.
All have been in linux-next for a bit"
* tag 'usb-3.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: pl2303: distinguish between original and cloned HX chips
USB: Faraday fotg210: fix email addresses
USB: fix typo in usb serial simple driver Kconfig
Revert "USB: EHCI: support running URB giveback in tasklet context"
usb: s3c-hsotg: do not disconnect gadget when receiving ErlySusp intr
usb: s3c-hsotg: fix unregistration function
usb: gadget: f_mass_storage: reset endpoint driver data when disabled
usb: host: fsl-mph-dr-of: Staticize local symbols
usb: gadget: f_eem: Staticize eem_alloc
usb: gadget: f_ecm: Staticize ecm_alloc
usb: phy: omap-usb3: Fix return value
usb: dwc3: gadget: avoid memory leak when failing to allocate all eps
usb: dwc3: remove extcon dependency
usb: gadget: add '__ref' for rndis_config_register() and cdc_config_register()
usb: dwc3: pci: add support for BayTrail
usb: gadget: cdc2: fix conversion to new interface of f_ecm
usb: gadget: fix a bug and a WARN_ON in dummy-hcd
usb: gadget: mv_u3d_core: fix violation of locking discipline in mv_u3d_ep_disable()
Pull drm fixes from Dave Airlie:
- some small fixes for msm and exynos
- a regression revert affecting nouveau users with old userspace
- intel pageflip deadlock and gpu hang fixes, hsw modesetting hangs
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (22 commits)
Revert "drm: mark context support as a legacy subsystem"
drm/i915: Don't enable the cursor on a disable pipe
drm/i915: do not update cursor in crtc mode set
drm/exynos: fix return value check in lowlevel_buffer_allocate()
drm/exynos: Fix address space warnings in exynos_drm_fbdev.c
drm/exynos: Fix address space warning in exynos_drm_buf.c
drm/exynos: Remove redundant OF dependency
drm/msm: drop unnecessary set_need_resched()
drm/i915: kill set_need_resched
drm/msm: fix potential NULL pointer dereference
drm/i915/dvo: set crtc timings again for panel fixed modes
drm/i915/sdvo: Robustify the dtd<->drm_mode conversions
drm/msm: workaround for missing irq
drm/msm: return -EBUSY if bo still active
drm/msm: fix return value check in ERR_PTR()
drm/msm: fix cmdstream size check
drm/msm: hangcheck harder
drm/msm: handle read vs write fences
drm/i915/sdvo: Fully translate sync flags in the dtd->mode conversion
drm/i915: Use proper print format for debug prints
...
Pull block IO fixes from Jens Axboe:
"After merge window, no new stuff this time only a collection of neatly
confined and simple fixes"
* 'for-3.12/core' of git://git.kernel.dk/linux-block:
cfq: explicitly use 64bit divide operation for 64bit arguments
block: Add nr_bios to block_rq_remap tracepoint
If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up.
bio-integrity: Fix use of bs->bio_integrity_pool after free
blkcg: relocate root_blkg setting and clearing
block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
block: trace all devices plug operation
Pull btrfs fixes from Chris Mason:
"These are mostly bug fixes and a two small performance fixes. The
most important of the bunch are Josef's fix for a snapshotting
regression and Mark's update to fix compile problems on arm"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: create the uuid tree on remount rw
btrfs: change extent-same to copy entire argument struct
Btrfs: dir_inode_operations should use btrfs_update_time also
btrfs: Add btrfs: prefix to kernel log output
btrfs: refuse to remount read-write after abort
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0
Btrfs: don't leak transaction in btrfs_sync_file()
Btrfs: add the missing mutex unlock in write_all_supers()
Btrfs: iput inode on allocation failure
Btrfs: remove space_info->reservation_progress
Btrfs: kill delay_iput arg to the wait_ordered functions
Btrfs: fix worst case calculator for space usage
Revert "Btrfs: rework the overcommit logic to be based on the total size"
Btrfs: improve replacing nocow extents
Btrfs: drop dir i_size when adding new names on replay
Btrfs: replay dir_index items before other items
Btrfs: check roots last log commit when checking if an inode has been logged
Btrfs: actually log directory we are fsync()'ing
Btrfs: actually limit the size of delalloc range
Btrfs: allocate the free space by the existed max extent size when ENOSPC
...
'samples' is 64bit operant, but do_div() second parameter is 32.
do_div silently truncates high 32 bits and calculated result
is invalid.
In case if low 32bit of 'samples' are zeros then do_div() produces
kernel crash.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A series of wrong 'struct dev' assumptions in suspend/resume callbacks
following on from this issue being identified in a new driver review.
One to watch out for in future.
A number of driver specific fixes
1) at91 - fix a overflow in clock rate computation
2) dummy - Kconfig dependency issue
3) isl29018 - uninitialized value
4) hmc5843 - measurement conversion bug introduced by recent cleanup.
5) ade7854-spi - wrong return value.
Some IIO core fixes
1) Wrong value picked up for event code creation for a modified channel
2) A null dereference on failure to initialize a buffer after no buffer has
been in use, when using the available_scan_masks approach.
3) Sampling not stopped when a device is removed. Effects forced removal
such as hot unplugging.
4) Prevent device going away if a chrdev is still open in userspace.
5) Prevent race on chardev opening and device being freed.
6) Add a missing iio_buffer_init in the call back buffer.
These last few are the first part of a set from Lars-Peter Clausen who
has been taking a closer look at our removal paths and buffer handling
than anyone has for quite some time.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.21 (GNU/Linux)
iQIcBAABAgAGBQJSPbfaAAoJEFSFNJnE9BaIMjkP/2wItue2mPvaZbw53NcZZFhA
lbYUBzzZ0pg1L7zENx91HXsfv2loWcyDzf/PcJWpEuU9QxIm53mQQdRsv8BACcIM
NZOCNrBz28T0tSc2LzHVg/MWKeN2G/2n0AVgxxOtDpwLipeXhTp331qIqaM4JXex
JT+PiK2Nt2FQgmRJTtdwfplgSTi4+kazberlS+xtWNB891X8JjInO1/ABTTMtS6F
QmutbAjjButNMOGV7bfNaLkU+4IMIA0khzi745s9t2fS0JiQ6Xh9AUOtGjyXU0Dp
srShsM7gJWWNoBORTrQZydbLM3faPLdDDRIPutwo0G/0uVCLwGoekI60gjORHRwC
aPHvgDw+Dqe018sQxkCVWshNOi0KyIanvnN8wCKW81XZy9M2GYWbQYCb/Vw01cpT
FPUfElKeCVKCULLANE0SCUzGDpVjdiUwu956RDwmkgHEBXm7SgkVlTPEI22BmruF
Tvs5vF4dxQtbsB3sttGU8ulxBgupUvr3QLdark79bbmu3aKZ86TZ9Jd+NREnQGKe
6UXB/HaY8GQNe1xMukdhQ83/KtC8Au+r/xiVyjJxkVYrn0cOIMrAfQDNBLIPOziS
Qj5h6S4z2Yrl6h5Lak6eoEhzsdiB/3uFuLq6MNlEzfxbmRPmMO1Mm5/Hi57Jo2L/
78FDXqf4PaEKLEuXM322
=GQoN
-----END PGP SIGNATURE-----
Merge tag 'iio-fixes-for-3.12a' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus
Jonathan writes:
First round of IIO fixes for 3.12
A series of wrong 'struct dev' assumptions in suspend/resume callbacks
following on from this issue being identified in a new driver review.
One to watch out for in future.
A number of driver specific fixes
1) at91 - fix a overflow in clock rate computation
2) dummy - Kconfig dependency issue
3) isl29018 - uninitialized value
4) hmc5843 - measurement conversion bug introduced by recent cleanup.
5) ade7854-spi - wrong return value.
Some IIO core fixes
1) Wrong value picked up for event code creation for a modified channel
2) A null dereference on failure to initialize a buffer after no buffer has
been in use, when using the available_scan_masks approach.
3) Sampling not stopped when a device is removed. Effects forced removal
such as hot unplugging.
4) Prevent device going away if a chrdev is still open in userspace.
5) Prevent race on chardev opening and device being freed.
6) Add a missing iio_buffer_init in the call back buffer.
These last few are the first part of a set from Lars-Peter Clausen who
has been taking a closer look at our removal paths and buffer handling
than anyone has for quite some time.
- Fix a regression due to incorrect sharing of gss auth caches
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJSPev8AAoJEGcL54qWCgDy+2UP/3ZyJbL8qIpbgdOdlCIFYHZ7
+8Z87XH31i+9hzKejSnmTURdRgWl7f0ehEOMG4soPF/4vpc1Ji03Xo0Iunq6AR3R
0JiKSPHJR+j1IiTdQR+HL127+ymUEECsWKubm0ZYOgphxhGy2e95sbU3C7w3wRry
kIul7+oVdmp6bXUbKpGqRx3SiT5H0YunGO0dBD7SWHJP4cQIPVNKd1ErRF9EAMqc
MhOTdy04hoYbL4AdZH95MGW8/l6t+djO8DRwI89Gfw1g2ybqZjbU72Ur+FotU09H
dQmyFuoiyRazO+VEInfvngYdtZ3w3ZfBqxQdq7rEhbrDnH0tLi+e49VjUFeNavr+
c+xeC169hzggplARaeCtMQkvulV5ucI6pQJyVZiOqiIpiXJmwAlhZzkuYmqdfISx
uZy43dgD64APoOeDcGmvqCnPfhl2gtGfEO36hGZVru4sZ5YaomgqA9gFUtkNnP3S
YUQovg/g8zCZ4F35AHFvYaRmbBUbqjhzEIamX4gAotd71+zFevrX6R00oFWluhJD
8WUdp4GzfSlIE9Ns6Ef9jIEba8M/1l9Vwzaa3MpbdtaDRSBVNhuz7Tm+3K/KeE8e
QL8zqyQbgVihPQyBpMl1ukxnTwrTvEkVDPQtsP+5J9ArasYjU5kTgjVAzmeTgOdR
3YxnAh7LMTKAoyItXta9
=44+y
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfix from Trond Myklebust:
"Fix a regression due to incorrect sharing of gss auth caches"
* tag 'nfs-for-3.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
RPCSEC_GSS: fix crash on destroying gss auth
Adding the number of bios in a remapped request to 'block_rq_remap'
tracepoint.
Request remapper clones bios in a request to track the completion
status of each bio. So the number of bios can be useful information
for investigation.
Related discussions:
http://www.redhat.com/archives/dm-devel/2013-August/msg00084.htmlhttp://www.redhat.com/archives/dm-devel/2013-September/msg00024.html
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Users have been complaining of the uuid tree stuff warning that there is no uuid
root when trying to do snapshot operations. This is because if you mount -o ro
we will not create the uuid tree. But then if you mount -o rw,remount we will
still not create it and then any subsequent snapshot/subvol operations you try
to do will fail gloriously. Fix this by creating the uuid_root on remount rw if
it was not already there. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data
back to it's argument struct. Unfortunately, not all architectures provide
__put_user_unaligned(), so compiles break on them if btrfs is selected.
Instead, just copy the whole struct in / out at the start and end of
operations, respectively.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Commit 2bc5565286 (Btrfs: don't update atime on
RO subvolumes) ensures that the access time of an inode is not updated when
the inode lives in a read-only subvolume.
However, if a directory on a read-only subvolume is accessed, the atime is
updated. This results in a write operation to a read-only subvolume. I
believe that access times should never be updated on read-only subvolumes.
To reproduce:
# mkfs.btrfs -f /dev/dm-3
(...)
# mount /dev/dm-3 /mnt
# btrfs subvol create /mnt/sub
Create subvolume '/mnt/sub'
# mkdir /mnt/sub/dir
# echo "abc" > /mnt/sub/dir/file
# btrfs subvol snapshot -r /mnt/sub /mnt/rosnap
Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap'
# stat /mnt/rosnap/dir
File: `/mnt/rosnap/dir'
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 16h/22d Inode: 257 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-09-11 07:21:49.389157126 -0400
Modify: 2013-09-11 07:22:02.330156079 -0400
Change: 2013-09-11 07:22:02.330156079 -0400
# ls /mnt/rosnap/dir
file
# stat /mnt/rosnap/dir
File: `/mnt/rosnap/dir'
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 16h/22d Inode: 257 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-09-11 07:22:56.797151670 -0400
Modify: 2013-09-11 07:22:02.330156079 -0400
Change: 2013-09-11 07:22:02.330156079 -0400
Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The kernel log entries for device label %s and device fsid %pU
are missing the btrfs: prefix. Add those here.
Signed-off-by: Frank Holton <fholton@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It's still possible to flip the filesystem into RW mode after it's
remounted RO due to an abort. There are lots of places that check for
the superblock error bit and will not write data, but we should not let
the filesystem appear read-write.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default
subvolume by passing a subvolume id of 0.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns
a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would
return without ending/freeing the transaction.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The BUG() was replaced by btrfs_error() and return -EIO with the
patch "get rid of one BUG() in write_all_supers()", but the missing
mutex_unlock() was overlooked.
The 0-DAY kernel build service from Intel reported the missing
unlock which was found by the coccinelle tool:
fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We don't do the iput when we fail to allocate our delayed delalloc work in
__start_delalloc_inodes, fix this.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This isn't used for anything anymore, just remove it.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is a left over of how we used to wait for ordered extents, which was to
grab the inode and then run filemap flush on it. However if we have an ordered
extent then we already are holding a ref on the inode, and we just use
btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on
the inode to start work on the ordered extent. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Forever ago I made the worst case calculator say that we could potentially split
into 3 blocks for every level on the way down, which isn't right. If we split
we're only going to get two new blocks, the one we originally cow'ed and the new
one we're going to split. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This reverts commit 70afa3998c. It is causing
performance issues and wasn't actually correct. There were problems with the
way we flushed delalloc and that was the real cause of the early enospc.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Various people have hit a deadlock when running btrfs/011. This is because when
replacing nocow extents we will take the i_mutex to make sure nobody messes with
the file while we are replacing the extent. The problem is we are already
holding a transaction open, which is a locking inversion, so instead we need to
save these inodes we find and then process them outside of the transaction.
Further we can't just lock the inode and assume we are good to go. We need to
lock the extent range and then read back the extent cache for the inode to make
sure the extent really still points at the physical block we want. If it
doesn't we don't have to copy it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So if we have dir_index items in the log that means we also have the inode item
as well, which means that the inode's i_size is correct. However when we
process dir_index'es we call btrfs_add_link() which will increase the
directory's i_size for the new entry. To fix this we need to just set the dir
items i_size to 0, and then as we find dir_index items we adjust the i_size.
btrfs_add_link() will do it for new entries, and if the entry already exists we
can just add the name_len to the i_size ourselves. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a bug where his log would not replay because he was getting
-EEXIST back. This was because he had a file moved into a directory that was
logged. What happens is the file had a lower inode number, and so it is
processed first when replaying the log, and so we add the inode ref in for the
directory it was moved to. But then we process the directories DIR_INDEX item
and try to add the inode ref for that inode and it fails because we already
added it when we replayed the inode. To solve this problem we need to just
process any DIR_INDEX items we have in the log first so this all is taken care
of, and then we can replay the rest of the items. With this patch my reproducer
can remount the file system properly instead of erroring out. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Liu introduced a local copy of the last log commit for an inode to make sure we
actually log an inode even if a log commit has already taken place. In order to
make sure we didn't relog the same inode multiple times he set this local copy
to the current trans when we log the inode, because usually we log the inode and
then sync the log. The exception to this is during rename, we will relog an
inode if the name changed and it is already in the log. The problem with this
is then we go to sync the inode, and our check to see if the inode has already
been logged is tripped and we don't sync the log. To fix this we need to _also_
check against the roots last log commit, because it could be less than what is
in our local copy of the log commit. This fixes a bug where we rename a file
into a directory and then fsync the directory and then on remount the directory
is no longer there. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you just create a directory and then fsync that directory and then pull the
power plug you will come back up and the directory will not be there. That is
because we won't actually create directories if we've logged files inside of
them since they will be created on replay, but in this check we will set our
logged_trans of our current directory if it happens to be a directory, making us
think it doesn't need to be logged. Fix the logic to only do this to parent
directories. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So forever we have had this thing to limit the amount of delalloc pages we'll
setup to be written out to 128mb. This is because we have to lock all the pages
in this range, so anything above this gets a bit unweildly, and also without a
limit we'll happily allocate gigantic chunks of disk space. Turns out our check
for this wasn't quite right, we wouldn't actually limit the chunk we wanted to
write out, we'd just stop looking for more space after we went over the limit.
So if you do a giant 20gb dd on my box with lots of ram I could get 2gig
extents. This is fine normally, except when you go to relocate these extents
and we can't find enough space to relocate these moster extents, since we have
to be able to allocate exactly the same sized extent to move it around. So fix
this by actually enforcing the limit. With this patch I'm no longer seeing
giant 1.5gb extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
By the current code, if the requested size is very large, and all the extents
in the free space cache are small, we will waste lots of the cpu time to cut
the requested size in half and search the cache again and again until it gets
down to the size the allocator can return. In fact, we can know the max extent
size in the cache after the first search, so we needn't cut the size in half
repeatedly, and just use the max extent size directly. This way can save
lots of cpu time and make the performance grow up when there are only fragments
in the free space cache.
According to my test, if there are only 4KB free space extents in the fs,
and the total size of those extents are 256MB, we can reduce the execute
time of the following test from 5.4s to 1.4s.
dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync
Changelog v2 -> v3:
- fix the problem that we skip the block group with the space which is
less than we need.
Changelog v1 -> v2:
- address the problem that we return a wrong start position when searching
the free space in a bitmap.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>